Article

Performance Analysis and Comparison of Two Deep Learning Methods for Direction-of-Arrival Estimation with Observed Data

College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 261; https://doi.org/10.3390/electronics15020261
Submission received: 1 December 2025 / Revised: 31 December 2025 / Accepted: 5 January 2026 / Published: 7 January 2026

Abstract

Direction-of-arrival (DOA) estimation is fundamental in array signal processing, yet classical algorithms suffer from significant performance degradation under low signal-to-noise ratio (SNR) conditions and require computationally intensive eigenvalue decomposition. This study presents a systematic comparative analysis of two backbone networks, a convolutional neural network (CNN) and a long short-term memory (LSTM) network, for DOA estimation, addressing two critical research gaps: the lack of a mechanistic understanding of architecture-dependent performance under varying conditions and insufficient validation on real measured data. Both networks are trained on cross-spectral density matrices (CSDMs) computed from simulated uniform linear array (ULA) signals. Under baseline conditions (1° classification interval), both the CNN and LSTM methods reach an accuracy (ACC) above 98%, with errors occurring only in the end-fire direction and bounded by ±1° for CNN and ±2° for LSTM. Key findings reveal that LSTM maintains above 90% accuracy down to −20 dB SNR, demonstrating superior noise robustness, whereas CNN exhibits better angular resolution. Four performance boundaries are identified: optimal performance is achieved at half-wavelength element spacing; an SNR crossover occurs at −20 dB, below which accuracy drops sharply; a snapshot threshold of 32 marks the transition from snapshot-deficient to snapshot-sufficient conditions; and an array size of 8 is the turning point for the rate of performance variation. Comparative analysis against traditional methods demonstrates that the deep learning approaches achieve superior resolution, batch processing efficiency, and noise robustness. Critically, models trained exclusively on single-target simulated data successfully generalize to multi-target experimental data from the Shallow Water Array Performance (SWAP) program, recovering primary target trajectories without domain adaptation. These results provide concrete engineering guidelines for architecture selection and validate the sim-to-real generalization capability of CSDM-based deep learning approaches in underwater acoustic environments.

1. Introduction

DOA estimation is a fundamental problem in array signal processing with widespread applications in radar, sonar, wireless communications, and seismology. Over the past several decades, numerous classical algorithms have been developed to address this problem, including subspace-based methods such as Multiple Signal Classification (MUSIC) [1] and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) [2], the Capon beamformer [3], also known as minimum variance distortionless response (MVDR), and the maximum likelihood (ML) method [4].
Despite their theoretical elegance and widespread adoption, these classical algorithms suffer from three critical limitations. First, their performance degrades significantly under the low-SNR conditions frequently encountered in underwater acoustic environments, primarily because subspace separation becomes unreliable when the noise power approaches or exceeds the signal power [5]. Second, the computational burden associated with eigenvalue decomposition or matrix inversion scales cubically with array size, rendering real-time processing prohibitive for large-scale arrays [6]. Third, they are sensitive to model mismatch, such as mutual coupling among array elements, errors in the estimated number of sources, and signal coherence [7,8,9].
Deep learning (DL) methods have emerged as a promising alternative that can learn robust feature representations directly from data without explicit modeling of the array manifold or noise statistics [10,11]. The evolution of DL-based DOA estimation can be categorized into three main paradigms. Feedforward neural networks (FNNs), pioneered by Xiao et al. [12] and extended by Takeda and Komatani [13] and Ozanich et al. [14], demonstrated the feasibility of learning DOA mappings from covariance matrices, though they treat the spatial structure as flattened vectors. CNNs, employed by Liu et al. [15] and Li et al. [16], preserve the matrix structure of the CSDM and extract local spatial patterns through convolutional kernels. Recurrent neural networks (RNNs/LSTMs), employed by Wang et al. [17], Xiang et al. [18], and Xie and Wang [19], model sequential dependencies across array elements through memory mechanisms. Recent advances have also addressed practical challenges, including GNSS jamming scenarios [20], reconfigurable digital architectures for real-time angle estimation [21,22], underwater acoustic scenarios [14,23,24], and uncertainty quantification [25].
However, existing DL methods have not fully resolved the core challenges of traditional methods, and each paradigm carries inherent theoretical limitations. CNN-based methods assume spatial locality and translation invariance, extracting features through fixed-size kernels. While effective under high-SNR conditions, this local receptive field constraint limits their ability to model long-range dependencies when noise corrupts local spatial patterns [15,16]. Conversely, LSTM-based methods treat CSDM rows as sequential inputs, leveraging memory mechanisms to accumulate phase evolution information. While offering potential noise robustness, this sequential processing sacrifices explicit two-dimensional spatial correlation modeling and may compromise angular resolution [18].
Based on the above analysis, two critical research gaps can be identified. First, CNN and LSTM represent distinct feature extraction mechanisms, but existing studies have only evaluated them in isolation. Their condition-dependent performance variations and the underlying physical mechanisms have not been systematically characterized. This gap deprives researchers of principled guidance for architecture selection, making it essential to clarify the operational conditions under which each architecture outperforms the other and the mechanisms driving such differences. Second, the majority of DL-based DOA methods are validated exclusively through simulations, raising concerns about whether models trained on idealized signals can generalize to real underwater acoustic environments.
This paper addresses these gaps through a systematic comparative study that bridges algorithmic understanding with practical deployment considerations. The key contributions are as follows:
(1)
Systematic investigation of distinct feature extraction paradigms. Comparative analyses are conducted to elucidate how CNN’s local spatial pattern extraction and LSTM’s sequential memory integration yield quantifiably divergent performance characteristics across varying regimes of the SNR, array element count, element spacing, and snapshot number.
(2)
Four performance boundaries are identified. Optimal performance is achieved at half-wavelength element spacing; SNR crossover occurs at −20 dB, below which accuracy drops sharply; the snapshot threshold of 32 marks the transition from snapshot-deficient to snapshot-sufficient conditions; the array size of 8 is a turning point for the performance variation rate. Based on a multidimensional analysis encompassing the SNR, array configuration, computational efficiency, and measured data performance, specific recommendations for engineering applications are provided, thereby bridging the gap between algorithmic research and practical deployment.
(3)
Validation of sim-to-real generalization capability. On the SWAP experimental dataset, it is demonstrated that models trained exclusively on single-target simulated signals can generalize to multi-target real underwater environments without domain adaptation, thus providing empirical evidence for the cross-domain applicability of CSDM-based deep learning approaches.
The remainder of this paper is organized as follows. Section 2 introduces the model of the array received signals and neural network architecture of CNN and LSTM, as well as the simulation environment, the generation of simulation data, and the information of experimental data. In Section 3, the models’ performance on simulation data and experimental data is analyzed and compared with several traditional methods. Finally, conclusions are presented in Section 4.

2. Model, Data, and Methods

2.1. Signal Model

In this paper, only the ULA is discussed. As shown in Figure 1, the ULA is composed of M isotropic hydrophone elements arranged in a straight line with a uniform spacing d between adjacent elements. Assuming the sound received by the array is a far-field plane wave generated by a point source with frequency $\omega_0$, the received signal $\mathbf{y}(t)$ of the ULA at time t can be modeled as

$$\mathbf{y}(t) = \mathbf{a}(\varphi) s(t) + \mathbf{n}(t)$$

where $\mathbf{y}(t) = [y_1(t), y_2(t), \ldots, y_M(t)]^T$ is the array received vector and $\mathbf{n}(t) = [n_1(t), n_2(t), \ldots, n_M(t)]^T$ is the observed noise vector, modeled as additive white Gaussian noise (AWGN). $\mathbf{a}(\varphi) = [e^{j\omega_0 \tau_1(\varphi)}, e^{j\omega_0 \tau_2(\varphi)}, \ldots, e^{j\omega_0 \tau_M(\varphi)}]^T$ is the steering vector for incident angle $\varphi$, and $\tau_m(\varphi) = d(m-1)\cos\varphi / c$ $(m = 1, 2, \ldots, M)$ is the time delay between the m-th sensor and the reference sensor.
In underwater acoustic signal processing, the covariance matrix of the signal is also known as the CSDM. Assuming that the signal and noise are uncorrelated, the CSDM of the array observed signal can be written as

$$\mathbf{R} = E\big[\mathbf{y}(t)\mathbf{y}^H(t)\big] = \mathbf{A}\mathbf{R}_s\mathbf{A}^H + \sigma^2\mathbf{I}$$

where $\mathbf{A} = [\mathbf{a}(\varphi_1), \mathbf{a}(\varphi_2), \ldots, \mathbf{a}(\varphi_D)]$ is an $M \times D$ steering vector matrix (where D is the number of signal sources). $\mathbf{R}_s$ is the covariance matrix of the sound sources and can be expressed as

$$\mathbf{R}_s = E\big[\mathbf{s}(t)\mathbf{s}^H(t)\big]$$
The CSDM contains sufficient information to distinguish sound sources arriving from different directions and is often used as the input to a DL network.
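To make this setup concrete, the following NumPy sketch simulates a narrowband far-field ULA signal and forms its sample CSDM. The defaults M = 32, d = 1.5 m, 350 Hz center frequency, 1 kHz sampling, 4096 snapshots, and 20 dB SNR follow the baseline configuration described in Section 2.2.1; the unit-power complex exponential source and the sound speed c = 1500 m/s are our assumptions.

```python
import numpy as np

def simulate_ula_csdm(phi_deg, M=32, d=1.5, f0=350.0, fs=1000.0,
                      n_snapshots=4096, snr_db=20.0, c=1500.0):
    """Simulate a narrowband ULA signal from one far-field source and
    return the sample CSDM; parameter defaults follow the paper's
    baseline configuration (sound speed c is an assumption)."""
    phi = np.deg2rad(phi_deg)
    omega0 = 2 * np.pi * f0
    tau = d * np.arange(M) * np.cos(phi) / c          # delays vs. reference sensor
    a = np.exp(1j * omega0 * tau)                     # steering vector a(phi)
    t = np.arange(n_snapshots) / fs
    s = np.exp(1j * omega0 * t)                       # unit-power complex source
    x = np.outer(a, s)                                # clean array signal, M x N
    p_noise = 10 ** (-snr_db / 10)                    # per-sensor noise power
    n = np.sqrt(p_noise / 2) * (np.random.randn(M, n_snapshots)
                                + 1j * np.random.randn(M, n_snapshots))
    y = x + n
    return y @ y.conj().T / n_snapshots               # sample CSDM, M x M

R = simulate_ula_csdm(90.0)
print(R.shape)  # (32, 32)
```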

2.2. Data Information

2.2.1. The Simulation Data

For each DOA angle $\varphi_k$ (k = 1, 2, ..., 181), corresponding to angles from 0° to 180° with 1° resolution, we generate time-series signals using Equation (1) for all M sensors. Each simulation runs for 204.8 s at a sampling rate of 1000 Hz, that is, 204,800 sample points in total for each direction. The signals are then segmented using sliding windows (window length $N_w$: 4096 samples, 50% overlap), and the number of segments for each direction $N_s$ is 159. For the l-th segment at DOA angle $\varphi_k$, the sample covariance matrix is computed as

$$\mathbf{R}_k^{(l)} = \frac{1}{N_w} \sum_{n=1}^{N_w} \mathbf{y}_n^{(l)} \big(\mathbf{y}_n^{(l)}\big)^H$$

where $\mathbf{y}_n^{(l)} = [y_1(n), y_2(n), \ldots, y_M(n)]^T$ is the n-th snapshot in the l-th segment. This process generates a 4D tensor of covariance matrices with dimensions $M \times M \times 181 \times N_s$, where 181 is the number of DOA angles and $N_s$ is the number of segments per angle.
Since the covariance matrices are complex-valued, we decompose each matrix into its real and imaginary components to create a suitable input format for training:

$$\mathbf{R}_k^{(l)} = \mathrm{Re}\big(\mathbf{R}_k^{(l)}\big) + j \cdot \mathrm{Im}\big(\mathbf{R}_k^{(l)}\big)$$

The input tensor $\mathbf{X}$ is constructed as

$$\mathbf{X} \in \mathbb{R}^{M \times M \times 2 \times N_{\mathrm{total}}}$$

where the first channel contains Re(R), the second channel contains Im(R), and $N_{\mathrm{total}} = 181 \times N_s$ is the total number of samples. Each sample is associated with a corresponding DOA label $\varphi_k \in \{0°, 1°, \ldots, 180°\}$, which is encoded as a categorical variable for classification.
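The segmentation and channel-stacking pipeline can be sketched as follows. The helper name and loop structure are our own; only the window length, 50% overlap, and Re/Im stacking follow the text.

```python
import numpy as np

def csdm_dataset(y, n_w=4096, overlap=0.5):
    """Segment an M x T array time series with a sliding window and stack
    Re/Im of each segment's sample CSDM into a 2-channel training input,
    mirroring the tensor construction of Section 2.2.1."""
    M, T = y.shape
    hop = int(n_w * (1 - overlap))
    samples = []
    for start in range(0, T - n_w + 1, hop):
        seg = y[:, start:start + n_w]
        R = seg @ seg.conj().T / n_w               # sample covariance of segment
        samples.append(np.stack([R.real, R.imag]))  # 2 x M x M
    return np.stack(samples)                        # N_s x 2 x M x M
```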
The baseline configuration is chosen as M = 32 and d = 1.5 m, which matches the geometry of the SWAP array, and the training and testing signal SNR is 20 dB with 4096 snapshots and 1° search grid spacing. Unless stated otherwise, all results reported hereafter are based on models trained and tested under the baseline configuration.
It is worth noting that in the signal simulation, the imaginary part cannot be ignored: a complex-valued model must be used. Complex signals carry both amplitude and phase information simultaneously, which is necessary for analyzing the frequency and phase characteristics of the signal. In array signal processing tasks such as DOA estimation, differences in arrival time and phase across the array produce complex samples, which are used to construct the CSDM and thereby retain the spatial characteristics of the signal. As can be seen from Figure 2, the imaginary part of the CSDM is symmetric about only one diagonal and has opposite signs about the other diagonal, while the real part is symmetric about both diagonals. The real part of the CSDM for an incidence angle of 170° is identical to that for an incidence angle of 10°, which means that using only the real part cannot distinguish between these two angles, and the theoretical upper limit of the test ACC is 50%.
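This front-back ambiguity of the real part can be checked numerically: for the noise-free rank-one CSDM, the (p, q) entry is $e^{j\omega_0 d (p-q)\cos\varphi / c}$, whose real part is even in $\cos\varphi$, so $\varphi$ and $180° - \varphi$ produce identical real parts and sign-flipped imaginary parts. A minimal check, reusing the assumed baseline geometry:

```python
import numpy as np

def ideal_csdm(phi_deg, M=32, d=1.5, f0=350.0, c=1500.0):
    """Noise-free CSDM R = a(phi) a(phi)^H for a unit-power source."""
    tau = d * np.arange(M) * np.cos(np.deg2rad(phi_deg)) / c
    a = np.exp(1j * 2 * np.pi * f0 * tau)
    return np.outer(a, a.conj())

R10, R170 = ideal_csdm(10.0), ideal_csdm(170.0)
print(np.allclose(R10.real, R170.real))    # True: real parts are identical
print(np.allclose(R10.imag, -R170.imag))   # True: imaginary parts flip sign
```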

2.2.2. The Experimental Data

The observed data were recorded at a 1 kHz sampling rate by a 32-element bottom-mounted hydrophone array deployed off the Florida coast as part of the SWAP program. The Universal Transverse Mercator (UTM) coordinates of the array hydrophones and the signal data are provided in the dataset files, while information about the targets is unknown. As shown in Figure 3, the array is a ULA, but the element positions are offset from a perfectly straight line. The average spacing between the array elements is calculated to be 1.5 m. Beamforming results computed with both the original array shape and the corrected ULA show minimal differences in the resulting pattern diagrams; therefore, the ideal ULA can be used in place of the actual array shape.
Since the targets' directions of motion are unknown, reference target trajectories are extracted using the CBF algorithm; the result is shown in Figure 4.
The left and central parts of the graph feature two continuous bright lines, corresponding to two targets; the middle of the right half contains a pale blue straight line, corresponding to a third target; and to its right there is an inclined pale blue line, corresponding to a fourth target. As shown in Figure 5, the four targets are named from left to right as T1, T2, T3, and T4. T1 moves from a bearing of 92° to 0°, T2 from 125° to 93°, T3 holds a constant bearing of 132°, and T4 moves from 161° to 146°.

2.3. Neural Network Architecture

In this section, the CNN and LSTM architectures applied to the DOA estimation problem are presented in detail. The basics of these neural networks are omitted, since readers can find extensive descriptions in the general DL literature [26,27].

2.3.1. CNN-Based DOA Model

1.
Model Formulation
For the l-th convolutional layer, the output feature map is computed as

$$\mathbf{Y}^{(l)} = f\big(\mathbf{W}^{(l)} * \mathbf{X}^{(l-1)} + \mathbf{b}^{(l)}\big)$$

where $\mathbf{W}^{(l)}$ is the convolutional kernel, $*$ denotes the convolution operation, $\mathbf{X}^{(l-1)}$ is the input from the previous layer, $\mathbf{b}^{(l)}$ is the bias term, and $f(\cdot)$ is the activation function (ReLU):

$$\mathrm{ReLU}(z) = \max(0, z)$$

Fully connected layer:

$$\mathbf{Y}_{\mathrm{FC}} = \mathbf{W}_{\mathrm{FC}} \mathbf{X}_{\mathrm{FC}} + \mathbf{b}_{\mathrm{FC}}$$

Output layer (SoftMax):

$$P(\mathrm{class} = i \mid \mathbf{X}) = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$

Cross-entropy loss function:

$$L = -\sum_i y_i \log(\hat{y}_i)$$

where $y_i$ is the true label (one-hot encoded) and $\hat{y}_i$ is the predicted probability for class i.
2.
Network Configuration
As shown in Figure 6, this CNN is composed of 3 convolution layers and 1 fully connected layer. The input is a matrix of size 32 × 32 × 2, containing two channels corresponding to the real and imaginary parts, respectively. The activation function is ReLU, the optimizer is Adam [28] with an initial learning rate of 0.0001, the maximum number of training epochs is 20, and the batch size is 32. A cross-entropy loss function is used.
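A minimal PyTorch sketch of this configuration is given below. The 2-channel 32 × 32 input, three convolutional layers with 5 × 5 kernels (Section 2.3.4), ReLU, a 181-class fully connected output, Adam with a learning rate of 0.0001, and cross-entropy loss follow the paper; the per-layer channel counts (16/32/64) and padding are our assumptions, since the details of Figure 6 are not reproduced here.

```python
import torch
import torch.nn as nn

class DOACNN(nn.Module):
    """Sketch of the 3-conv + 1-FC classifier of Figure 6; channel counts
    are assumptions, kernel size and input/output shapes follow the text."""
    def __init__(self, n_classes=181):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 32 * 32, n_classes)

    def forward(self, x):                  # x: (batch, 2, 32, 32)
        z = self.features(x).flatten(1)
        return self.fc(z)                  # logits; SoftMax is applied in the loss

model = DOACNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # paper's setting
criterion = nn.CrossEntropyLoss()          # SoftMax + cross-entropy combined
```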

2.3.2. LSTM-Based DOA Model

LSTM networks are a variant of RNNs designed to capture long-term dependencies in sequential data. The LSTM architecture addresses the vanishing gradient problem through a gating mechanism that regulates information flow.
1.
Model Formulation
At each time step t, the LSTM cell computes
$$f_t = \sigma\big(\mathbf{W}_f [h_{t-1}, x_t] + \mathbf{b}_f\big)$$
$$i_t = \sigma\big(\mathbf{W}_i [h_{t-1}, x_t] + \mathbf{b}_i\big)$$
$$\tilde{C}_t = \tanh\big(\mathbf{W}_C [h_{t-1}, x_t] + \mathbf{b}_C\big)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma\big(\mathbf{W}_o [h_{t-1}, x_t] + \mathbf{b}_o\big)$$
$$h_t = o_t \odot \tanh(C_t)$$

where $\sigma$ denotes the sigmoid activation function, $\odot$ represents element-wise multiplication, $x_t$ is the input vector at time t, $h_t$ is the hidden state, $C_t$ is the cell state, $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, respectively, $\tilde{C}_t$ is the candidate cell state, and $\mathbf{W}$ and $\mathbf{b}$ denote weight matrices and bias vectors with corresponding subscripts.
For bidirectional processing, forward and backward LSTM layers are employed:

$$\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{forward}}\big(x_t, \overrightarrow{h}_{t-1}\big)$$
$$\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{backward}}\big(x_t, \overleftarrow{h}_{t-1}\big)$$
$$h_t = \big[\overrightarrow{h}_t; \overleftarrow{h}_t\big]$$

where $[\cdot\,;\cdot]$ denotes concatenation, and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the forward and backward hidden states, respectively.
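For concreteness, one step of the above gating can be transcribed directly into NumPy. Packing the four gate weight matrices into a single matrix W is an implementation convenience of this sketch, not something specified in the paper.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the gating equations above.
    W packs the four gate weight matrices, shape (4H, H + D); b has shape (4H,)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2 * H])          # input gate
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:4 * H])      # output gate
    c = f * c_prev + i * c_tilde     # cell state update
    h = o * np.tanh(c)               # hidden state
    return h, c
```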
2.
Network Configuration
The LSTM network architecture is depicted in Figure 7, with an input dimension of 32. When both the real and imaginary parts of the data are used, the imaginary part is appended to the real part, so each training sample has size 32 × 64 = 2048, and the total number of samples is 181 × 159 = 28,779. When only the real part is used, the imaginary part is discarded, so each training sample has size 32 × 32 = 1024, with the same total of 28,779 samples. A bidirectional LSTM layer with 100 hidden units is incorporated, and the output is taken as the final element of the sequence. The network concludes with a fully connected layer of 181 units, followed by a SoftMax layer and a classification layer for 181 distinct categories.
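A hedged PyTorch sketch of this configuration is shown below. It treats the 32 CSDM rows as the sequence axis with 64 features per step (real part concatenated with imaginary part), which is our reading of the input layout; the bidirectional LSTM with 100 hidden units, the use of the final sequence element, and the 181-unit fully connected output follow the text.

```python
import torch
import torch.nn as nn

class DOALSTM(nn.Module):
    """Sketch of the Figure 7 network: a bidirectional LSTM reads the 32
    CSDM rows as a sequence and the last time step feeds a 181-way
    fully connected classifier."""
    def __init__(self, n_features=64, hidden=100, n_classes=181):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # forward + backward states

    def forward(self, x):               # x: (batch, 32, n_features)
        out, _ = self.lstm(x)           # (batch, 32, 2 * hidden)
        return self.fc(out[:, -1, :])   # final sequence element -> logits
```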

2.3.3. Performance Metrics

Four metrics are used for performance evaluation: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), defined as
$$E_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=1}^{N} \big| y_i - \hat{y}_i \big|$$
$$E_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{y}_i \big)^2$$
$$E_{\mathrm{RMSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{y}_i \big)^2}$$
$$E_{\mathrm{MAPE}} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
where N is the number of test data samples, y is the true direction, and y ^ is the predicted direction.
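A direct NumPy transcription of these four metrics (angles in degrees) might look as follows; note that MAPE is undefined when the true bearing is 0°, an edge case the definition above leaves open.

```python
import numpy as np

def doa_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and MAPE for DOA predictions, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100.0  # undefined at y_true = 0
    return mae, mse, rmse, mape
```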

2.3.4. Hyperparameter Selection

In this section, hyperparameter experiments are used to determine the optimal CNN kernel size (5 × 5) and the optimal number of LSTM hidden units (100).
Kernel sizes ranging from 3 × 3 to 9 × 9 were evaluated in terms of ACC and RMSE, retraining and testing the CNN model with each kernel. The SNR of the training and testing data is 20 dB, consistent with the baseline configuration. As shown in Figure 8, the 5 × 5 kernel yields the highest ACC and the lowest RMSE.
As shown in Figure 9, LSTM networks with hidden state dimensions from 32 to 200 are tested. A hidden unit size of 100 stands out, with the lowest RMSE and comparable ACC.

3. Results

In this section, the experimental results are presented and analyzed. Section 3.1 examines the performance of the CNN (Section 3.1.1) and LSTM (Section 3.1.2) methods on simulation data, focusing on the effects of the SNR, element spacing, and the number of array elements. Section 3.2 evaluates the trained CNN (Section 3.2.1) and LSTM (Section 3.2.2) models using experimental SWAP data and presents the corresponding DOA estimation performance. Section 3.3 compares and discusses the results of the two methods with other traditional methods under various conditions.

3.1. The Results of Simulation Data

3.1.1. The Results of the CNN Method

The test ACC is 98.871% using CNN. To illustrate this more clearly, a scatterplot is drawn with the true values on the horizontal axis and the predicted values on the vertical axis; correct predictions fall on the 45° diagonal. Two points lie off the diagonal, namely (1, 2) and (181, 180), indicating an error within ±1° (Figure 10) that occurs only for predictions at the end-fire angles.
1.
SNR
The following simulation study investigates the impact of the SNR on the directional performance of the CNN algorithm.
A ULA configuration is used, and Gaussian white noise is added to the target signal to generate noisy signals with SNRs of −20, −15, −10, −5, 0, 5, 10, 15, and 20 dB (a minimal sketch of this noise injection is given after this paragraph). These noisy signals are then input into the well-trained CNN model. The changes in the CNN model's test ACC and four error metrics (RMSE, MAE, MSE, and MAPE) are shown in Figure 11. The four error metrics generally decrease as the SNR increases, while the ACC increases with the SNR. During the simulation, it was found that when the SNR drops below −15 dB, the ACC falls below 0.95 and decreases rapidly.
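Assuming the SNR is defined per sensor over the complex array data, this noise injection might look as follows; the helper name add_awgn is hypothetical.

```python
import numpy as np

def add_awgn(x, snr_db):
    """Add complex white Gaussian noise to array data x (M x N) so that
    the average per-sensor SNR equals snr_db."""
    p_signal = np.mean(np.abs(x) ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10)
    noise = np.sqrt(p_noise / 2) * (np.random.randn(*x.shape)
                                    + 1j * np.random.randn(*x.shape))
    return x + noise
```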
2.
Spacing of elements
The following simulation study investigates the impact of element spacing on the directional performance of the CNN algorithm. A ULA configuration is used, with element spacings of 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.05, 2.1, 2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.4, 2.6, 2.8, and 3.0 m. Since an abrupt change between 2.1 and 2.2 m was observed, denser test points were placed within this range to resolve the fine structure (Figure 12b). As shown in Figure 12a, when the spacing equals the half-wavelength (2.14 m), the ACC reaches its highest point. The four error metrics remain relatively stable when the spacing is smaller than 2.14 m but increase rapidly between 2.14 and 2.16 m, indicating that the model performance degrades severely.
3.
Number of elements
The following simulation studies the effect of the number of array elements on the performance of the CNN algorithm. A ULA configuration is used with 4, 6, 8, 16, 32, 64, and 128 elements; noiseless signals with a center frequency of 350 Hz are generated and input into the trained CNN model. The variations in the test ACC and the four error metrics (RMSE, MAE, MSE, and MAPE) with the number of elements are shown in Figure 13. The ACC increases with the number of elements, while the four error metrics decrease and gradually stabilize.
When the number of array elements is 4 (with the input SNR maintained at 20 dB), the DOA result is shown in Figure 14a. The scatterplot of true versus predicted angles exhibits a distinct stepped pattern, indicating that within certain angular ranges, different signal DOAs are predicted as the same value, which is essentially a reduction in angular resolution. Furthermore, as illustrated in Figure 14b, significant estimation errors frequently occur within the range of −30° to 30°, indicating degraded model performance under this condition.
This degradation can be explained by the relationship between the array aperture and the input feature representation. When the number of elements decreases, the size of the CSDM also decreases, the information carried by the CSDM becomes less precise, and CSDMs generated from adjacent angles become harder to distinguish. Therefore, for the same number of training iterations, when the number of elements is insufficient to distinguish all arrival directions, the ACC decreases as the element count decreases. Conversely, with a sufficient number of array elements, the CSDM possesses adequate dimensions and information richness to achieve high resolution. As demonstrated in Figure 14c,d for a 64-element array (with the input SNR maintained at 20 dB), the model achieves nearly error-free predictions.

3.1.2. The Results of the LSTM Method

After training started, the training loss decreased significantly, and after about 500 iterations, the test ACC began to increase markedly. Training was terminated after 10,933 iterations, with a final test ACC of 98.90%. The training results are shown in Figure 15a, with error points at (2, 1) and (180, 181). The test error, depicted in Figure 15b, lies within ±2°.
1.
SNR
Gaussian white noise is added to the target signal to generate noisy signals with SNRs of −25, −20, −15, −10, −5, 0, 5, 10, 15, and 20 dB. These noisy signals are then input into the well-trained LSTM model. The variations in the test ACC of the LSTM model and the four error metrics (RMSE, MAE, MSE, and MAPE) with the SNR are shown in Figure 16a; the densely populated pink region of Figure 16a is shown in more detail in Figure 16b.
The four error metrics generally decrease as the SNR increases, while the ACC increases with the SNR. When the SNR drops below −20 dB, the directional performance of the LSTM model deteriorates sharply.
2.
Spacing of elements
The following simulation explores the influence of element spacing on the performance of the LSTM algorithm. A ULA configuration is used, with element spacings of 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.05, 2.1, 2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.4, 2.6, 2.8, and 3.0 m. A noiseless signal with a center frequency of 350 Hz was generated and input into the trained LSTM model; the variations in the test ACC and the four error metrics (RMSE, MAE, MSE, and MAPE) with element spacing are shown in Figure 17.
As with the CNN model, the ACC peaks when the spacing is 2.13 m, close to the half-wavelength (2.14 m). The four error metrics remain relatively stable when the spacing is smaller than 2.14 m but increase rapidly between 2.14 and 2.16 m, indicating that the model performance degrades.
3.
Number of elements
The following simulation studies the effect of the number of array elements on the performance of the LSTM algorithm. A ULA configuration is used with 4, 6, 8, 16, 32, 64, and 128 elements; noiseless signals with a center frequency of 350 Hz are generated and input into the trained LSTM model. The variations in the test ACC and the four error metrics (RMSE, MAE, MSE, and MAPE) with the number of elements are shown in Figure 18.
As with the CNN model, the ACC increases with the number of elements, while the four error metrics decrease and gradually stabilize. When the number of array elements is 4, the DOA results are shown in Figure 19. The DOA performance is strongly affected near the end-fire direction, and different incidence angles within certain ranges are predicted as the same angle. Compared with CNN, the DOA resolution of LSTM is better when the number of elements is reduced.

3.2. The Results of Experimental Data

In this section, the SWAP data are applied separately to the trained CNN and LSTM classification models, resulting in satisfactory DOA performance. Since only one direction is predicted for each time segment, the methods described above are better suited to single-target scenarios, yet they still exhibit considerable applicability to multi-target tasks such as SWAP.
To quantitatively characterize such applicability, three evaluation metrics, track recovery rate (TRR), detection rate (DR), and mean bearing error (MBE) are proposed in this paper:
$$\mathrm{TRR} = \frac{1}{N_T} \sum_{t=1}^{N_T} \mathbb{1}\big(\big|\hat{\theta}_t - \theta_t^{\mathrm{GT}}\big| \le 5°\big) \times 100\%$$
$$\mathrm{DR} = \frac{1}{N_T} \sum_{t=1}^{N_T} \mathbb{1}\big(\big|\hat{\theta}_t - \theta_t^{\mathrm{GT}}\big| \le 10°\big) \times 100\%$$
$$\mathrm{MBE} = \frac{1}{N_{\mathrm{valid}}} \sum_{t \in \mathrm{valid}} \big|\hat{\theta}_t - \theta_t^{\mathrm{GT}}\big|$$
where N T denotes the total number of time segments obtained by partitioning the SWAP dataset with a sliding window of size 4096 and an overlapping ratio of 50%, θ ^ t represents the estimated angle output by the model for the t-th time segment, and θ t GT denotes the corresponding true angle. Time periods where the deviation between the predicted values and the true values falls within ±10° are defined as valid prediction time periods, and the total number of valid prediction time periods is N valid . The quantitative evaluation results can be seen in Table 1.
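A direct transcription of these three metrics is sketched below; restricting the MBE average to segments with errors within ±10°, consistent with the DR threshold, follows the definition above.

```python
import numpy as np

def track_metrics(theta_hat, theta_gt):
    """TRR and DR in percent, MBE in degrees over the valid segments."""
    err = np.abs(np.asarray(theta_hat, float) - np.asarray(theta_gt, float))
    trr = 100.0 * np.mean(err <= 5.0)    # within 5 degrees
    dr = 100.0 * np.mean(err <= 10.0)    # within 10 degrees
    valid = err <= 10.0                  # valid prediction time segments
    mbe = err[valid].mean() if valid.any() else np.nan
    return trr, dr, mbe
```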

3.2.1. The Results of the CNN Method

The target trajectories can generally be recovered, as illustrated in Figure 20, with the performance for Target T1 being particularly prominent at a TRR of 40.83%. For Target T3, the DR reaches 21.14% with a mean bearing error of 6.46°, exhibiting a stable angular offset relative to the ground-truth trajectory. Target T2 is detected with a detection rate of 26.05% but has a relatively large mean bearing error of 6.84°. Target T4 is only partially detected, with a track recovery rate of merely 8.54% and a detection rate of 12.80%.

3.2.2. The Results of the LSTM Method

As shown in Figure 21, the LSTM model is essentially able to recover the target trajectories, most notably for Target T1, with a 52.75% track recovery rate and a 1.50° mean bearing error, outperforming the CNN method on all three metrics. The Target T2 prediction is largely correct, with a 42.86% detection rate and a 5.00° mean bearing error, also outperforming the CNN method. However, for T3 and T4, the detection rates are only 2.86% and 6.71%, respectively, much lower than the CNN values of 21.14% and 12.80%.

3.3. Compare and Contrast

3.3.1. Performance Comparison of CNN and LSTM

As shown in Figure 22a, when the SNR is higher than −20 dB, the MAPE remains stable with slight fluctuations; however, when the SNR is lower than −20 dB, the MAPE increases sharply, meaning the DOA capability degrades as the SNR decreases. The red line lies above the black line when the SNR is larger than −20 dB, indicating that CNN performs better than LSTM with a smaller error. However, the LSTM method shows an ACC advantage below −20 dB: when the SNR is lower than −20 dB, the ACC of LSTM is higher than that of CNN, as shown in Figure 22b. A physical interpretation of these findings is as follows: under high-SNR conditions, local spatial patterns are clearly distinguishable, and the local feature extraction capability of CNN is fully effective; at low SNR, local patterns are submerged by noise while the statistical characteristics of global sequences remain identifiable, and the memory mechanism of LSTM acts as an integrator.
As shown in Figure 23a, CNN shows lower MAPE when the spacing of elements is smaller than half-wavelength, while the MAPE of LSTM is lower when the spacing is larger than half-wavelength. Figure 23b demonstrates that the prediction ACC of CNN is slightly higher than LSTM.
Figure 24a,b show that CNN performs better when the number of elements is at least 8, with higher ACC and smaller MAPE. When the number of elements is less than 8, LSTM maintains better performance, keeping the ACC close to 80% with only 4–8 array elements. An array size of 8 elements serves as a critical threshold: below 8 elements, the directional performance degrades drastically as the element count decreases; above 8, the rate of change in directional performance with additional elements diminishes significantly. It can be inferred that sufficient spatial sampling is indispensable to support the convolution operation, given that CNNs are more sensitive to the number of array elements. Conversely, the fact that LSTMs remain applicable even for miniaturized arrays suggests that sequence modeling imposes a lower requirement on spatial sampling density.

3.3.2. Performance Comparison of Other Algorithms

In this section, a systematic comparative analysis of DOA estimation performance was conducted between deep learning methods (CNN and LSTM) and conventional algorithms, CBF, MUSIC, MVDR, and ESPRIT, under a uniform array configuration (baseline configuration) and simulated signal conditions.
1.
Comparison of Algorithm Resolution
Figure 25 shows the spatial spectra obtained by the different methods for a simulated signal impinging from 90° with an SNR of 20 dB under the baseline configuration (1° search interval for both the deep learning models and the traditional methods). For the CNN and LSTM methods, the plotted curves correspond to the SoftMax probability spectra. In terms of main lobe width, the methods rank as CBF > MVDR > LSTM > MUSIC > CNN, indicating that the deep learning methods, especially CNN, show superior resolution compared with the traditional methods. A sketch of how the CBF and MVDR spectra are formed from the CSDM is given below.
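The following sketch computes the CBF and MVDR spatial spectra from a CSDM over the same 1° grid. The steering model matches Section 2.1; using a pseudo-inverse for numerical safety is our choice.

```python
import numpy as np

def spatial_spectra(R, M=32, d=1.5, f0=350.0, c=1500.0,
                    grid=np.arange(0.0, 180.1, 1.0)):
    """CBF and MVDR spatial spectra over a 1-degree bearing grid, for
    side-by-side comparison with the networks' SoftMax spectra."""
    omega0 = 2 * np.pi * f0
    Rinv = np.linalg.pinv(R)
    p_cbf, p_mvdr = [], []
    for phi in np.deg2rad(grid):
        a = np.exp(1j * omega0 * d * np.arange(M) * np.cos(phi) / c)
        p_cbf.append(np.real(a.conj() @ R @ a) / M ** 2)   # conventional beamformer
        p_mvdr.append(1.0 / np.real(a.conj() @ Rinv @ a))  # Capon / MVDR
    return grid, np.array(p_cbf), np.array(p_mvdr)
```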
2.
Comparison of ACC and RMSE under different SNRs
To compare the performance of CNN- and LSTM-based DOA estimation methods with traditional methods under varying SNRs, the ACC and RMSE of six methods (under the baseline configuration) across SNR levels of −25, −20, −15, −10, 0, 10, 15, and 20 dB are evaluated. Notably, for the conventional methods, the accuracy is defined as follows: a recognition result is considered correct if the angular deviation between the angle corresponding to the spectral peak and the target angle is within 0.1°.
As shown in Figure 26a, the ACC of all methods exhibits a clear upward trend with an increasing SNR. At low SNRs (e.g., −25 dB), the ACC values of traditional methods are relatively low, while the deep learning-based CNN and LSTM achieve significantly higher ACC. At high SNRs, all methods gradually approach 100% accuracy: CNN and LSTM yield slightly lower accuracy than traditional MUSIC, CBF, and MVDR but outperform the ESPRIT method. This indicates the strong noise robustness of deep learning models in DOA tasks.
Figure 26b further quantifies the error characteristics: the RMSE of all methods decreases sharply as the SNR increases. When the SNR is less than −20 dB, LSTM shows a relatively large RMSE; CNN achieves a smaller RMSE than ESPRIT and MUSIC, but a larger RMSE than MVDR and CBF. When the SNR is more than −20 dB, the RMSE of all methods drops rapidly to within 1.7°, and the differences in RMSE between methods tend to diminish as the SNR increases.
Figure 26a,b demonstrate that an SNR of −20 dB serves as a critical threshold: when the SNR is below −20 dB, the directional performance degrades drastically as the SNR decreases; conversely, when the SNR exceeds −20 dB, the rate of change in directional performance with increasing SNR diminishes significantly. Notably, the traditional methods largely follow the same trend.
In summary, CNN- and LSTM-based DOA methods balance noise robustness (superior at low SNRs) and acceptable high-SNR accuracy, addressing the limitations of traditional methods, which perform poorly in noisy environments.
3.
Comparison of ACC and RMSE under different snapshots
To compare the performance of CNN- and LSTM-based DOA estimation methods with traditional methods under varying snapshots, the ACC and RMSE of six methods across snapshots of 1, 2, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192 and SNRs of 20 dB are evaluated.
The choice of snapshot ranges to be tested is based on signal processing theory and underwater acoustic experience. From a signal processing perspective, these values are powers of two, ensuring computational efficiency in FFT-based processing. Additionally, for reliable covariance matrix estimation, the number of snapshots should sufficiently exceed the number of array elements. However, since single-snapshot DOA estimation has also attracted considerable research attention [23], this study includes scenarios where the number of snapshots (1, 2, 32) is fewer than the number of array elements, as well as the transition from snapshot-deficient to snapshot-sufficient conditions. From an underwater acoustic standpoint, sonar systems commonly employ time windows exceeding 1 s, and the selected range encompasses data lengths frequently adopted in the literature, such as 4096 [29].
As shown in Figure 27, the accuracy of all methods increases and the RMSE decreases as the number of snapshots increases. Snapshots of 32 constitute a critical threshold: when the number of snapshots exceeds 32 (i.e., the number of array elements), the rate of improvement gradually diminishes, and the performance differences among methods become less pronounced with increasing snapshots, except that the ACC of LSTM remains notably lower than the other methods; when the number of snapshots is fewer than the array elements, the directional performance of both conventional and deep learning-based methods degrades drastically, and LSTM and CNN outperform ESPRIT but underperform compared to CBF, MVDR, and MUSIC.
4.
Comparison of Computational Complexity and Real-Time Performance
To evaluate the real-time processing capability of each algorithm, this paper statistically analyzes the inference time and time distribution of different methods on the same hardware platform. Our experimental hardware platform consists of a workstation with an Intel Core i9-12900H CPU running at 2500 MHz and an NVIDIA GeForce RTX 3070Ti Laptop GPU with 8.6 GB of memory.
As shown in Figure 28, when processing individual samples (batch size = 1), CNN and LSTM exhibit higher computational times of approximately 1.5 ms and 2.5 ms per sample, respectively, compared to the conventional methods. This computational overhead can be attributed to the inherent architectural complexity of neural networks, which involve numerous matrix multiplications, nonlinear activation functions, and multi-layer forward propagation operations. In contrast, traditional subspace-based methods such as MUSIC (1.326 ms) and signal processing approaches like MVDR (0.480 ms) and ESPRIT (0.087 ms) demonstrate relatively stable per-sample processing times. However, a significant advantage of deep learning approaches emerges with increasing batch sizes. As the batch size increases from 1 to 128, the per-sample inference time for both CNN and LSTM decreases dramatically, ultimately falling below 0.1 ms per sample. This improvement results from the highly parallelizable nature of neural network computations, which can be efficiently accelerated through GPU-based matrix operations. At batch size 128, both CNN and LSTM achieve superior computational efficiency compared to all conventional methods except ESPRIT. Notably, the conventional algorithms, being fundamentally designed for sequential sample-by-sample processing, cannot leverage batch parallelization to improve throughput. This architectural limitation renders them less suitable for high-throughput applications where multiple signals require simultaneous processing.
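A rough timing harness in this spirit is sketched below (PyTorch, reusing the hypothetical DOACNN from Section 2.3.1); the warm-up count, repeat count, and synchronization details are our assumptions, since the paper does not describe its exact timing methodology.

```python
import time
import torch

def per_sample_latency(model, batch_size, n_repeats=100, device="cpu"):
    """Estimate per-sample inference time (ms) at a given batch size,
    in the spirit of the Figure 28 measurement."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 2, 32, 32, device=device)
    with torch.no_grad():
        for _ in range(10):               # warm-up passes
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()      # flush pending GPU work
        t0 = time.perf_counter()
        for _ in range(n_repeats):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / (n_repeats * batch_size) * 1e3
```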
The box plot in Figure 29 further illustrates the distribution of inference times across all methods at batch size = 1, confirming the higher median values and greater variability for LSTM and CNN compared to traditional approaches. The LSTM network exhibits the highest median inference time along with the largest interquartile range (IQR), indicating substantial variability in computational latency. This variability stems from the sequential nature of recurrent computations and memory access patterns inherent to LSTM architectures. The CNN demonstrates a lower median value with a narrower IQR, reflecting more consistent computational behavior due to its feed-forward structure. In contrast, the conventional methods display notably compact distributions with minimal variance, as their algorithmic operations involve deterministic matrix decompositions and fixed computational pathways. The presence of outliers (indicated by red markers) across all methods suggests occasional computational fluctuations attributable to system-level factors such as memory allocation and CPU/GPU scheduling overhead.
In summary, while deep learning methods incur higher computational costs for single-sample inference, their capacity for parallel batch processing provides substantial advantages in scenarios demanding high-throughput real-time DOA estimation.
5.
Performance Comparison in Actual Underwater Acoustic Data
The proposed model is trained on narrowband signals. To suppress broadband interference, a band-pass filter is applied when forming the covariance matrix, and only the 350 Hz component is retained. All methods are evaluated using this same filtered covariance matrix.
Since CBF, MVDR, and MUSIC are spectrum-based approaches, their results are presented in terms of power spectra, as shown in Figure 30a–c. In contrast, ESPRIT, LSTM, and CNN directly output bearing estimates; their outputs are therefore visualized as time-bearing heatmaps in Figure 30d–f.
The LSTM method produces a target bearing evolution that is broadly consistent with that of MVDR and MUSIC. Compared with MVDR, LSTM exhibits a few isolated prediction errors at certain bearings, but it still tracks the overall trajectories of targets T1 and T2. However, similar to MVDR and MUSIC, LSTM does not clearly resolve targets T3 and T4. The CNN model, on the other hand, fails to reconstruct the full trajectory of target T2, but it does capture portions of the trajectories of targets T3 and T4. This suggests that the two networks have learned complementary features and that combining their outputs could potentially recover all target tracks. ESPRIT performs worst among the tested methods and reliably resolves only target T1.
In summary, deep learning methods based on CNNs or LSTMs have not only matched but also surpassed traditional approaches on real underwater acoustic data, demonstrating strong adaptability. CBF provides the most complete set of detected targets but suffers from relatively wide beams; LSTM and CNN each have distinct advantages, with LSTM achieving performance comparable to MVDR and MUSIC, while CNN successfully highlights the T3 and T4 targets that are not well captured by LSTM, MVDR, or MUSIC. ESPRIT shows the poorest overall performance in this scenario.

4. Conclusions

This study presented a systematic comparative analysis of CNN and LSTM architectures for DOA estimation in underwater acoustic applications, addressing the mechanistic differences between the local spatial feature extraction and sequential phase evolution paradigms and providing empirical evidence for sim-to-real generalization capability. The key findings are summarized as follows.
First, the two architectures exhibit complementary strengths: CNN achieves higher precision under favorable SNR conditions and sufficient array sizes owing to its ability to capture local spatial patterns, while LSTM demonstrates superior robustness to noise and small array sizes through its memory-based integration mechanism. Second, critical performance boundaries exist at the half-wavelength element spacing, the 8-element array size, the −20 dB SNR crossover, and the 32-snapshot minimum, providing actionable design guidelines. Third, systematic comparison with the CBF, MVDR, MUSIC, and ESPRIT algorithms demonstrates that the deep learning approaches achieve superior spatial resolution, better noise robustness, and higher computational efficiency through batch parallelization. Fourth, the successful validation on the SWAP experimental data demonstrates that models trained on single-target simulations can generalize to multi-target real-world scenarios without domain adaptation, recovering primary target trajectories for up to four targets. This result validates the practical applicability of CSDM-based deep learning approaches for underwater DOA estimation.
This study is subject to several limitations that define the scope of the conclusions. First, the proposed models are trained and validated primarily on narrowband signals at a fixed frequency; performance on wideband or frequency-varying scenarios remains to be investigated. Second, the networks are formulated as single-target classifiers with 1-degree angular resolution; explicit multi-target estimation with variable source numbers and finer resolution was not addressed.
Future work will explore multi-target joint estimation, extend the approach to wideband DOA estimation, and investigate hybrid architectures that combine CNN's spatial precision with LSTM's temporal robustness. Moreover, it will deepen the theoretical analysis, improve the physical interpretability of end-to-end learning, and adopt principled optimization strategies to reduce the reliance on empirical trial-and-error in network architecture design.

Author Contributions

Conceptualization, W.Z. and S.L.; methodology, W.Z.; formal analysis, S.L.; investigation, S.L.; resources, J.S. (Junqiang Song), J.S. (Jian Shi) and H.L.; data curation, Q.Y.; writing—original draft preparation, S.L.; writing—review and editing, S.L.; supervision, J.S. (Junqiang Song), J.S. (Jian Shi) and H.L. All authors have read and agreed to the published version of the manuscript. S.L. and W.Z. contributed equally to this work.

Funding

This research received no external funding.

Data Availability Statement

The reader can inquire about all related data from the first author (liushuo@nudt.edu.cn) and the corresponding author (zhangwen06@nudt.edu.cn). The authors would like to thank the NEAR-Lab of Portland State University (http://nearlab.ece.pdx.edu/, accessed on 23 February 2020; https://nearlab.ece.pdx.edu/projects, accessed on 4 January 2026), from which the related information on the SWAP data originally comes.

Acknowledgments

We are grateful to the reviewers for their meticulous and insightful review. Their thoughtful comments on the initial draft have helped us strengthen the rigor of our arguments and refine the structure and presentation of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schmidt, R.O. Multiple Emitter Location and Signal Parameter Estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280. [Google Scholar] [CrossRef]
  2. Roy, R.; Paulraj, A.; Kailath, T. ESPRIT—A Subspace Rotation Approach to Estimation of Parameters of Cisoids in Noise. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 1340–1342. [Google Scholar] [CrossRef]
  3. Capon, J. High-Resolution Frequency-Wavenumber Spectrum Analysis. Proc. IEEE 1969, 57, 1408–1418. [Google Scholar] [CrossRef]
  4. Stoica, P.; Nehorai, A. MUSIC, Maximum Likelihood, and Cramér-Rao Bound. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 720–741. [Google Scholar] [CrossRef]
  5. Van Trees, H.L. Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  6. Krim, H.; Viberg, M. Two Decades of Array Signal Processing Research: The Parametric Approach. IEEE Signal Process. Mag. 1996, 13, 67–94. [Google Scholar] [CrossRef]
  7. Friedlander, B.; Weiss, A.J. Direction Finding in the Presence of Mutual Coupling. IEEE Trans. Antennas Propag. 1991, 39, 273–284. [Google Scholar] [CrossRef]
  8. Wax, M.; Kailath, T. Detection of Signals by Information Theoretic Criteria. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 387–392. [Google Scholar] [CrossRef]
  9. Shan, T.J.; Wax, M.; Kailath, T. On Spatial Smoothing for Direction-of-Arrival Estimation of Coherent Signals. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 806–811. [Google Scholar] [CrossRef]
  10. Bialer, O.; Garnett, N.; Tirer, T. Performance Advantages of Deep Neural Networks for Angle of Arrival Estimation. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3907–3911. [Google Scholar]
  11. Liu, Z.M.; Zhang, C.; Yu, P.S. Direction-of-Arrival Estimation Based on Deep Neural Networks with Robustness to Array Imperfections. IEEE Trans. Antennas Propag. 2018, 66, 7315–7327. [Google Scholar] [CrossRef]
  12. Xiao, X.; Zhao, S.; Zhong, X.; Jones, D.L.; Chng, E.S.; Li, H. A Learning-Based Approach to Direction of Arrival Estimation in Noisy and Reverberant Environments. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 2814–2818. [Google Scholar]
  13. Takeda, R.; Komatani, K. Sound Source Localization Based on Deep Neural Networks with Directional Activate Function Exploiting Phase Information. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 405–409. [Google Scholar]
  14. Ozanich, E.; Gerstoft, P.; Niu, H. A Feedforward Neural Network for Direction-of-Arrival Estimation. J. Acoust. Soc. Am. 2020, 147, 2035–2048. [Google Scholar] [CrossRef] [PubMed]
  15. Liu, Y.; Chen, H.; Wang, B. DOA Estimation Based on CNN for Underwater Acoustic Array. Appl. Acoust. 2021, 172, 107594. [Google Scholar] [CrossRef]
  16. Li, X.; Chen, J.; Bai, J.; Ayub, M.S.; Zhang, D.; Wang, M.; Yan, Q. Deep Learning-Based DOA Estimation Using CRNN for Underwater Acoustic Arrays. Front. Mar. Sci. 2022, 9, 1027830. [Google Scholar] [CrossRef]
  17. Wang, Z.; Zhang, X.; Wang, D. Robust Speaker Localization Guided by Deep Learning-Based Time-Frequency Masking. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 178–188. [Google Scholar] [CrossRef]
  18. Xiang, H.; Chen, B.; Yang, M.; Xu, S.; Li, Z. Improved Direction-of-Arrival Estimation Method Based on LSTM Neural Networks with Robustness to Array Imperfections. Appl. Intell. 2021, 51, 4420–4433. [Google Scholar] [CrossRef]
  19. Xie, Y.; Wang, B. Data-Driven DOA Estimation Methods Based on Deep Learning for Underwater Acoustic Vector Sensor Array. Mar. Technol. Soc. J. 2023, 57, 16–29. [Google Scholar] [CrossRef]
  20. Choudhary, R.; Varshney, A.; Dahiya, S. DoA Estimation of GNSS Jamming Signal Using Tripole Vector Antenna and Random Forest Regression. In Proceedings of the 2024 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), Guwahati, India, 15–18 December 2024; pp. 1–6. [Google Scholar]
  21. Florio, A.; Avitabile, G.; Talarico, C.; Coviello, G. A Reconfigurable Full-Digital Architecture for Angle of Arrival Estimation. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 1443–1455. [Google Scholar] [CrossRef]
  22. Florio, A.; Coviello, G.; Talarico, C.; Avitabile, G. Adaptive DDS-PLL Beamsteering Architecture Based on Real-Time Angle-of-Arrival Estimation. In Proceedings of the 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 11–14 August 2024; pp. 628–631. [Google Scholar]
  23. Ozanich, E.; Gerstoft, P.; Niu, H. A Deep Network for Single-Snapshot Direction of Arrival Estimation. In Proceedings of the 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, 13–16 October 2019. [Google Scholar]
  24. Chen, H.; Zhang, J.; Jiang, B.; Cui, X.; Zhou, R.; Zhang, Y. Multi-Source Underwater DOA Estimation Using PSO-BP Neural Network Based on High-Order Cumulant Optimization. China Commun. 2023, 20, 212–229. [Google Scholar] [CrossRef]
  25. Khurjekar, I.D.; Gerstoft, P. Uncertainty Quantification for Direction-of-Arrival Estimation with Conformal Prediction. J. Acoust. Soc. Am. 2023, 154, 979–990. [Google Scholar] [CrossRef] [PubMed]
  26. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  27. Chollet, F. Deep Learning with Python; Manning Publications: Shelter Island, NY, USA, 2017. [Google Scholar]
  28. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  29. Yang, Z.; Nie, W.; Ye, L.; Cheng, G.; Yan, Y. Reliable Underwater Multi-Target Direction of Arrival Estimation with Optimal Transport Using Deep Models. J. Acoust. Soc. Am. 2024, 156, 2119–2131. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Diagram of a uniform linear array.
Figure 2. CSDMs of the simulation signal. (a) 10° incidence angle, real part; (b) 10° incidence angle, imaginary part; (c) 170° incidence angle, real part; (d) 170° incidence angle, imaginary part.
Figure 3. Array shape information.
Figure 4. Bearing trajectory diagram using the MVDR method.
Figure 5. Trajectories of four targets.
Figure 6. The architecture of CNN.
Figure 7. The architecture of LSTM.
Figure 8. Influence of kernel size on ACC and RMSE. (a) Comparison of ACC; (b) comparison of RMSE.
Figure 9. Influence of hidden unit on ACC and RMSE. (a) Comparison of ACC; (b) comparison of RMSE.
Figure 10. Training result of CNN. (a) Diagram of real direction and predicted direction of each sample; (b) the error of each sample.
Figure 11. The variation in five evaluation metrics for the CNN model with different SNRs.
Figure 12. Effect of the spacing of elements of the CNN model. (a) General results; (b) refined results.
Figure 13. Influence of the number of elements on DOA estimation performance of the CNN model.
Figure 14. The CNN model training result. (a) Scatterplot of real and predicted DOAs when the number of elements is 4; (b) test error when the number of elements is 4; (c) scatterplot of real and predicted DOAs when the number of elements is 64; (d) test error when the number of elements is 64.
Figure 15. Training result of LSTM. (a) Diagram of real direction and predicted direction of each sample; (b) the error of each sample.
Figure 16. The variation in five evaluation metrics for the LSTM model with different SNRs. (a) General result; (b) refined result.
Figure 17. Effect of the spacing of elements of the LSTM model. (a) General result; (b) refined result.
Figure 18. Influence of the number of elements on the DOA estimation performance of the LSTM model.
Figure 19. The LSTM model training result when the number of elements is 4.
Figure 20. CNN model directional results on the SWAP experimental data. (a) Direct output; (b) with the actual trajectory (red dots) overlaid for comparison.
Figure 21. LSTM model directional results on the SWAP experimental data. (a) Direct output; (b) with the actual trajectory (red dots) overlaid for comparison.
Figure 22. Comparison of the variation in MAPE and ACC with SNR. (a) Change in MAPE; (b) change in ACC.
Figure 23. Comparison of the variation in MAPE and ACC with the spacing of elements. (a) Change in MAPE; (b) change in ACC.
Figure 24. Comparison of the variation in MAPE and ACC with the number of elements. (a) Change in MAPE; (b) change in ACC.
Figure 25. Spectrum comparison under baseline configuration.
Figure 26. Comparison of ACC and RMSE under different SNRs. (a) Change in ACC; (b) change in RMSE.
Figure 27. Comparison of ACC and RMSE under different snapshots. (a) Change in ACC under 20 dB; (b) change in RMSE under 20 dB.
Figure 28. Batch inference efficiency.
Figure 29. Inference time distribution.
Figure 30. Methods performance on SWAP data. (a) CBF; (b) MVDR; (c) MUSIC; (d) ESPRIT; (e) LSTM; (f) CNN.
Table 1. Quantitative evaluation on SWAP experimental data.

Target  Method  Track Recovery Rate (%)  Detection Rate (%)  Mean Bearing Error (°)
T1      CNN     40.83                    45.41               1.64
T1      LSTM    52.75                    55.96               1.50
T2      CNN     4.20                     26.05               6.84
T2      LSTM    18.49                    42.86               5.00
T3      CNN     13.71                    21.14               6.46
T3      LSTM    1.71                     2.86                4.20
T4      CNN     8.54                     12.80               4.05
T4      LSTM    4.27                     6.71                4.64