1. Introduction
Direction-of-arrival (DOA) estimation is a cornerstone of array signal processing, with extensive applications in radar, wireless communications, electronic countermeasures, acoustic direction finding, and astronomy. It aims to precisely determine the incident angles of signals impinging on an antenna array, which is an indispensable prerequisite for subsequent operations such as target localization and tracking [1,2,3,4]. DOA estimation plays a pivotal role in state-of-the-art systems, including MIMO communications, intelligent transportation networks, UAV collaborative operations, and 6G infrastructure [5,6,7]. For example, in MIMO systems, it enables enhanced beamforming and optimized spatial multiplexing [8,9,10], while in radar, it strengthens the ability to detect and track multiple targets simultaneously. Consequently, the development of DOA estimation methods that achieve high precision and low computational complexity, particularly under adverse operating conditions, remains a central focus of academic and industrial research [11,12].
Over decades of research, a wealth of classical DOA estimation approaches have been devised. Conventional beamforming (CBF), one of the earliest techniques, estimates signal directions through the construction of spatial beam patterns [13,14,15]. Nevertheless, its effectiveness is constrained by the Rayleigh resolution criterion and by vulnerability to noise in low signal-to-noise ratio (SNR) scenarios [16,17,18]. To address these drawbacks, super-resolution subspace-based algorithms were proposed [19,20,21,22,23,24,25], including multiple signal classification (MUSIC) [21,22,23] and estimation of signal parameters via rotational invariance techniques (ESPRIT) [24,25]. MUSIC capitalizes on the orthogonality between the signal and noise subspaces, but it entails substantial computational costs and performs poorly with coherent sources. In contrast, ESPRIT enhances efficiency by exploiting the rotational invariance property across subarrays, though it still faces constraints under specific operating conditions. Additional methods, such as minimum variance distortionless response (MVDR) and maximum likelihood (ML) estimation, provide higher accuracy or theoretical optimality. However, they are frequently deemed impractical due to their heavy computational requirements, sensitivity to noise, or limitations in the number of snapshots [26,27,28,29,30,31].
While traditional DOA estimation methods can deliver satisfactory performance under specific operating conditions, they suffer from prominent limitations: degraded estimation precision in low-SNR environments, with insufficient snapshots, with coherent signals, and in intricate propagation channels. Furthermore, these approaches are burdened with excessive computational complexity and high sensitivity to parameter configuration, factors that severely impede their practical implementation in real-world systems [32,33]. In recent years, deep learning (DL) has emerged as a transformative paradigm in various domains, including computer vision, natural language processing, and speech recognition. Its ability to automatically learn hierarchical feature representations from raw data has spurred interest in applying DL techniques to DOA estimation [34,35,36,37,38]. DL-based methods can effectively capture complex nonlinear relationships within the data, thereby enhancing robustness against noise and multipath effects. For instance, convolutional neural networks (CNNs) have been employed to extract spatial features from array covariance matrices, while recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have been utilized to model temporal dependencies in sequential signal data [39,40,41,42]. Hybrid architectures combining CNNs and LSTMs have also been proposed to leverage both spatial and temporal information for improved DOA estimation accuracy [43,44]. Despite this progress, existing methods typically overlook the optimization of the covariance matrix, forcing neural networks to learn the nonlinear mapping from the complex covariance matrix directly to DOAs. This increases the learning burden on the model and makes it difficult to integrate with existing classical attention mechanisms, thereby limiting performance [39,40,41,42,43,44]. Although the array covariance attention (ACA) mechanism has been proposed to enhance DOA estimation in non-ideal noise environments, it too overlooks the optimized representation of covariance matrices and thus fails to fully exploit the potential of classical attention mechanisms in DOA estimation [45].
This paper proposes a dual-branch attention-based CNN-LSTM network (DACL-Net) for DOA estimation. The data flows into two parallel computational branches. The first is the spatial branch, which optimizes the covariance matrix through a two-dimensional Fourier transform (2D-FT) so that angles appear as peaks in the magnitude spectrum. Spatial features are then extracted via residual blocks, in which the coordinate attention (CA) mechanism enhances spatial-local perception and enables the network to focus on peak regions [46]. The second is the temporal branch. Since time series are more susceptible to noise, we incorporate a spectrum attention mechanism (SAM) ahead of the long short-term memory (LSTM) network [47], improving the noise robustness of this branch. Finally, the outputs of the two branches are fused through a linear layer to produce the DOA estimate. The main contributions of this paper are summarized as follows:
- 1.
DACL-Net introduces a novel 2D-FT-based input representation that transforms the array covariance matrix into the spatial frequency domain. This transformation effectively converts the original covariance matrix into a dark image with bright spots, where each DOA corresponds to a distinct peak in the magnitude spectrum. By leveraging this representation, the model enables classical computer vision attention mechanisms to focus on these peak regions, thereby improving feature discriminability and DOA estimation accuracy. This 2D-FT preprocessing serves as the cornerstone of our approach, integrating physical prior knowledge into the DL framework and reducing the network’s learning burden.
- 2.
A lightweight adaptive filtering module, the SAM, is employed as a preprocessor for the LSTM branch. SAM adaptively suppresses noise components in the time-domain signals through a learnable frequency-domain mask, while residual connections preserve crucial phase information. This architecture offers a novel paradigm for temporal modeling in array signal processing and can be extended to other related tasks.
- 3.
We propose an improved cross-entropy (CE) loss function, angle-weighted cross entropy (AWCE), that assigns higher weights to training samples corresponding to edge angles based on a sine-based weighting scheme. This mechanism enhances the model’s focus on challenging marginal samples, thereby improving overall estimation consistency across the entire angular range. The weighting strategy is general and can be incorporated into other loss functions beyond the one used in this work.
This paper is structured as follows. Section 2 formulates the uniform linear array (ULA) signal model and analyzes the information embedded within the spatial spectrum. Section 3 provides a detailed description of the proposed DACL-Net architecture. In Section 4, the performance of the proposed framework is evaluated through simulated experiments, and its advantages and limitations are discussed in comparison with existing methods. Section 5 concludes the paper.
2. Signal Model
Consider a ULA composed of N sensor elements arranged along a straight line with an inter-element spacing of d. Assume that M far-field narrowband signals from distinct directions impinge on the array, as depicted in Figure 1. Each signal is assumed to be uncorrelated with the others and with the additive noise. Let T denote the number of snapshots collected by each array element. The signal received by the n-th element at time t is given by

$$x_n(t) = \sum_{i=1}^{M} s_i(t)\, e^{-j 2\pi (n-1) d \sin\theta_i / \lambda} + w_n(t), \qquad n = 1, \dots, N,$$
where $\lambda = c/f$ is the wavelength of the signal, with c being the speed of light and f the signal frequency. The term $s_i(t)$ represents the complex envelope of the i-th signal at time t, $\theta_i$ is the direction of arrival of the i-th signal, and $w_n(t)$ is additive white Gaussian noise with zero mean and known variance.
The received signal vector across the array at time t can be expressed as

$$\mathbf{x}(t) = [x_1(t), x_2(t), \dots, x_N(t)]^T.$$

This leads to the compact matrix form

$$\mathbf{x}(t) = \mathbf{A}(\boldsymbol{\theta})\,\mathbf{s}(t) + \mathbf{w}(t),$$

where $\mathbf{s}(t) = [s_1(t), \dots, s_M(t)]^T$ is the signal vector, $\mathbf{w}(t) = [w_1(t), \dots, w_N(t)]^T$ is the noise vector, and $\mathbf{A}(\boldsymbol{\theta}) = [\mathbf{a}(\theta_1), \dots, \mathbf{a}(\theta_M)]$ is the array manifold matrix. The steering vector $\mathbf{a}(\theta)$ for a signal arriving from angle $\theta$ is defined as

$$\mathbf{a}(\theta) = \left[1,\ e^{-j 2\pi d \sin\theta/\lambda},\ \dots,\ e^{-j 2\pi (N-1) d \sin\theta/\lambda}\right]^T.$$
In practical scenarios, the array covariance matrix $\mathbf{R} = E[\mathbf{x}(t)\mathbf{x}^H(t)]$ is estimated using the sample covariance matrix

$$\hat{\mathbf{R}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}(t)\,\mathbf{x}^H(t),$$

where $(\cdot)^H$ denotes the conjugate transpose. The spatial spectrum is then constructed by scanning over a predefined angular grid $\{\theta_1, \dots, \theta_C\}$. For each candidate angle $\theta$, the steering vector $\mathbf{a}(\theta)$ is used to compute the beam output amplitude

$$P(\theta) = \mathbf{a}^H(\theta)\, \hat{\mathbf{R}}\, \mathbf{a}(\theta).$$
The resulting spatial spectrum exhibits prominent peaks near the true DOAs. This characteristic forms the foundation of subspace-based methods, where the identification of these spectral peaks is essential for accurate direction finding.
To facilitate supervised learning, the ground-truth DOA information is encoded as a binary label vector $\mathbf{z} \in \{0, 1\}^C$ over the angular grid, where

$$z_c = \begin{cases} 1, & \text{if a source is located at grid angle } \theta_c, \\ 0, & \text{otherwise.} \end{cases}$$
This labeling scheme enables the network to distinguish signal components from noise during training and enhances its ability to accurately localize source directions in the spatial spectrum.
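To make the signal model and labeling scheme concrete, the following NumPy sketch simulates ULA snapshots, forms the sample covariance matrix from the compact matrix form above, and builds the binary label vector over a 1° grid. The array size, grid limits, and SNR here are illustrative choices, not the exact configuration used later in the experiments:

```python
import numpy as np

def ula_snapshots(angles_deg, n_sensors=10, n_snapshots=100, snr_db=0,
                  d_over_lambda=0.5, rng=None):
    """Simulate far-field narrowband snapshots X (N x T) for a ULA."""
    rng = np.random.default_rng(rng)
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    n = np.arange(n_sensors)[:, None]
    # Array manifold A = [a(theta_1), ..., a(theta_M)] with Vandermonde columns.
    A = np.exp(-2j * np.pi * n * d_over_lambda * np.sin(theta)[None, :])
    M, T = len(theta), n_snapshots
    # Unit-power complex envelopes plus white Gaussian noise at the given SNR.
    S = (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))) / np.sqrt(2)
    sigma2 = 10.0 ** (-snr_db / 10.0)
    W = np.sqrt(sigma2 / 2) * (rng.standard_normal((n_sensors, T))
                               + 1j * rng.standard_normal((n_sensors, T)))
    return A @ S + W

def sample_covariance(X):
    """Sample covariance R_hat = (1/T) * sum_t x(t) x(t)^H."""
    return X @ X.conj().T / X.shape[1]

def one_hot_label(angles_deg, grid_deg):
    """Binary label vector z over the angular grid."""
    z = np.zeros(len(grid_deg))
    for a in angles_deg:
        z[int(np.argmin(np.abs(grid_deg - a)))] = 1.0
    return z
```

With two sources, the resulting covariance matrix is Hermitian and the label vector contains exactly two ones, one per source direction.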
3. Proposed Method
This paper presents a DL-based DOA estimation method. The proposed architecture employs a dual-branch design. One branch performs feature optimization via a 2D-FT, followed by residual blocks equipped with a CA module. The other branch consists of an LSTM network and SAM. SAM adaptively filters out noise from the input, and LSTM extracts temporal features. We first introduce each component individually, then describe the integrated architecture composed of these components, and finally present an optimized CE loss function.
3.1. Two-Dimensional Fourier Transform
For a ULA receiving M far-field signals, the ideal covariance matrix can be expressed as $\mathbf{R} = \sum_{i=1}^{M} \sigma_i^2\, \mathbf{a}(\theta_i)\mathbf{a}^H(\theta_i) + \sigma_n^2 \mathbf{I}$, where $\mathbf{a}(\theta_i)$ is the steering vector. Under the far-field narrowband assumption, the steering vector has a Vandermonde structure: $\mathbf{a}(\theta_i) = [1,\ e^{-j 2\pi d \sin\theta_i/\lambda},\ \dots,\ e^{-j 2\pi (N-1) d \sin\theta_i/\lambda}]^T$. The 2D-FT of the outer product $\mathbf{a}(\theta_i)\mathbf{a}^H(\theta_i)$ essentially computes

$$F_i(u, v) = \sum_{m=0}^{N-1} e^{-j 2\pi m \left(\frac{d \sin\theta_i}{\lambda} + \frac{u}{N}\right)} \sum_{n=0}^{N-1} e^{-j 2\pi n \left(\frac{v}{N} - \frac{d \sin\theta_i}{\lambda}\right)}.$$

This results in energy concentration around the spatial frequency pair $(u_i, v_i)$ satisfying

$$\frac{u_i}{N} \equiv -\frac{d \sin\theta_i}{\lambda} \pmod 1, \qquad \frac{v_i}{N} \equiv \frac{d \sin\theta_i}{\lambda} \pmod 1,$$

forming a localized high-energy region (peak) in the magnitude spectrum. Areas without signal sources correspond to noise-only components, which under the white Gaussian noise assumption yield relatively flat and low-magnitude responses, appearing as dark regions.
The mathematical formulation of the 2D-FT applied to the covariance matrix $\hat{\mathbf{R}}$ is given by

$$F(u, v) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \hat{R}(m, n)\, e^{-j 2\pi \left(\frac{um}{M} + \frac{vn}{N}\right)},$$

where $\hat{R}(m, n)$ denotes the element at the m-th row and n-th column of the estimated covariance matrix, M and N represent its dimensions, and $(u, v)$ are the spatial frequency indices. The magnitude spectrum $|F(u, v)|$ is then computed. In this spectral representation, each DOA corresponds to a localized high-energy region, while areas without signal sources remain relatively dark. This structured representation allows subsequent convolutional layers and attention modules to effectively localize and emphasize the angular information, thereby reducing the learning burden of the network and improving feature discriminability. The tensor formed by concatenating the phase spectrum $\angle F(u, v)$ and the magnitude spectrum $|F(u, v)|$ serves as the input to the spatial branch, preserving complete spatial frequency information.
3.2. Coordinate Attention
The CA module is integrated into the residual block to enhance the model’s capability to capture directional features of sound sources. This module decomposes the input feature map into a pair of direction-aware feature vectors by performing pooling operations separately along the two spatial dimensions. These vectors are then encoded into a pair of attention maps, with each map capturing contextual information from one directional perspective of long-range spatial dependencies in the input feature map. Through this decomposition transformation, the module effectively captures remote dependencies in one spatial direction while preserving precise positional information in the other. Finally, the resulting two attention maps are applied to the input feature map to emphasize feature information relevant to the target sound source’s direction, thereby improving the accuracy and robustness of DOA estimation. The core algorithm is outlined in Algorithm 1.
| Algorithm 1 Coordinate Attention (CA) |
- Input: input feature map $X \in \mathbb{R}^{C \times H \times W}$
- Output: output feature map $Y \in \mathbb{R}^{C \times H \times W}$
- 1: $z^h \leftarrow \mathrm{AvgPool}_W(X)$, $z^w \leftarrow \mathrm{AvgPool}_H(X)$ # global average pooling along the width and height directions yields the height and width descriptors
- 2: $f \leftarrow \delta\big(\mathrm{Conv1D}([z^h; z^w])\big)$ # concatenate and transform via a shared 1D convolution
- 3: $g^h \leftarrow \sigma(F_h(f^h))$, $g^w \leftarrow \sigma(F_w(f^w))$ # split and apply sigmoid activation to obtain the height and width attention maps
- 4: $Y \leftarrow X \odot g^h \odot g^w$ # apply attention weights to the input features
|
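A NumPy sketch of the coordinate attention steps is given below; for simplicity, the shared 1D convolution and the two per-direction transforms are replaced by plain matrix multiplications (`W1`, `W2h`, `W2w` are illustrative stand-ins, not the exact parameterization of the CA module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(X, W1, W2h, W2w):
    """Minimal coordinate attention over a (C, H, W) feature map.
    W1: (C_red, C) shared reduction; W2h/W2w: (C, C_red) per-direction expansion."""
    C, H, W = X.shape
    zh = X.mean(axis=2)                    # (C, H): pool along the width
    zw = X.mean(axis=1)                    # (C, W): pool along the height
    y = np.concatenate([zh, zw], axis=1)   # (C, H+W): joint directional encoding
    y = np.maximum(W1 @ y, 0.0)            # shared transform + ReLU
    gh = sigmoid(W2h @ y[:, :H])           # (C, H): height attention map
    gw = sigmoid(W2w @ y[:, H:])           # (C, W): width attention map
    return X * gh[:, :, None] * gw[:, None, :]
```

Because both attention maps lie in (0, 1), the output is an element-wise reweighting of the input that preserves its shape.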
3.3. Spectrum Attention Mechanism
Since the original time-domain signals are more affected by noise than the covariance matrix, the SAM module is designed to adaptively remove noise from the input signals in the frequency domain. It operates by transforming the input signal into the frequency domain using the discrete Fourier transform (DFT). A learnable mask is then applied to weight different frequency components, allowing the network to emphasize informative frequencies while suppressing noise. Finally, the inverse DFT (IDFT) is used to convert the filtered frequency-domain representation back into the time domain. The core algorithm is outlined in Algorithm 2.
| Algorithm 2 Spectrum Attention Mechanism (SAM) |
- Input: captured signal sequence $x$
- Output: filtered sequence $\tilde{x}$
- 1: Initialize: all-ones learnable mask $m$
- 2: $X_f \leftarrow \mathrm{DFT}(x)$ # transform the input series into the frequency domain
- 3: $X_f' \leftarrow X_f \odot m$ # element-wise multiplication by the learnable mask
- 4: $\tilde{x} \leftarrow \mathrm{IDFT}(X_f')$ # transform $X_f'$ back into the time domain
|
Since the SAM operates on the entire frequency domain of the signal, it can lead to the loss of phase information when the result is transformed back to the time domain via the IDFT. To mitigate this, we integrate the module into the network using the residual connection scheme

$$y = x + \mathrm{SAM}(x).$$

This design preserves the crucial phase information contained in the original input. Furthermore, by allowing the filter to learn the frequency components of the noise indirectly, rather than a direct mapping to the clean signal, this skip connection facilitates an easier learning process.
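The SAM filtering and its residual connection can be sketched in a few lines of NumPy; the mask here is an ordinary array standing in for the learnable parameter, initialized to all ones as in Algorithm 2:

```python
import numpy as np

def sam_filter(x, mask):
    """Spectrum attention with a residual connection.
    mask: real array of length len(np.fft.rfft(x)), learnable in the real model."""
    Xf = np.fft.rfft(x)                         # to the frequency domain
    x_filt = np.fft.irfft(Xf * mask, n=len(x))  # masked, back to the time domain
    return x + x_filt                           # skip connection keeps the input intact
```

With an all-ones mask the filtered branch reproduces the input exactly (output 2x), and with an all-zeros mask the skip connection alone passes the input through unchanged.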
3.4. CNN-LSTM
The CNN-LSTM hybrid architecture can be structured in either serial or parallel configurations. Our model employs a parallel design to integrate spatial and temporal features, fusing these complementary representations to enhance DOA estimation accuracy. Below we provide separate introductions to the CNN and LSTM components.
3.4.1. Convolutional Neural Network
In its fundamental form, a CNN comprises stacked convolutional layers that progressively extract hierarchical features from input data. As illustrated in Figure 2, these layers are often interleaved with downsampling operations to enhance computational efficiency and expand receptive fields. In our architecture, the conventional convolutional stack is replaced by residual blocks to facilitate deeper network design while maintaining training stability.
- (1)
Convolutional layer
The feature extraction operation using convolutional kernels is mathematically expressed as

$$x_j^{(\ell)}(u, v) = \sigma\!\left(\sum_{k=1}^{K} \left(w_{kj}^{(\ell)} * x_k^{(\ell-1)}\right)(u, v) + b\right), \qquad n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} + 2p - f}{s} \right\rfloor + 1,$$

where $x_k^{(\ell-1)}$ and $x_j^{(\ell)}$ denote the input and output of the $\ell$-th layer, respectively, K indicates the channel count, $x_k^{(\ell-1)}(u, v)$ represents the pixel value at position $(u, v)$, b denotes the bias term, $w_{kj}^{(\ell)}$ corresponds to the convolutional kernel weights at layer $\ell$, $n_{\text{out}}$ specifies the output size, s and f indicate the stride and kernel size, respectively, and p refers to the padding size.
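The output-size relation above can be verified with a short helper:

```python
def conv_output_size(n_in, f, s=1, p=0):
    """Spatial output size of a convolution: floor((n_in + 2p - f) / s) + 1."""
    return (n_in + 2 * p - f) // s + 1
```

For example, a 3x3 kernel with stride 1 and padding 1 preserves a 32-wide input, while the same kernel with stride 2 halves it to 16, which is exactly how the residual blocks below perform downsampling without pooling layers.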
- (2)
Residual block
To overcome the limitations of plain CNNs in deep architectures, we employ residual blocks that incorporate skip connections. These connections enable direct feature propagation across layers, alleviating gradient vanishing and enabling the construction of deeper networks. The core operation of a residual block is formulated as

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x},$$

where $\mathbf{x}$ and $\mathbf{y}$ are the input and output vectors of the block, and $\mathcal{F}(\mathbf{x})$ represents the residual mapping to be learned. In our design, the residual path consists of convolutional layers and CA modules, which enhance the spatial feature discriminability crucial for DOA estimation. Downsampling is achieved by convolutional layers with a stride of 2 within these residual blocks, eliminating the need for separate pooling layers.
3.4.2. Long Short-Term Memory Network
Recurrent neural networks have evolved through numerous architectural innovations, with LSTM representing one of the most significant developments. The LSTM unit incorporates four key components: a forget gate $f_t$ with parameters $(W_f, b_f)$, an update gate $u_t$ with parameters $(W_u, b_u)$, an output gate $o_t$ with parameters $(W_o, b_o)$, and a candidate state component $\tilde{c}_t$ with parameters $(W_c, b_c)$. The architectural diagram appears in Figure 3.
Let $x_t$, $c_t$, and $h_t$ represent the input, cell state, and hidden state of the current timestep, while $h_{t-1}$ and $c_{t-1}$ denote the hidden state and cell state from the previous timestep. The computational procedures for the gates and states at time t are formulated as follows:

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \qquad u_t = \sigma(W_u[h_{t-1}, x_t] + b_u), \qquad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o),$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + u_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t).$$
In DOA estimation, the signals received by the sensor array not only exhibit spatial correlation but also demonstrate temporal correlation. LSTM, through its gating mechanism, can effectively capture this temporal dependency, thereby enhancing estimation stability under low SNR conditions.
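A single LSTM timestep following the standard gate equations can be sketched in NumPy as follows (each weight matrix acts on the concatenation of the previous hidden state and the current input; the shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Wu, Wo, Wc, bf, bu, bo, bc):
    """One LSTM timestep; each W maps the concatenated vector [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)            # forget gate
    u_t = sigmoid(Wu @ z + bu)            # update (input) gate
    o_t = sigmoid(Wo @ z + bo)            # output gate
    c_tilde = np.tanh(Wc @ z + bc)        # candidate cell state
    c_t = f_t * c_prev + u_t * c_tilde    # new cell state
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t
```

Since the output gate and tanh both saturate, every component of the hidden state stays strictly inside (-1, 1), which keeps the recurrence numerically stable over long snapshot sequences.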
3.5. Integrated Architecture
By integrating the aforementioned components in the manner illustrated in Figure 4, our overall architecture is constructed. Specifically, the input data flows into two parallel branches. The first branch uses a 2D-FT to refine the input covariance matrix; the magnitude and phase information of the refined data are then processed through residual blocks embedded with CA modules. The second branch extracts temporal features via a SAM-based LSTM. The outputs of the two branches are then fused by linear layers to produce the final DOA estimation results.
The described fusion structure is named DACL-Net, which offers the following advantages:
- (1)
DACL-Net is built upon a feature transformation, namely the two-dimensional Fourier transform, which converts cross-correlation information into spatial power distribution. Angles manifest as peaks in the spatial power distribution, resembling bright spots in an image. This enables classical attention mechanisms from the image domain to be effectively utilized. Essentially, this integrates physical prior knowledge from array signal processing into the neural network, significantly reducing the training burden and improving estimation accuracy.
- (2)
Existing DL-based DOA estimators typically consider only spatial correlation features while neglecting temporal sequence characteristics. DACL-Net, based on a spatio-temporal feature extraction baseline model, integrates dual-branch features, thereby enhancing the robustness and accuracy of DOA estimation.
- (3)
Although DACL-Net incorporates attention mechanisms and a dual-branch architecture, its overall parameter count remains within a reasonable range. Specifically, the SAM module includes only learnable mask parameters; the CA module achieves lightweight attention computation via one-dimensional convolutions; and the convolutional layers within the residual blocks all employ small-sized kernels. Compared with existing deep CNN and Transformer-based models, DACL-Net maintains high accuracy while exhibiting lower computational complexity and memory usage, making it more suitable for deployment on real-time processing platforms.
3.6. Loss Function
In classification tasks, the CE loss is typically employed; its gradient computation is straightforward, and compared to the mean squared error (MSE) loss commonly used in regression, it is generally easier to optimize during training and often converges faster. In the classification formulation of DOA estimation, angles are discretized into a finite set of categories. The core challenge is that the model fits edge-angle samples significantly less accurately than middle angles, owing to the degradation of array manifold characteristics, reduced effective SNR, and related issues. To address this problem, this paper designs an AWCE loss function, which strengthens the model's attention to edge-angle samples by assigning adaptive weights to different angle categories.
3.6.1. Basic Cross Entropy Loss
The standard cross entropy loss for classification tasks is defined as

$$L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c},$$

where N is the number of samples in a batch, C is the total number of discrete angle categories, $y_{i,c}$ is the one-hot encoded ground-truth label of sample i for category c, and $p_{i,c}$ is the probability predicted by the model that sample i belongs to category c. The standard cross entropy loss assigns equal weights to all angle categories and cannot specifically improve the fitting of edge angles, so an angle-dependent weighting mechanism needs to be introduced.
3.6.2. Design of the Angle Weighting Mechanism
Considering the angular characteristics of the ULA, this paper proposes that the weight function is proportional to the absolute value of the sine of the angle, with the following specific form:
where
is the ground-truth DOA angle of the sample, and
is the weight adjustment coefficient. The core characteristics of this weighting mechanism are as follows: (1) when
(middle angle),
, and the weight
, which is consistent with the weight of the standard cross entropy loss. (2) When
(edge angles),
, and the weight
, which amplifies the loss contribution of edge samples and makes the model prioritize learning the features of such samples. (3) The adjustment coefficient
a can flexibly control the degree of weight enhancement for edge angles: a larger
a leads to a more significant weight difference between edge angles and middle angles, and the model has a higher fitting priority for edge samples.
For the discrete angle category c, its corresponding angle value is , so the weight of category c can be expressed as , and the weight matrix is composed of corresponding to all categories.
3.6.3. Angle-Weighted Cross Entropy Loss Function
Combining the above weighting mechanism, the final loss function is defined as

$$L_{\mathrm{AWCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} w_c\, y_{i,c} \log p_{i,c}.$$

To avoid the change in loss scale introduced by the weights, they can be normalized as

$$\tilde{w}_c = \frac{C\, w_c}{\sum_{c'=1}^{C} w_{c'}}.$$

The normalized weights ensure that the overall scale of the loss function is consistent with the standard cross entropy loss, while retaining the weighting effect on edge angles.
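A NumPy sketch of the AWCE loss with mean-normalized weights is given below; with the adjustment coefficient set to zero it reduces to the standard cross entropy:

```python
import numpy as np

def awce_loss(probs, labels, angles_deg, a=1.0):
    """Angle-weighted cross entropy with w(theta) = 1 + a * |sin(theta)|.
    probs, labels: (batch, C); angles_deg: (C,) grid angle of each category."""
    w = 1.0 + a * np.abs(np.sin(np.deg2rad(angles_deg)))  # per-category weight
    w = w * len(w) / w.sum()                              # normalize mean weight to 1
    return -np.mean(np.sum(w * labels * np.log(probs + 1e-12), axis=1))
```

For the same predicted probability on the true class, an edge-angle sample now contributes a larger loss than a middle-angle one, which is the intended prioritization.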
4. Simulation Results
4.1. Dataset Generation
In our experiments, acoustic vector sensors (AVS) are employed as the sensing method, and the ULA model described in Section 2 is adopted for dataset acquisition. Specifically, the signal type is a single-frequency signal, and the noise is additive white Gaussian noise. The sound velocity v is set to 1500 m/s, and the wavelength $\lambda$ is 1 m. The array consists of 10 sensors with an inter-element spacing of d. Under these conditions, two sources impinge on the ULA with random angular separations of 1°, 2°, 3°, 4°, or 5°, distributed across the entire DOA range, thereby establishing an extremely close-spaced scenario. To enhance data diversity, the SNR is varied from −20 dB to 20 dB in 5 dB increments. The number of snapshots per sample is T. We generated a complete dataset comprising 18,000 samples, covering the various angular separations and noise levels. For model parameter updates, the Adam optimizer is employed with an initial learning rate of 0.0001. The batch size is set to 1800, and training is conducted over 1000 epochs. The program is implemented in PyTorch 2.1.2 and executed on a hardware platform equipped with an Intel(R) Core(TM) i9-14900K CPU @ 3.20 GHz and an NVIDIA GeForce RTX 4090 GPU.
4.2. Performance of DOA Estimation Model
Based on the aforementioned data-generation method, the test set was constructed in the same manner. Far-field narrowband independent signal samples with an identical SNR of 0 dB and an angular separation of 1° were selected from the test set. The DOA estimates for each test sample were computed using both existing algorithms and our proposed method [22,24,39,42,44,48]. To ensure a fair and comprehensive comparison, all DL baseline models were trained and optimized under identical experimental conditions: the same dataset split, the same optimizer (Adam) with an identical initial learning rate, and the same number of training epochs, so that each model could achieve its best possible performance. For classical algorithms such as MUSIC and ESPRIT, we adopted widely recognized standard implementations with optimal parameter settings. For MUSIC, the true number of signal sources was provided, and the eigenvalue decomposition method was employed. For methods requiring an angular search, we used a fine grid matching the classification resolution of the neural network to ensure comparable angular resolution across all approaches. The simulation results are presented in Figure 5a–h. The solid line indicates the actual DOA, while the estimated DOA is illustrated by the colored blocks.
As shown in Figure 5a,b, conventional algorithms such as MUSIC and ESPRIT perform poorly at 0 dB SNR, exhibiting highly unstable DOA results, especially near the grid boundaries. It can be observed from Figure 5c,d that the iterative algorithms IMLSE and ILSSE improve on conventional methods, owing to iterative optimization strategies that enhance robustness against noise and improve estimation stability under low SNR conditions. Figure 5e–h demonstrates that DL-based algorithms achieve the best prediction performance, benefiting from their powerful end-to-end feature learning capability to extract discriminative spatial-spectral features directly from the data. In comparison, our proposed DACL-Net yields predictions closest to the ideal line, achieving high estimation accuracy for both central and edge samples. This balanced performance can be attributed to the proposed AWCE loss function. By assigning higher weights to the training losses of edge-angle samples via the sine weighting mechanism, the AWCE loss effectively addresses the inherent difficulty of classifying marginal angles caused by degraded array manifold characteristics. This weighting strategy ensures that the model allocates sufficient learning capacity to these challenging cases, improving estimation consistency across the entire angular range. Additionally, Figure 6a–h displays the prediction errors of each method on the test set samples as scatter plots. These results more clearly demonstrate the superior performance of DACL-Net in low SNR environments. At 0 dB SNR, DACL-Net achieves an RMSE below 0.04°, outperforming the other models.
4.3. Statistical Performance Analysis
To verify the statistical performance of each algorithm, the root mean square error (RMSE) is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{KM} \sum_{k=1}^{K} \sum_{m=1}^{M} \left(\hat{\theta}_{m,k} - \theta_m\right)^2},$$

where K denotes the number of Monte Carlo trials, M represents the number of signal sources, $\hat{\theta}_{m,k}$ is the estimated DOA of the m-th source in the k-th trial, and $\theta_m$ is the corresponding ground-truth direction.
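A direct implementation of this metric:

```python
import numpy as np

def rmse(theta_hat, theta_true):
    """RMSE over K trials and M sources; theta_hat is (K, M), theta_true is (M,)."""
    err = theta_hat - theta_true[None, :]
    return np.sqrt(np.mean(err ** 2))
```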
4.3.1. Impact of Signal-to-Noise Ratio on Root Mean Square Error
In this experiment, we systematically evaluate the robustness of various algorithms under different SNR conditions. The SNR values for all test samples range from −20 dB to 20 dB with a 5 dB increment, resulting in nine distinct SNR scenarios. To comprehensively assess the estimation error performance under different snapshot conditions, we conduct separate experiments with fixed snapshot numbers of 50, 100, 200, and 500.
As illustrated in Figure 7a–d, all methods exhibit decreasing RMSE trends as the number of snapshots increases. However, our proposed DACL-Net demonstrates a more significant improvement, which can be attributed to its LSTM architecture and the accompanying SAM, which effectively leverages the abundant temporal information. Table 1 lists the specific data of Figure 7d. Notably, under challenging low-SNR conditions (SNR = −5 dB), DACL-Net maintains superior performance with an RMSE of about 1°. This enhanced robustness stems from the integrated adaptive noise filtering module and the attention mechanism that strengthens spatio-temporal feature extraction, enabling more reliable DOA estimation in adverse signal environments.
The superior performance of DACL-Net, and learning-based methods in general, over classical algorithms such as MUSIC and ESPRIT under low SNR conditions can be explained by several key mechanisms. First, classical subspace-based methods rely on accurate estimation of the signal and noise subspaces through eigenvalue decomposition of the sample covariance matrix. In low SNR regimes, the noise subspace becomes dominant, and its orthogonality to the signal subspace is compromised, leading to degraded spectral peaks and increased estimation errors. In contrast, DACL-Net does not depend on explicit subspace decomposition. Instead, it learns a direct mapping from the input data to DOA estimates through hierarchical feature extraction. The 2D-FT preprocessing step transforms the covariance matrix into a spatial frequency representation where signal directions manifest as localized energy peaks, effectively enhancing the SNR in the feature domain. Additionally, the SAM module acts as an adaptive frequency-domain filter, suppressing noise components in the temporal branch, while the CA module in the spatial branch focuses attention on relevant peak regions. This data-driven approach allows the network to capture complex, nonlinear relationships between the received signals and source directions, which are often obscured by noise in classical methods. Furthermore, the model is trained on a diverse dataset encompassing a wide range of SNRs and angular configurations, enabling it to generalize to challenging low-SNR scenarios that are problematic for traditional algorithms. Therefore, DACL-Net’s ability to integrate spatial and temporal features, coupled with attention-guided noise suppression, provides a principled explanation for its robustness in low SNR environments. However, at an SNR of −20 dB, the accuracy of the proposed model deteriorates to a level similar to that of the standard CNN-LSTM baseline.
4.3.2. Impact of Signal-to-Noise Ratio on Estimation Accuracy
The experimental parameters in this investigation remain consistent with the previous experiment, while a new evaluation metric is adopted. We define a prediction as correct only when both $\theta_1$ and $\theta_2$ are accurately estimated. The estimation accuracy is calculated as the proportion of correctly predicted samples within the test set. Figure 8a–d presents the estimation accuracy of DACL-Net compared with other benchmark algorithms. The results demonstrate that our method achieves superior performance under low SNR conditions, attaining the highest angular classification accuracy among all evaluated approaches. Table 2 lists the specific data of Figure 8d. Notably, DACL-Net achieves an accuracy of over 95% at an SNR of 0 dB.
4.4. Ablation Study
To verify the effectiveness of each core component in the proposed DACL-Net, we conduct systematic ablation experiments. The baseline model is a standard CNN-LSTM hybrid network without SAM, 2D-FT optimization, the CA module, or the AWCE loss. Four ablation variants are designed by removing individual components, and all models are trained and tested under the same experimental settings (SNR range: −20 dB to 20 dB, fixed snapshot number, angular separation: 1°–5°). The RMSE and estimation accuracy at key SNR points (−10 dB, 0 dB, 10 dB) are adopted as evaluation metrics to quantify the contribution of each component.
4.4.1. Ablation Variants Definition
1. DACL-Net (full model): integrates SAM, 2D-FT, the CA module, and AWCE loss.
2. Variant 1 (w/o SAM): removes the SAM.
3. Variant 2 (w/o 2D-FT): removes the 2D-FT optimization.
4. Variant 3 (w/o CA): retains 2D-FT but removes the CA module.
5. Variant 4 (w/ CE loss): replaces AWCE loss with standard CE loss.
4.4.2. Ablation Experimental Results
The performance of all ablation variants is summarized in
Table 3 and
Table 4. All values are averaged over 10 Monte Carlo trials to ensure statistical reliability.
Compared to the full model, Variant 1 (w/o SAM) shows a 49.4% RMSE increase and a 31.4% accuracy decrease at −10 dB SNR. This confirms that the adaptive noise filtering capability of SAM effectively suppresses noise interference in low SNR environments, laying a foundation for subsequent feature extraction. Variant 2 (w/o 2D-FT) exhibits the most significant performance degradation among all variants, with RMSE increased by approximately 110.8% and accuracy decreased by approximately 45.8% at −10 dB SNR. This underscores the pivotal role of the 2D-FT input transformation in optimizing spatial feature representation and forming the dark image with bright spots that enable effective attention mechanism operation. Variant 3 (w/o CA) performs worse than the full model, with RMSE increased by 8.5% and accuracy decreased by 7.2% at −10 dB SNR. This verifies that the CA module further refines the spatially optimized features from 2D-FT, improving the ability to capture directional information of target sources and enhance peak localization. Variant 4 (w/ CE Loss) has higher RMSE and lower accuracy than the full model, with RMSE increased by 40.0% and accuracy decreased by 25.7% at −10 dB SNR. This demonstrates that the sine-based weighting mechanism of AWCE loss effectively enhances the attention to edge-angle samples, alleviating the problem of edge sample misclassification caused by array manifold degradation.
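The sine-based weighting idea behind AWCE loss can be sketched as a weighted cross-entropy in which samples near the edge angles (where the ULA manifold degrades toward endfire) receive larger weights. The exact AWCE formula is not reproduced in this excerpt, so `awce_weights`, the scaling factor `alpha`, and the angle grid below are hypothetical illustrations of the mechanism, not the paper's definition.

```python
import numpy as np

def awce_weights(angles_deg, alpha=1.0):
    """Hypothetical sine-based sample weights: larger magnitude near
    the edge angles, smallest at broadside (0 degrees)."""
    return 1.0 + alpha * np.abs(np.sin(np.deg2rad(angles_deg)))

def awce_loss(probs, labels, grid_deg):
    """Sine-weighted cross-entropy over angle classes (sketch)."""
    w = awce_weights(grid_deg[labels])
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(w * ce)

grid = np.arange(-60, 61, 1.0)            # assumed 1-degree angle grid
probs = np.full((2, len(grid)), 1e-6)
probs[0, 0] = probs[1, 60] = 1.0          # confident predictions
labels = np.array([0, 60])                # true classes: -60 deg (edge) and 0 deg
loss = awce_loss(probs, labels, grid)
print(awce_weights(np.array([-60.0, 0.0])))  # edge sample weighted ~1.87x vs 1.0
print(loss)
```

The weighting leaves correct confident predictions essentially unpenalized while amplifying the gradient contribution of misclassified edge-angle samples.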
In conclusion, all core components of DACL-Net play important roles in improving DOA estimation performance. The 2D-FT transformation proves to be the most critical component for feature optimization, while SAM provides essential noise robustness in challenging environments. The CA module and AWCE loss further refine spatial feature extraction and training efficiency.
4.5. Computational Efficiency Evaluation
To evaluate the computational efficiency of the proposed DACL-Net, we additionally constructed a test set containing 10,000 samples with a fixed snapshot number of 500, employing varying angles and SNR levels. All randomly generated samples were processed using different methods. The evaluation is conducted on the same hardware platform described in
Section 4.1, ensuring consistency across all methods. The total prediction time is recorded. The results are summarized in
Figure 9 and
Table 5.
Compared to the other three DL-based methods, DACL-Net requires longer training time due to its more complex architecture. This is mainly attributed to its LSTM branch, which takes raw long-sequence signals as direct input. While traditional methods bypass the training phase altogether, experimental results confirm that well-trained DL models can achieve efficient DOA estimation even with moderate computational resources. Although DACL-Net’s structural complexity leads to slightly longer inference time compared to the original CNN-LSTM, it still maintains substantially faster computation than DCNN and Res-CNN. This enables our method to deliver both real-time performance and high measurement accuracy.
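The efficiency protocol above (total prediction time over a fixed test set on one platform) can be mirrored with a short timing harness. The `dummy_model` stand-in and the sample count of 100 are placeholders; the paper uses the trained networks and 10,000 samples with 500 snapshots each.

```python
import time
import numpy as np

def time_inference(predict_fn, samples):
    """Total wall-clock prediction time over a test set, summed across
    all samples, as in the efficiency comparison."""
    t0 = time.perf_counter()
    for x in samples:
        predict_fn(x)
    return time.perf_counter() - t0

# Stand-in 'model': any callable mapping a snapshot matrix to an estimate
dummy_model = lambda x: np.argmax(np.abs(np.fft.fft(x.mean(axis=1))))
samples = [np.random.randn(8, 500) + 1j * np.random.randn(8, 500)
           for _ in range(100)]           # 100 samples here; 10,000 in the paper
elapsed = time_inference(dummy_model, samples)
print(f"total prediction time: {elapsed:.3f} s")
```

Keeping the hardware and test set identical across methods, as done in the paper, is what makes such total-time figures comparable.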
4.6. Generalization Ability with Multiple Sources
To evaluate the generalization capability of DACL-Net in scenarios with more than two sources, we conducted additional experiments involving three, four, and five far-field uncorrelated narrowband signals. The array configuration remains the same as described in
Section 4.1, with
sensors and inter-element spacing
. The DOAs of the sources are randomly generated within
with a minimum angular separation of
to simulate closely spaced sources. Three representative SNR levels are considered: −10 dB, 0 dB, and 10 dB. The number of snapshots is fixed at
. For each source count, 2000 test samples are generated.
The performance is evaluated using estimation accuracy, where a prediction is considered correct only if all source DOAs are correctly estimated. The results are compared with two representative DL baselines: Res-CNN and CNN-LSTM. The results are summarized in
Table 6.
The results demonstrate that DACL-Net consistently outperforms both Res-CNN and CNN-LSTM across all source counts and SNR levels, especially under low SNR conditions. As the number of sources increases, all methods exhibit performance degradation due to increased spatial interference and higher model complexity. However, DACL-Net shows better robustness, with a smaller drop in accuracy compared to the baselines. This indicates that the proposed dual-branch architecture with attention mechanisms effectively extracts and fuses spatio-temporal features even in more challenging multi-source scenarios.
4.7. Physical Interpretability
A key advantage of DACL-Net lies in its enhanced physical interpretability compared to purely data-driven deep learning approaches for DOA estimation. By incorporating the 2D-FT preprocessing step, the model explicitly leverages the known structure of the array covariance matrix under the far-field narrowband assumption. This transformation maps the original complex-valued correlation data into a spatial frequency domain representation where signal directions manifest as distinct spectral peaks, analogous to bright spots on a dark image. This representation is not arbitrary. It directly corresponds to the spatial Fourier transform of the array manifold, a well-established concept in array signal processing. Consequently, the subsequent CNN and attention modules operate on a physically meaningful feature space, allowing the network to focus on regions of high spatial energy corresponding to potential source directions. This design choice bridges the gap between classical signal processing theory and deep learning, providing a clearer pathway to understand how the network arrives at its estimates. Furthermore, the SAM module’s frequency-domain masking operation can be interpreted as an adaptive noise suppressor, learning to attenuate frequency bins dominated by noise while preserving signal components. This imbues the temporal branch with a degree of interpretability regarding its noise robustness.
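The interpretation of SAM as an adaptive frequency-domain filter can be sketched as follows: a mask multiplies the FFT of the temporal signal to attenuate noise-dominated bins before transforming back. The mask here is hand-crafted for illustration; in DACL-Net it is learned from data.

```python
import numpy as np

def sam_filter(x, mask):
    """Frequency-domain masking: attenuate noise-dominated bins of the
    temporal signal and transform back (SAM-style interpretation)."""
    return np.fft.ifft(np.fft.fft(x) * mask)

np.random.seed(0)
N = 256
t = np.arange(N)
clean = np.cos(2 * np.pi * 12 * t / N)    # narrowband component at bin 12
noisy = clean + 0.5 * np.random.randn(N)

# Hypothetical mask keeping only the bins carrying the signal
mask = np.zeros(N)
mask[[12, N - 12]] = 1.0
denoised = np.real(sam_filter(noisy, mask))
print(np.mean((denoised - clean) ** 2) < np.mean((noisy - clean) ** 2))  # True
```

A learned mask plays the same role without requiring the signal frequency to be known in advance, which is the source of the noise robustness attributed to the temporal branch.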
4.8. Limitations
Despite the strong performance of DACL-Net, several limitations warrant consideration. First, the current model design is based on a ULA geometry and a far-field narrowband signal model. Its performance under near-field conditions, with broadband signals, or on arbitrary array geometries has yet to be verified. Second, although its computational cost is lower than that of many iterative classical methods, it remains higher than that of some DL baselines during training. This is primarily due to the dual-branch structure and the additional attention modules. For application scenarios with strict constraints on training time or on-device learning capability, this could pose a challenge. Third, the current evaluation of the method is based on simulation experiments; the model's performance on measured data requires further validation. Overcoming these bottlenecks is crucial for advancing DACL-Net toward robust and general-purpose DOA estimation in practical systems.
5. Conclusions
This paper presents DACL-Net, a dual-branch attention-based CNN-LSTM network for DOA estimation. The spatial branch employs a 2D-FT to optimize the covariance matrix, causing angular information to appear as peaks in the magnitude of the spatial frequency spectrum. This representation allows the attention mechanisms, commonly used in computer vision, to effectively guide the neural network towards these peaks, thereby enhancing feature discriminability and improving DOA estimation accuracy. The SAM serves as an adaptive filter in the temporal branch, effectively mitigating the impact of noise on time-series signals. The deep features extracted from the two branches are fused through a linear layer to output the final DOA estimation results. Experimental results demonstrate the superior performance of DACL-Net, especially in low SNR environments.
This work primarily focuses on the ULA under far-field narrowband assumptions. Extending the proposed framework to more general array geometries, such as uniform circular arrays (UCAs) or non-uniform arrays, and adapting it to near-field or wideband scenarios, present important directions for future research. The current performance evaluation is based on simulations. Thus, validation with real-world measured data is a crucial next step to assess the model’s practicality and robustness.