1. Introduction
Direction-of-arrival (DOA) estimation is a fundamental problem in array signal processing, with critical applications in radar, sonar, wireless communications, and speech processing. Accurately determining the directions of signal sources is essential for target localization, beamforming, and interference suppression in both civilian and military applications. With the evolution of wireless communication systems towards higher frequencies and larger-scale antenna arrays, the demand for accurate and computationally efficient DOA estimation algorithms has become increasingly urgent [1].
Traditional high-resolution DOA estimation algorithms, such as MUSIC [2] and ESPRIT [3], are based on the subspace decomposition of the array covariance matrix. These methods achieve excellent angular resolution when the sample covariance matrix is accurately estimated and the signal and noise subspaces are well separated. However, their performance degrades considerably in adverse scenarios, such as low-SNR environments, short observation intervals, and coherent multipath propagation [4]. Specifically, MUSIC requires both an accurate estimate of the number of sources and a reliable covariance matrix, and thus becomes vulnerable when noise contamination obscures the subspace structure [5]. ESPRIT eliminates spectral searching, but its dependence on rotational invariance makes it sensitive to array calibration errors and mutual coupling effects [6]. Although a number of improved methods have been developed, including covariance reconstruction and denoising strategies prior to subspace decomposition [7], they still inherit the limitations of the subspace-based framework and often increase computational burden.
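To make the subspace framework concrete, the following is a minimal NumPy sketch of MUSIC on a toy single-source ULA. All sizes, names, and noise settings here are illustrative, not the configuration used in this paper.

```python
import numpy as np

def steering(theta_deg, M, d=0.5):
    """ULA steering vector for element spacing d (in wavelengths)."""
    return np.exp(2j * np.pi * d * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

def music_spectrum(R, num_sources, grid_deg):
    """Classic MUSIC: eigendecompose the covariance, keep the noise
    subspace, and invert the projection of each grid steering vector."""
    M = R.shape[0]
    _, eigvec = np.linalg.eigh(R)             # eigenvalues in ascending order
    En = eigvec[:, :M - num_sources]          # noise-subspace basis
    spec = [1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
            for a in (steering(th, M) for th in grid_deg)]
    return np.array(spec)

# Toy example: 8 sensors, 500 snapshots, one source at 20 degrees.
rng = np.random.default_rng(0)
M, L, theta = 8, 500, 20.0
s = (rng.standard_normal((1, L)) + 1j * rng.standard_normal((1, L))) / np.sqrt(2)
n = 0.1 * (rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L)))
X = steering(theta, M)[:, None] @ s + n
R = X @ X.conj().T / L                        # sample covariance matrix
grid = np.arange(-90.0, 90.5, 0.5)
est = grid[np.argmax(music_spectrum(R, 1, grid))]
```

With a well-estimated covariance the spectral peak lands on the true angle; shrinking the snapshot count or raising the noise level in this sketch reproduces the degradation discussed above.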
Most of the above discussion concerns conventional passive ULA-based DOA estimation, which is also the main focus of this work. In a passive ULA system, the received signals are collected directly by a physical linear array, and the associated steering vector is determined solely by the geometry of the receive array. By contrast, MIMO radar employs both transmit and receive arrays, and after waveform separation, the received data can be interpreted through a virtual array model formed jointly by the transmit and receive apertures, typically described using a Kronecker-product-based steering formulation [8]. As a result, ULA-based DOA estimation and MIMO radar DOA estimation differ not only in array representation, but also in signal modeling and data structure. Specifically, the former is established directly on the spatial samples of a physical receive array, whereas the latter is formulated on the basis of a virtual array generated from the transmit-receive configuration. This distinction has led to different model designs and signal processing strategies in the DOA literature. Nevertheless, despite these differences in array formulation, both ULA and MIMO DOA estimation ultimately aim to infer angular information from structured array observations, which has motivated the development of data-driven methods that can learn effective representations directly from measurement data.
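The distinction between the physical and virtual steering models can be shown in a few lines. The sketch below follows the standard monostatic Kronecker-product convention for the MIMO case; the array sizes are illustrative assumptions.

```python
import numpy as np

def ula_steering(theta_deg, M, d=0.5):
    """Physical receive-ULA steering vector (spacing d in wavelengths)."""
    return np.exp(2j * np.pi * d * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

def mimo_virtual_steering(theta_deg, Mt, Mr, dt=0.5, dr=0.5):
    """Monostatic MIMO virtual steering vector: the Kronecker product of
    the transmit and receive steering vectors gives an Mt*Mr-element
    virtual array after waveform separation."""
    return np.kron(ula_steering(theta_deg, Mt, dt),
                   ula_steering(theta_deg, Mr, dr))

a = ula_steering(30.0, M=10)                 # physical array: 10 entries
b = mimo_virtual_steering(30.0, Mt=4, Mr=8)  # virtual array: 32 entries
```

The virtual vector is unit-modulus like the physical one, but its length is the product of the transmit and receive apertures, which is what enlarges the effective aperture in MIMO radar.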
In recent years, deep learning has emerged as a powerful alternative paradigm for DOA estimation, offering improved robustness and significantly reduced computational complexity after an offline training phase [9,10,11]. Liu et al. proposed a complex-valued convolutional network-based DOA estimation method that vectorizes the upper triangular elements of the covariance matrix as input to the network, enabling direct regression of multiple source angles [12]. This approach avoids complex feature engineering and offers low computational latency when the network scale is appropriate. However, this input processing method flattens structured data into a global vector, disrupting the inherent spatial proximity relationships between array elements within the covariance matrix. Furthermore, as the number of array elements increases, the number of network parameters grows quadratically, making the model prone to overfitting and difficult to scale to large arrays [13].
Convolutional neural networks have been increasingly applied to DOA estimation due to their ability to preserve spatial structure [14,15,16,17,18]. Addressing DOA estimation in dynamic scenarios, Burghal et al. proposed a sequential modeling method based on recurrent neural networks (RNNs), utilizing RNNs to model the temporal dependencies between snapshots for tracking moving targets [19]. This approach effectively leverages temporal information across multiple snapshots, demonstrating better tracking performance compared to traditional methods in moving-target scenarios. However, RNNs suffer from vanishing gradient problems when processing long sequences, making it difficult to capture dependencies between distant snapshots [20]. Moreover, this method simply concatenates the array data from each snapshot as input, failing to fully utilize the spatial structure information between elements within the same snapshot, which limits its estimation accuracy under low SNR conditions. The recently introduced Mamba architecture offers linear-time sequence modeling with powerful long-range dependency capture capabilities, providing a promising alternative for sequential processing [21,22,23].
To better position our work within the existing literature, Table 1 summarizes representative deep learning-based DOA estimation methods along with their advantages and disadvantages.
Traditional subspace methods suffer severe performance degradation under low SNR and limited snapshot conditions; fully-connected network-based methods disrupt spatial proximity and incur excessive parameters; and RNN-based methods suffer from vanishing gradients and neglect spatial structure information. To address these limitations, this paper proposes a hybrid deep learning framework integrating ResNet, Mamba, and MLP for DOA estimation with uniform linear arrays. This framework processes the covariance matrix as structured 2D input through a ResNet branch, leveraging its local connectivity to preserve spatial proximity between elements and control the parameter count, thereby addressing the flaws of fully-connected approaches. It performs sequential modeling along the array dimension through a Mamba branch, utilizing a selective state space mechanism to capture long-range dependencies and phase progression patterns, avoiding the vanishing gradient problem of RNNs and compensating for the neglect of spatial structure information. Finally, an MLP layer performs nonlinear fusion and regression on the features extracted by both branches, achieving complementary enhancement of local spatial features and global sequential features. This enables the model to simultaneously utilize both types of information in low SNR scenarios, where signal characteristics are severely corrupted by noise, resulting in more robust DOA estimation performance. The main contributions of this paper are summarized as follows:
A Novel ResNet-Mamba Hybrid Architecture for DOA Estimation: We propose a hybrid deep learning framework that synergistically integrates a ResNet branch for spatial feature extraction and a Mamba branch for sequential feature modeling. The ResNet branch processes the covariance matrix as a 2D image to capture local spatial correlations between array elements, while the Mamba branch treats the array data as a sequence along the sensor dimension to model long-range dependencies and phase progression patterns. This dual-branch design enables comprehensive feature extraction that neither architecture could achieve independently, providing a principled solution for DOA estimation that leverages both spatial and sequential inductive biases.
Feature Fusion with Layer Normalization Optimization: To effectively combine the complementary features from both branches, we design a feature fusion mechanism incorporating layer normalization before concatenation. This normalization step stabilizes training by ensuring that features from different branches have comparable scales and distributions, preventing one branch from dominating the gradient flow. The fused 1024-dimensional representation preserves both local spatial correlations (from ResNet) and global sequential patterns (from Mamba), enabling the subsequent MLP regressor to learn optimal combinations for accurate DOA prediction. This design provides an effective solution for multi-branch feature fusion in DOA estimation.
Comprehensive Experimental Validation and Benchmarking: We conduct extensive experiments on simulated datasets with a 10-element uniform linear array across a wide range of SNR conditions (−5 dB to 10 dB) and snapshot settings. The proposed fusion model is systematically compared against traditional methods (MUSIC, ESPRIT) and deep learning baselines. Results demonstrate that our model achieves superior accuracy, particularly in the low SNR regime, with significant RMSE improvements compared to existing approaches. We further validate the model’s robustness through ablation studies and snapshot efficiency evaluation, providing comprehensive empirical evidence for the effectiveness of the proposed approach.
3. Experiments and Results
The overall architecture of the proposed hybrid deep learning framework is illustrated in Figure 1.
Detailed illustration of Figure 1: The proposed framework consists of the following main modules:
1. Input Data and Preprocessing: The received signals from the uniform linear array (ULA) first undergo covariance matrix estimation, producing a 10 × 10 complex covariance matrix. The real and imaginary parts are then separated and stacked to form a 3D tensor of size 2 × 10 × 10. For the ResNet branch, this tensor is bilinearly interpolated to a fixed spatial size. For the Mamba branch, the tensor is reshaped into a 2D sequential format with one sequence step per array element.
2. ResNet Feature Extraction (Spatial Features): The ResNet module extracts spatial features from the interpolated covariance matrix. The first convolutional layer with batch normalization and ReLU activation, followed by max pooling, reduces the spatial resolution of the input. It then passes through three stages of bottleneck residual blocks with progressively increasing channel counts. A global average pooling layer then reduces the spatial features to a 768-dimensional vector.
3. Mamba Block (Sequential Features): The Mamba block processes sequential features through multiple Mamba layers. The input sequence is first projected to a 128-dimensional embedding via a linear layer. After passing through four stacked Mamba blocks, global mean pooling and max pooling are applied across the sequence dimension, and their results are concatenated to produce a 256-dimensional feature vector.
4. Feature Concatenation and Fusion: The outputs from the ResNet module (768 dimensions) and the Mamba module (256 dimensions) are each normalized by layer normalization, then concatenated to form a 1024-dimensional fused feature vector.
5. MLP Regression Head: The fused features are passed through an MLP regression head consisting of two hidden layers (1024 → 512 → 256) with LayerNorm, GELU activation, and Dropout for regularization. The final linear layer (256 → 2) outputs the DOA estimates.
6. Output: The network outputs the estimated DOA angles directly as regression values without spectral peak searching.
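The preprocessing stage in item 1 above can be sketched as follows. The sequence layout for the Mamba branch (one step per sensor, with the real and imaginary parts of that sensor's covariance row concatenated) is an assumption consistent with the stated 10-step sequence, and the ResNet interpolation step is omitted because its target size is not specified here.

```python
import numpy as np

def preprocess(X):
    """Map raw ULA snapshots X (M x L, complex) to the two branch inputs:
    a 2 x M x M real/imaginary stack for the ResNet branch, and an
    M-step sequence (one step per sensor) for the Mamba branch."""
    M, L = X.shape
    R = X @ X.conj().T / L                          # sample covariance, M x M
    tensor = np.stack([R.real, R.imag])             # 2 x M x M
    seq = np.concatenate([R.real, R.imag], axis=1)  # M x 2M sequence
    return tensor, seq

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 256)) + 1j * rng.standard_normal((10, 256))
tensor, seq = preprocess(X)
```

Because the sample covariance is Hermitian, its real part is symmetric and its imaginary part antisymmetric, so the stacked tensor carries the full matrix without loss.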
3.1. Implementation Details for Reproducibility
To ensure the reproducibility of our proposed method, we provide detailed specifications of the training dataset composition and model parameter counts.
3.1.1. Training Dataset Composition
The dataset consists of 60,000 simulated samples generated using a 10-element uniform linear array with half-wavelength spacing. The key parameters are as follows:
Number of sources: Variable
SNR range: −5 dB to 10 dB (values include −5, 0, 5, and 10 dB)
Number of snapshots: 256 per sample
DOA angles: Randomly sampled from the range [−60°, 60°]
Training/validation split: 90% for training and 10% for validation
Batch size: 32
Number of epochs: 100
Learning rate: 0.0001 with cosine annealing scheduler
Optimizer: AdamW with weight decay
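A sample under the listed configuration can be generated as below. `make_sample` is an illustrative helper, not the authors' generator, and assumes uncorrelated unit-power sources in complex white Gaussian noise.

```python
import numpy as np

def make_sample(doas_deg, snr_db, M=10, L=256, rng=None):
    """One simulated sample: M-element half-wavelength ULA, L snapshots,
    uncorrelated unit-power sources at doas_deg, noise power set by SNR."""
    rng = rng or np.random.default_rng()
    K = len(doas_deg)
    A = np.exp(2j * np.pi * 0.5 * np.outer(np.arange(M),
                                           np.sin(np.deg2rad(doas_deg))))
    S = (rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))) / np.sqrt(2)
    sigma = 10.0 ** (-snr_db / 20.0)        # per-source SNR = 1 / sigma^2
    N = sigma * (rng.standard_normal((M, L)) +
                 1j * rng.standard_normal((M, L))) / np.sqrt(2)
    return A @ S + N                        # M x L snapshot matrix

X = make_sample([-10.0, 25.0], snr_db=0.0, rng=np.random.default_rng(2))
```

Sweeping `snr_db` over {−5, 0, 5, 10} and drawing DOAs uniformly from [−60°, 60°] reproduces the dataset composition listed above.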
3.1.2. Model Parameter Counts
The detailed model configurations and trainable parameter counts for each module of the proposed fusion model are summarized in Table 2, Table 3, Table 4, and Table 5, respectively.
3.2. Results for Single Source Scenario
To evaluate the performance of the proposed model in single-source scenarios, we conducted comprehensive experiments across four different SNR levels (−5, 0, 5, and 10 dB) and computed the corresponding RMSE of the estimated DOAs. The results are presented in Figure 2.
Furthermore, to evaluate the performance under different snapshot conditions, we conducted experiments with four distinct snapshot settings (1024, 512, 256, and 128) and computed the corresponding RMSE. The results, presented in Figure 3, clearly demonstrate the impact of snapshot number on estimation accuracy.
3.3. Results for Two-Source Scenario
The performance of the proposed model in two-source scenarios was evaluated through comprehensive experiments across four different SNR levels (−5, 0, 5, and 10 dB), with the corresponding RMSE of the estimated DOAs computed accordingly.
Figure 4 summarizes these results, clearly demonstrating the relationship between SNR and estimation accuracy.
To visually examine the estimation behavior, we plot the scatter distribution of the estimated DOAs for 400 two-source samples at −5, 0, 5, and 10 dB, as depicted in Figure 5.
The impact of snapshot number on estimation accuracy is examined in Figure 6, which reports the RMSE under four distinct snapshot settings (1024, 512, 256, and 128). The results confirm that the proposed method maintains robust performance even with limited snapshots.
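Computing RMSE in multi-source scenarios requires pairing each estimate with a ground-truth angle. The text does not state the pairing rule used, so the sketch below adopts the common best-permutation convention as an assumption.

```python
import numpy as np
from itertools import permutations

def matched_rmse(est, true):
    """RMSE under the error-minimizing assignment of estimates to
    sources (best permutation), a common multi-source convention."""
    est = np.asarray(est, dtype=float)
    true = np.asarray(true, dtype=float)
    best = min(np.mean((est[list(p)] - true) ** 2)
               for p in permutations(range(len(true))))
    return float(np.sqrt(best))

# The ordering of the two estimates does not affect the result:
r = matched_rmse([20.1, -9.8], [-10.0, 20.0])
```

Without such matching, a harmless swap of two estimates would be scored as a gross error, inflating the reported RMSE.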
To establish a theoretical benchmark, we compute the Cramér-Rao Lower Bound (CRLB) for the two-source scenario following [24]. The CRLB provides a lower bound on the variance of any unbiased estimator, serving as a fundamental performance limit for DOA estimation. For a uniform linear array with $M$ sensors, $L$ snapshots, and two uncorrelated far-field sources, the CRLB for the $i$-th DOA parameter is given by:
$$\mathrm{CRLB}(\theta_i) = \frac{\sigma^2}{2L}\left\{\operatorname{Re}\left[\left(\mathbf{D}^H \mathbf{P}_{\mathbf{A}}^{\perp}\mathbf{D}\right)\odot\left(\mathbf{P}\mathbf{A}^H\mathbf{R}^{-1}\mathbf{A}\mathbf{P}\right)^{T}\right]\right\}^{-1}_{ii},$$
where $\sigma^2$ is the noise variance, $L$ is the number of snapshots, $\mathbf{A}=[\mathbf{a}(\theta_1),\mathbf{a}(\theta_2)]$ is the $M\times 2$ array manifold matrix, $\mathbf{D}$ is the matrix of steering vector derivatives with respect to each DOA, $\mathbf{P}_{\mathbf{A}}^{\perp}=\mathbf{I}-\mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ is the orthogonal projector onto the noise subspace, $\mathbf{P}$ is the source covariance matrix, $\odot$ denotes the Hadamard product, and $\mathbf{R}=\mathbf{A}\mathbf{P}\mathbf{A}^H+\sigma^2\mathbf{I}$ is the array covariance matrix. The RMSE of any unbiased estimator satisfies $\mathrm{RMSE}(\hat{\theta}_i)\ge\sqrt{\mathrm{CRLB}(\theta_i)}$. The CRLB curves are obtained by averaging over multiple random realizations of source angles. To comprehensively evaluate the performance of different methods in two-source scenarios, we conducted comparative experiments from both estimation accuracy and reliability perspectives.
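The bound can be evaluated numerically as below. This is a hedged sketch assuming uncorrelated unit-power sources, so the source covariance is the identity and the array covariance follows from the signal model; symbol names mirror the expression above.

```python
import numpy as np

def crlb_rmse_deg(doas_deg, snr_db, M=10, L=256, d=0.5):
    """Evaluate the two-source CRLB expression for uncorrelated
    unit-power sources; returns the per-source RMSE bound in degrees."""
    th = np.deg2rad(np.asarray(doas_deg, dtype=float))
    m = np.arange(M)[:, None]
    A = np.exp(2j * np.pi * d * m * np.sin(th))          # array manifold
    D = A * (2j * np.pi * d * m * np.cos(th))            # d a(theta)/d theta
    sigma2 = 10.0 ** (-snr_db / 10.0)                    # noise variance
    P = np.eye(len(th))                                  # source covariance
    R = A @ P @ A.conj().T + sigma2 * np.eye(M)          # array covariance
    PAperp = np.eye(M) - A @ np.linalg.solve(A.conj().T @ A, A.conj().T)
    H = (D.conj().T @ PAperp @ D) * (P @ A.conj().T @ np.linalg.solve(R, A) @ P).T
    C = sigma2 / (2 * L) * np.linalg.inv(np.real(H))     # CRLB in rad^2
    return np.degrees(np.sqrt(np.diag(C)))               # RMSE bound in deg

bound = crlb_rmse_deg([-10.0, 20.0], snr_db=0.0)
```

Averaging `crlb_rmse_deg` over random angle draws yields the CRLB curves used as the benchmark; the bound tightens as either the SNR or the snapshot count grows.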
Figure 7 compares the RMSE of different methods across various SNR conditions, including MUSIC, ESPRIT, IQ-ResNet, the method by Zheng et al. [25], the proposed method, and the derived CRLB, highlighting the superior estimation accuracy of our approach. Unless otherwise specified, a fixed minimum angular separation is maintained between the two sources, and all subsequent data generation follows this configuration. In addition, we assess the reliability of angle estimation using a strict success criterion that requires both DOA estimates to have absolute errors below 0.5°. The resulting accuracy curves are presented in Figure 8.
To evaluate the contribution of each module, we conducted an ablation experiment. Specifically, we compared the full model against two single-branch variants: (1) ResNet-only; and (2) Mamba-only. All variants were trained and tested under identical conditions. The results are shown in Figure 9.
MUSIC is a spectral peak search method, and its angle estimation performance is limited by the predefined search grid and the requirement of prior knowledge of the number of sources. When two sources are closely spaced, the spectral peaks tend to merge, leading to significant estimation errors. The proposed deep learning method also requires prior knowledge of the number of sources, but it learns to regress the DOA angles directly from the covariance matrix. To evaluate the resolution capability of both methods under challenging conditions, we conducted closely-spaced source estimation experiments. Specifically, we randomly generated 100 two-source scenarios for each of five angular separations, with both source angles sampled uniformly while maintaining the specified separation between them. The SNR was uniformly distributed across the range of −5 dB to 10 dB, and 1024 snapshots were used for each sample. MUSIC was applied with a dense search grid. Table 6 reports the RMSE averaged over the 100 scenarios for each angular separation, across all SNR conditions. The results demonstrate that the proposed method maintains a consistently low RMSE even at very small separations, whereas MUSIC suffers severe performance degradation due to merging spectral peaks at closely spaced angles.
It should be noted that the current model is specifically trained on data generated from a 10-element uniform linear array (ULA), and therefore cannot be directly applied to ULAs with a different number of elements without retraining. This is because both the ResNet branch (which expects a fixed input size after interpolation) and the Mamba branch (which operates on a fixed sequence length of 10) are inherently tied to the number of array elements used during training. However, the proposed architecture is general and can be readily adapted to other element counts by simply adjusting the model structure and retraining on data generated from the corresponding array setup. To validate this flexibility, we trained separate instances of the proposed model for 9, 10, 11, and 12 array elements. As shown in Figure 10, the RMSE performance improves consistently as the number of array elements increases, which is expected due to the greater spatial diversity provided by larger arrays. This demonstrates that the proposed framework can be successfully extended to different element counts with minimal architectural modifications.
3.4. Results for Three-Source Scenario
Three-source RMSE results are shown in Figure 11.
The estimation characteristics of the proposed model in three-source scenarios are further evaluated through the scatter distribution of the estimated DOAs at four different SNR levels (−5, 0, 5, and 10 dB), as shown in Figure 12. These scatter plots provide an intuitive visualization of the estimation performance under varying noise conditions, clearly illustrating how the estimated angles converge toward the true values as SNR increases. Even at low SNR levels, the estimates form distinct clusters without significant bias, demonstrating the robustness and reliability of the proposed method.
For the three-source scenario, the influence of snapshot number on estimation accuracy is illustrated in Figure 13, which presents the RMSE results across four snapshot settings (1024, 512, 256, and 128). The results demonstrate that the proposed method maintains robust performance even with limited snapshot numbers.
Figure 14 compares the RMSE of different methods in three-source scenarios, including MUSIC, ESPRIT, IQ-ResNet, the method by Zheng et al. [25], and the proposed method, highlighting the superior accuracy of our approach.
Figure 15 presents the success rates under a strict criterion requiring all three DOA estimates to have absolute errors below 0.5°, confirming that our method maintains consistently higher success rates across all SNR levels.
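The strict success criterion can be computed as below. Matching estimates to sources by sorting both sets is an assumption made for this sketch; the criterion itself (every absolute error below 0.5°) follows the text.

```python
import numpy as np

def success_rate(est, true, tol_deg=0.5):
    """Fraction of trials in which every estimate lies within tol_deg
    of its source, after sorting estimates and ground truth (assumed
    matching rule for this sketch)."""
    est = np.sort(np.asarray(est, dtype=float), axis=1)
    true = np.sort(np.asarray(true, dtype=float), axis=1)
    return float(np.mean(np.all(np.abs(est - true) < tol_deg, axis=1)))

# One trial succeeds (all errors <= 0.2 deg), one fails (0.9 deg error):
rate = success_rate([[10.2, -5.1, 30.0], [10.9, -5.0, 30.0]],
                    [[-5.0, 10.0, 30.0], [-5.0, 10.0, 30.0]])
```

Because a single miss fails the whole trial, this metric penalizes dropped or confused sources more sharply than RMSE does.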
4. Discussion
The experimental results validate the effectiveness of the proposed ResNet-Mamba-MLP fusion framework for the DOA estimation task. The method demonstrates robust performance across an SNR range of −5 dB to 10 dB, maintaining a low Root Mean Square Error (RMSE) even under low SNR conditions. This showcases the model’s reliable estimation capability when signals are severely contaminated by noise. Furthermore, the RMSE analysis in single-source scenarios indicates that the proposed method achieves stable angle estimation under various SNR conditions.
Scatter plot analysis reveals the estimation characteristics of the model in multi-source scenarios. In the two-source case, under low SNR conditions, the estimated values are concentrated near the diagonal line without systematic bias, and they gradually converge as the SNR increases. This confirms the model’s ability to extract precise angular information from noisy observations. The scatter distribution in the three-source scenario further validates the generalization capability of the proposed method: even under a low SNR of −5 dB, the estimates for the three sources still form clear clusters, without the source loss or angular confusion commonly observed in traditional methods. As the SNR increases, the estimated points gradually converge towards the true values, and the scatter distribution becomes more compact. This indicates that the dual-branch architecture can effectively manage the mutual interference among multiple sources, maintaining good angular resolution even in complex signal environments.
Comparative analysis shows that the performance advantage of the proposed method is most pronounced in the most challenging scenarios. Under low-to-medium SNR conditions, the proposed method significantly outperforms both traditional methods such as MUSIC and ESPRIT and existing deep learning baselines, with a particularly notable reduction in estimation error. Only in the high-SNR region, such as when the SNR reaches 10 dB and the covariance estimate is sufficiently accurate, do the classical methods achieve performance comparable to the proposed method.
5. Conclusions
In this paper, a ResNet-Mamba hybrid deep learning framework for DOA estimation has been proposed. A dual-branch collaborative architecture was designed to jointly model spatial structure and sequential dependencies. The covariance matrix was processed as a 2D image by the ResNet branch to extract local spatial correlations, while sequential modeling along the array dimension was performed by the Mamba branch to capture long-range dependencies and phase progression patterns via its selective state-space mechanism. The complementary features were fused after layer normalization and passed to an MLP for DOA regression. To address the performance degradation of traditional methods in low SNR environments, robust estimation under noise-contaminated conditions was achieved by fusing local spatial features with global sequential dependencies. Extensive experiments on a 10-element uniform linear array dataset demonstrated that the proposed model significantly outperforms MUSIC, ESPRIT, and deep learning baselines in low SNR and limited snapshot scenarios, with substantial gains in estimation accuracy and success probability. Under higher SNR conditions, classical methods remained competitive, while comparable performance was maintained by the proposed framework. Evaluation based on strict success criteria further validated the method’s reliability across SNR levels, and sensitivity analysis confirmed robust performance under varying array configurations. The success of the proposed approach was attributed to the complementary nature of the dual-branch architecture, validated by ablation studies showing that the fusion of local spatial features and global sequential dependencies yields synergistic gains. Training stability and information preservation were ensured by the layer-normalized feature fusion mechanism, and favorable computational efficiency was demonstrated. 
Future work will focus on incorporating physical constraints, validating the model on real-world experimental data, and developing lightweight variants for edge deployment.