1. Introduction
In recent years, with the rapid development of the emerging low-altitude economy, the angular position information of aerial vehicles such as unmanned aerial vehicles (UAVs) is drawing attention in various systems and services. Because active radars adapt poorly to complex urban electromagnetic environments, and because the small radar cross-section of ordinary UAVs is difficult to capture, passive localization technology has gradually become the preferred solution for UAV localization [1]. Among the passive schemes, angle estimation based on a sound source is regarded as a key method. Sound source localization using microphone array technology [2] has become an important way to obtain critical information in complex environments [3]: signals are collected by an array composed of multiple sensors, and the incidence angle of the sound source is then calculated, providing effective support for real-time localization [4].
Traditional direction of arrival (DOA) estimation techniques can be categorized into four types: beamforming methods [5,6], subspace decomposition methods, sparse reconstruction-based methods, and minimum variance-based methods.
Beamforming methods are fundamentally constrained by the Rayleigh resolution limit, which restricts their ability to resolve closely spaced sources [7]. To overcome this limitation and enhance estimation accuracy, a class of algorithms based on eigenvalue decomposition (EVD) [8] has been developed, namely subspace decomposition methods, including the well-known multiple signal classification (MUSIC) and the estimation of signal parameters via rotational invariance techniques (ESPRIT) [9,10]. The MUSIC algorithm performs EVD [11] on the covariance matrix of the received signals to separate the signal subspace and the noise subspace, which are mutually orthogonal [12]. By exploiting this orthogonality, MUSIC constructs a spatial spectrum in which the peaks correspond to the directions of arrival; the spectrum is searched to identify these peaks, yielding high-resolution DOA estimates. In contrast, the ESPRIT algorithm [13] leverages a specially designed array geometry composed of two identical subarrays with a known, constant displacement between corresponding elements. This configuration induces a rotational invariance: an incoming plane wave produces the same phase shift between every pair of corresponding elements across the two subarrays. By solving the resulting generalized eigenvalue problem, the DOAs of the incident signals can be accurately estimated without a spectral search [14].
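To make the subspace idea concrete, the following minimal NumPy sketch computes a MUSIC spatial spectrum for a hypothetical uniform linear array; the half-wavelength spacing, scan grid, and known source count are illustrative assumptions rather than the configuration used in this paper.

```python
# Minimal MUSIC sketch for a uniform linear array (illustrative only).
import numpy as np

def music_spectrum(X, n_sources, d=0.5, grid=np.linspace(-90, 90, 361)):
    """X: (M, N) array of M-element snapshots; d: spacing in wavelengths."""
    M, N = X.shape
    R = X @ X.conj().T / N                    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)      # EVD, eigenvalues in ascending order
    En = eigvecs[:, : M - n_sources]          # noise-subspace eigenvectors
    P = np.empty_like(grid, dtype=float)
    for i, theta in enumerate(grid):
        a = np.exp(-2j * np.pi * d * np.arange(M) * np.sin(np.radians(theta)))
        # Orthogonality of a(theta) to the noise subspace produces spectrum peaks.
        P[i] = 1.0 / np.abs(a.conj() @ En @ En.conj().T @ a)
    return grid, P                            # peaks of P give the DOA estimates
```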
Beyond subspace-based techniques such as MUSIC and ESPRIT, various other classes of algorithms [15] have been proposed to further improve DOA estimation performance under challenging conditions, such as a low signal-to-noise ratio (SNR), limited snapshots, or coherent sources.
Sparse reconstruction-based methods, inspired by compressive sensing theory, exploit the inherent sparsity of signal sources in the angular domain [16]. By discretizing the angle space into a fine grid, DOA estimation is formulated as a sparse signal recovery problem [17]. Techniques such as L1-norm minimization and orthogonal matching pursuit (OMP) have been widely adopted in this context. These methods can provide super-resolution DOA estimates even with a limited number of array elements or snapshots. However, their performance depends heavily on the grid resolution, and they may suffer from basis mismatch errors when the true DOA does not align with the predefined grid.
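As a hedged illustration of the on-grid formulation, the sketch below runs orthogonal matching pursuit over a discretized angle grid for a single snapshot; the array model and scan grid are the same illustrative assumptions as in the MUSIC sketch above.

```python
# Minimal OMP sketch for on-grid sparse DOA recovery (illustrative only).
import numpy as np

def omp_doa(y, M, n_sources, d=0.5, grid=np.linspace(-90, 90, 361)):
    """y: (M,) single snapshot; returns the selected grid angles in degrees."""
    # Overcomplete steering dictionary: one column per candidate angle.
    A = np.exp(-2j * np.pi * d * np.outer(np.arange(M), np.sin(np.radians(grid))))
    residual, support = y.copy(), []
    for _ in range(n_sources):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))  # best-matching atom
        As = A[:, support]
        s, *_ = np.linalg.lstsq(As, y, rcond=None)  # least-squares re-fit on support
        residual = y - As @ s
    return grid[support]
```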
Minimum variance-based methods [18], such as the minimum variance distortionless response (MVDR) beamformer, aim to suppress interference and noise while maintaining unity gain in the desired signal direction [19]. MVDR constructs an adaptive spatial filter that minimizes the output power subject to a distortionless constraint in the steering direction. While MVDR offers improved resolution over conventional beamforming, its performance degrades in the presence of array calibration errors or coherent sources.
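For completeness, the standard MVDR solution can be stated explicitly. Minimizing the output power $\mathbf{w}^{H}\mathbf{R}\mathbf{w}$ subject to the distortionless constraint $\mathbf{w}^{H}\mathbf{a}(\theta) = 1$ yields

$$\mathbf{w}_{\mathrm{MVDR}} = \frac{\mathbf{R}^{-1}\mathbf{a}(\theta)}{\mathbf{a}^{H}(\theta)\mathbf{R}^{-1}\mathbf{a}(\theta)}, \qquad P_{\mathrm{MVDR}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\mathbf{R}^{-1}\mathbf{a}(\theta)},$$

where $\mathbf{R}$ is the covariance matrix of the received signals and $\mathbf{a}(\theta)$ is the steering vector; scanning $P_{\mathrm{MVDR}}(\theta)$ over candidate angles produces the adaptive spatial spectrum.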
With high computational capacity and advanced modeling capabilities, deep learning technology has shown significant research potential in the field of sound source localization [20]. Its advantage is that it can automatically extract effective information from a large amount of raw data without relying on hand-crafted features. Meanwhile, under strong noise and reverberation conditions, deep learning models can achieve robust estimation of the sound source location through data-driven methods, outperforming traditional methods [21]. The main approaches include classification models, Transformer and hybrid architectures, and end-to-end raw signal modeling.
Currently, most deep learning-based DOA estimation methods use classification models for angle prediction [22]. For example, some studies have achieved effective prediction of the true angle of arrival using deep neural networks (DNNs) trained with the lower triangular elements of the covariance matrix as input features [23]. With the widespread application of convolutional neural networks (CNNs), researchers can further learn spectral features through a CNN to improve model performance and training efficiency [24]. Subsequently, various CNN-based deep frameworks have been proposed, such as DeepMUSIC [25], which divides the DOA space into multiple sub-regions and estimates them separately. In parallel, regression-based networks have gained traction for their continuous angle estimation, which eliminates the quantization errors inherent in classification approaches. Techniques like multi-layer perceptron (MLP) regression heads and end-to-end convolutional regression architectures demonstrate sub-degree precision in 2D azimuth-elevation estimation [26], proving especially valuable for full-spatial field analysis. To improve the estimation accuracy, some researchers propose extracting the real part, imaginary part, and phase of the covariance matrix of the multi-channel signals and splicing them as separate inputs, enhancing the DOA estimation ability under low-SNR conditions [27].
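A minimal sketch of this covariance-based input construction, under the assumption that the three components are stacked as separate channels, is as follows.

```python
# Illustrative covariance-feature input, as described in [27] (assumed layout).
import numpy as np

def covariance_features(X):
    """X: (M, N) complex array snapshots -> (3, M, M) feature tensor."""
    R = X @ X.conj().T / X.shape[1]            # sample covariance matrix
    return np.stack([R.real, R.imag, np.angle(R)])  # real, imaginary, phase channels
```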
With the widespread adoption of Transformer networks, researchers have also applied them to the DOA field [28], giving rise to Transformer and hybrid architectures [29], in which self-attention mechanisms capture long-range dependencies across sensor arrays and enable joint spatial–spectral feature learning. These architectures have contributed to super-resolution DOA estimation, with recent variants such as cross-attention Transformers achieving real-time performance while maintaining millimeter-level accuracy in near-field scenarios [30].
However, the above methods inevitably rely on the traditional DOA processing flow in the data preprocessing stage, which not only increases the computational overhead but may also lose some feature information, thus affecting model performance. In this regard, some studies have proposed directly inputting the original in-phase (I) and quadrature (Q) signals into the neural network, namely end-to-end raw signal modeling, to reduce the preprocessing cost and improve the estimation accuracy, and have achieved good results [31]. Notably, these end-to-end approaches have catalyzed advances in lightweight 2D DOA systems for UAVs. Mature frameworks now integrate time-domain convolutional networks (TCNs) with spatial covariance features, enabling real-time azimuth and elevation tracking on embedded platforms [32]. Such systems compensate for UAV ego-noise through adversarial training and leverage multi-array spatial diversity to resolve front-back ambiguities, which is critical for navigation in complex urban environments [33].
Based on end-to-end raw signal modeling, this paper proposes a new neural network model named TSNetIQ. This study takes the original audio signals as input and models the DOA estimation task as a regression problem, so that the model outputs continuous angle values between 0 and 360°, avoiding the error caused by angle discretization in traditional classification methods. In addition, the regression model is lighter and has a higher angular resolution than a classification model. In terms of network structure design, the advantages of the Transformer and CNN are combined, and a squeeze-and-excitation (SE) module is introduced to enhance the modeling of inter-channel information [34], thereby further improving the accuracy of DOA estimation. The pyroomacoustics package [35] (under Python 3.12.9) is used to construct a realistic acoustic simulation environment for synthesizing training datasets, thereby enhancing the credibility of the experimental results. Meanwhile, this article systematically analyzes the performance of the proposed network under different SNRs, sampling frequencies, and numbers of array elements, and the results show that the proposed method has stronger generalization ability and smaller estimation error than traditional CNN and ResNet models trained with the same input mode.
The main contributions of this study are as follows:
- A novel DOA estimation neural network architecture based on raw input, named TSNetIQ, is proposed. The architecture combines the structural advantages of the Transformer and CNN, and incorporates an SE module to enhance the feature extraction ability, which significantly improves the accuracy of DOA estimation.
- By reformulating the DOA problem as a regression task, the proposed algorithm directly predicts continuous angle values between 0 and 360°, avoiding boundary classification errors. This strategy can offer higher resolution and a smaller model size.
- A dynamic simulation environment is constructed by rotating a microphone array in an anechoic chamber to simulate sound source reception at various angles (see the sketch after this list). This procedure generates diverse datasets with random angles, enhancing the validity and interpretability of the training process.
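A minimal sketch of such a simulation with pyroomacoustics is shown below; the room size, sampling rate, source position, array radius, and element count are illustrative assumptions, not the exact settings used to generate the datasets.

```python
# Illustrative anechoic simulation with pyroomacoustics (assumed parameters).
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox([10.0, 10.0], fs=fs, max_order=0)  # max_order=0: no reflections
t = np.arange(fs) / fs
room.add_source([8.0, 5.0], signal=np.sin(2 * np.pi * 2000 * t))  # 2000 Hz tone

# Rotating the array by a random angle is equivalent to drawing a random DOA.
phi0 = np.random.uniform(0, 2 * np.pi)
R = pra.circular_2D_array(center=[5.0, 5.0], M=4, phi0=phi0, radius=0.05)
room.add_microphone_array(pra.MicrophoneArray(R, fs))

room.simulate()
signals = room.mic_array.signals  # (4, N) raw multichannel recordings
```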
3. Proposed DOA Estimation Method
3.1. Preprocessing and IQ Extraction
Unlike traditional DOA estimation methods based on covariance matrices, this paper adopts a strategy of directly using the IQ components of the received signals as the input. For the original signal $x_m(n)$, the discrete signal received by the $m$-th array element, its Q signal $\hat{x}_m(n)$ can be obtained using the discrete Hilbert transform, defined as

$$\hat{x}_m(n) = \frac{2}{\pi} \sum_{\substack{k=-\infty \\ k\ \mathrm{odd}}}^{\infty} \frac{x_m(n-k)}{k}.$$

Furthermore, the analytic signal $z_m(n)$ can be derived as follows:

$$z_m(n) = x_m(n) + j\hat{x}_m(n),$$

where $x_m(n)$ represents the I signal, which is numerically equal to the original signal. After that, the I and Q components are obtained by

$$I_m(n) = \operatorname{Re}\{z_m(n)\} = x_m(n), \qquad Q_m(n) = \operatorname{Im}\{z_m(n)\} = \hat{x}_m(n).$$

Then, the IQ components of all array elements are sequentially concatenated to form the input matrix, which is expressed as

$$\mathbf{X}_{\mathrm{IQ}} = \begin{bmatrix} \mathbf{I}_1 & \mathbf{Q}_1 & \mathbf{I}_2 & \mathbf{Q}_2 & \cdots & \mathbf{I}_M & \mathbf{Q}_M \end{bmatrix}^{T} \in \mathbb{R}^{2M \times N},$$

where $M$ is the number of array elements and $N$ is the number of samples, so the matrix has a dimension of $2M \times N$. Using this approach, directional information is directly extracted from the raw signals.
Following the extraction of the IQ components, a dedicated preprocessing pipeline is applied to prepare the data for neural network input. First, a fourth-order Butterworth bandpass filter is applied to preserve frequency components near 2000 Hz while removing out-of-band frequencies, thereby eliminating external noise and interference, since the acoustic source emits a pure 2000 Hz sinusoidal tone. The denoised signal is then transformed into its analytic representation using the Hilbert transform to extract the IQ components, and the real and imaginary parts are concatenated to form a $2M \times N$ matrix. Prior to network input, all angle labels are normalized to the range from 0 to 1, and the features are uniformly scaled, mitigating the influence of potential variations in signal amplitude across different recordings and ensuring stable gradient behavior during neural network optimization. Finally, the IQ components and the normalized angles are paired to form the input samples for the model.
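The following sketch summarizes this pipeline with SciPy; the sampling rate and the passband edges around 2000 Hz are illustrative assumptions.

```python
# Sketch of the preprocessing pipeline: bandpass filter, Hilbert transform,
# IQ stacking (assumed sampling rate and band edges).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def preprocess(x, fs=16000, band=(1800.0, 2200.0)):
    """x: (M, N) raw array signals -> (2M, N) interleaved IQ matrix."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    x = filtfilt(b, a, x, axis=-1)       # 4th-order Butterworth bandpass
    z = hilbert(x, axis=-1)              # analytic signal via the Hilbert transform
    iq = np.empty((2 * x.shape[0], x.shape[1]))
    iq[0::2], iq[1::2] = z.real, z.imag  # rows I_1, Q_1, ..., I_M, Q_M
    return iq / np.max(np.abs(iq))       # uniform amplitude scaling

# Angle labels are mapped to [0, 1] to match the Sigmoid output head:
# theta_norm = theta_deg / 360.0
```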
3.2. DOA Estimation Using TSNetIQ
CNN is a deep learning model specifically designed to process data with grid-like structures, and it is widely applied to feature extraction from data such as images and audio. A CNN extracts local features through convolutional layers and uses parameter sharing and local connectivity to significantly reduce the number of model parameters, thereby improving computational efficiency and generalization ability.
The neural network in this study is designed with a CNN as the main body: an angle regression model, named TSNetIQ, which integrates CNN and Transformer mechanisms. The goal is to extract spatiotemporal features from multi-channel signals and achieve high-precision direction estimation. The overall architecture consists of five 2D convolutional layers, an SE attention module, positional encoding, a Transformer encoder, and a fully connected output layer. Unlike traditional CNN architectures, the model avoids excessive use of pooling layers, retaining only the pooling layer in the SE module and the average pooling layer at the tail of the model, as additional pooling layers would degrade its performance. A three-layer 2D convolutional structure was also tested, but it exhibited insufficient feature extraction capability, leading to inferior performance compared with the five-layer structure. Taking the case of 4 array elements as an example, the specific structure of the TSNetIQ network is as follows.
As shown in Figure 2, the front end of the network is composed of five 2D convolutional layers, which extract the local spatial features of the signal layer by layer. The first layer uses a convolution kernel of size (8,7) to enhance joint time–frequency perception, while the remaining four convolutional layers adopt progressively smaller kernels ((1,5), (1,3), (1,3), (1,1)), combined with a 1 × 2 stride to control the down-sampling rate. Each convolutional layer is followed by an activation layer and a normalization layer, where the normalization layer uses the BatchNorm2d function and the activation layer uses the ReLU function, $\mathrm{ReLU}(x) = \max(0, x)$, to accelerate convergence and improve the nonlinear expression capability.
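A PyTorch sketch of this front end is given below; the paper specifies the kernel sizes and the 1 × 2 stride, so the channel widths and the Conv-BatchNorm-ReLU ordering are assumptions.

```python
# Sketch of the five-layer convolutional front end (assumed channel widths).
import torch.nn as nn

def conv_block(c_in, c_out, kernel, stride=(1, 2)):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=kernel, stride=stride),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Input: (batch, 1, 2M, N) stacked IQ matrix, e.g. height 2M = 8 for 4 elements.
front_end = nn.Sequential(
    conv_block(1, 32, (8, 7)),     # joint time-frequency perception
    conv_block(32, 64, (1, 5)),
    conv_block(64, 128, (1, 3)),
    conv_block(128, 256, (1, 3)),
    conv_block(256, 256, (1, 1)),  # 256 channels to match the encoder d_model
)
```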
After the fifth convolutional layer, an SE attention module is introduced to enhance the sensitivity of the TSNetIQ model to channel features. This module generates attention weights for each channel through global average pooling and a two-layer fully connected network, achieving the adaptive recalibration of features, reinforcing effective features, and suppressing redundant information.
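The SE module follows the standard squeeze-and-excitation design [34]; in the sketch below, the reduction ratio r is an assumed hyperparameter.

```python
# Standard squeeze-and-excitation block (assumed reduction ratio r).
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average pooling
        self.fc = nn.Sequential(              # excitation: two fully connected layers
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # adaptively recalibrate channel features
```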
In the context of deep learning for DOA, modeling the relationships between channels is considered crucial. To capture sequence dependencies in the high-level features extracted by the CNN, a Transformer structure is introduced after the convolutional layers, consisting of a positional encoding layer and two Transformer encoder layers. First, the output of the convolutional layers is reshaped, and learnable positional encoding vectors are added to explicitly incorporate the relative ordering of temporal features and facilitate the modeling of sequential dependencies across time steps. The feature sequence is then input into the two Transformer encoder layers, each with 8 attention heads, an embedding dimension of 256, and a hidden layer size of 512, to capture long-range dependencies, integrate global contextual features across the entire sequence, and learn continuous feature representations for precise value estimation.
The output of the Transformer encoder, which captures global temporal dependencies across the input sequence, is aggregated through a mean pooling operation to obtain a fixed-dimensional representation. This pooled feature is then passed through a linear layer, followed by a Sigmoid function defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

to produce the final output. The Sigmoid function serves a critical role by constraining the model's prediction between 0 and 1, which corresponds to the normalized angular range from 0 to 360°. This bounded output ensures consistency with the preprocessed training targets and prevents the model from generating physically invalid angle predictions (e.g., negative degrees or values exceeding 360°). Furthermore, by keeping the output within a normalized and differentiable range, the Sigmoid function facilitates stable gradient propagation during backpropagation, thereby improving the convergence and generalization performance of the regression model.
This method not only normalizes the output range but also enhances the stability of model training and promotes better convergence.
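A sketch of the Transformer stage and regression head, using the stated hyperparameters (two encoder layers, 8 heads, embedding dimension 256, hidden size 512), is shown below; the sequence length T and the use of PyTorch's built-in encoder layer are assumptions.

```python
# Sketch of the Transformer encoder and regression head (assumed sequence length T).
import torch
import torch.nn as nn

class TransformerHead(nn.Module):
    def __init__(self, d_model=256, T=100):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, T, d_model))  # learnable positional encoding
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=512, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x):                 # x: (batch, T, d_model) reshaped CNN features
        h = self.encoder(x + self.pos[:, : x.size(1)])
        h = h.mean(dim=1)                 # mean pooling over the sequence
        return torch.sigmoid(self.out(h)).squeeze(-1)  # normalized angle in [0, 1]

# Multiply by 360 to recover the angle in degrees: theta = 360 * model(x).
```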
In the regression model, the Huber loss function is used as the loss function of the model, with the following expression:

$$L_{\delta}(\Delta\theta_i) = \begin{cases} \dfrac{1}{2}\,(\Delta\theta_i)^2, & |\Delta\theta_i| \le \delta, \\[4pt] \delta\left(|\Delta\theta_i| - \dfrac{1}{2}\,\delta\right), & |\Delta\theta_i| > \delta, \end{cases}$$

where $\Delta\theta_i$ is the circular angle difference defined as

$$\Delta\theta_i = \min\!\left(\left|\theta_i - \hat{\theta}_i\right|,\; 360^{\circ} - \left|\theta_i - \hat{\theta}_i\right|\right),$$

where $\theta_i$ and $\hat{\theta}_i$ represent the true angle and the estimated angle of the signal in the $i$-th sample, respectively, and $\delta$ denotes a positive factor.
The Huber loss function combines the advantages of the mean squared error (MSE) and the mean absolute error (MAE): for small residuals it behaves like the MSE, while in the presence of outliers it behaves like the MAE. This dynamic adjustment provides an optimization objective that balances accuracy and robustness in DOA regression tasks, making it particularly suitable for scenarios with noise or outlier angles. More detailed network code and simulation code can be accessed via the link provided in the Supplementary Materials at the end of the paper.
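A PyTorch sketch of this loss, with the circular difference computed on normalized angles, is as follows; delta corresponds to the positive factor $\delta$ above.

```python
# Sketch of the circular Huber loss; pred and target are normalized angles in [0, 1].
import torch

def circular_huber_loss(pred, target, delta=1.0):
    diff = torch.abs(pred - target) * 360.0   # absolute difference in degrees
    diff = torch.minimum(diff, 360.0 - diff)  # circular angle difference
    quad = 0.5 * diff ** 2                    # MSE-like branch for small errors
    lin = delta * (diff - 0.5 * delta)        # MAE-like branch for outliers
    return torch.where(diff <= delta, quad, lin).mean()
```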
6. Future Work
Although the proposed TSNetIQ model demonstrates superior performance in single-source DOA estimation under anechoic conditions, several important aspects remain to be addressed in future research. First, real-world acoustic environments often involve multiple simultaneous sound sources, leading to signal superposition and mutual interference. Therefore, extending the current framework to support multi-source localization will be a valuable and necessary enhancement. To achieve this, the model could be adapted to output multiple angle predictions, potentially through sequence modeling or attention-based multi-output regression mechanisms. Second, while our experiments are conducted in idealized anechoic chambers, actual environments frequently contain reverberation caused by reflections from surrounding surfaces. This reverberation can significantly distort the spatial features embedded in the raw signals. Future work will incorporate room impulse response (RIR) modeling into the simulation process using tools such as the image-source method in the pyroomacoustics Python package, thereby enhancing the realism and robustness of the training data.
The current results were achieved with relatively large-scale datasets. However, in practical applications, data efficiency is often a critical constraint. It is of interest to explore how the estimation accuracy and angular resolution of the model can be further improved without expanding the scale of the dataset. This could be approached by incorporating self-supervised learning, data augmentation, or knowledge distillation techniques, aiming to maximize information extraction from limited training data.
Increasing the angular resolution beyond the current 0.1° while maintaining computational feasibility remains an open challenge. Future efforts may consider multi-resolution regression strategies or hierarchical DOA prediction frameworks to achieve finer-grained direction estimation under the same data constraints. In addition, future work will consider the deployment of the proposed model on edge devices. In real-world applications, inference latency and power consumption are critical factors affecting feasibility. Therefore, we plan to adopt techniques such as model pruning, quantization, and a lightweight architecture design to reduce model complexity and improve runtime efficiency, aiming to enable real-time DOA estimation on embedded systems or mobile platforms.
The current study focuses solely on azimuthal DOA estimation within a single plane. To enhance the system’s spatial awareness, future efforts will extend the TSNetIQ framework to support 2D DOA estimation, i.e., simultaneously predicting both the azimuth and elevation angles of the sound source. This extension is particularly valuable for three-dimensional sound source localization in scenarios such as UAV tracking and aerial situational awareness.