1. Introduction
With the rapid advancement of science and technology, UAV technology has been extensively utilized across various fields [1], gradually becoming an indispensable component of human life and work. However, this swift development also poses numerous challenges and issues, particularly concerning security, privacy protection, and legal regulation. The lightweight and flexible nature of UAVs facilitates their operation without oversight from relevant authorities, which not only heightens the risk of illegal usage but also introduces a range of security threats to society [2,3,4,5].
Unauthorized UAVs entering restricted airspace pose a threat to aviation safety. Such illegal operations are becoming increasingly common because some operators use covert methods to evade supervision. UAVs have also been misused for transporting explosives [6] to carry out terrorist attacks [7], smuggling drugs [8], and other illegal activities, posing significant challenges to social stability and national security.
In light of the aforementioned circumstances, conducting research on UAV identification is of paramount importance. Currently, invasive UAVs can be identified through various methods including visual recognition [9,10], radar detection [11,12], radio frequency analysis [13], and acoustic monitoring [14]. However, the accuracy of visual recognition significantly diminishes in adverse meteorological conditions such as fog, overcast skies, and rain. Additionally, there is a propensity to misidentify objects that resemble UAVs (e.g., birds) as actual UAVs. Radar-based UAV recognition technology relies on measuring the scattering cross-section to identify targets via electromagnetic backscatter. Due to the small size of UAVs and their construction from low-conductivity materials, target echoes are often obscured by strong clutter or noise backgrounds. This results in a low signal-to-clutter ratio, complicating target classification and recognition efforts [15,16]. Furthermore, RF-based UAV identification technology may lose its effectiveness when an invading UAV follows a predetermined route; specifically, this occurs when the airborne transmitter does not emit radio frequency signals during flight [17].
A quantitative comparison of these mainstream detection technologies is presented in Table 1 [18]. While radar and RF systems offer superior detection ranges, they are often constrained by high costs and specific operational limitations (e.g., small RCS targets for radar, radio silence for RF). In contrast, acoustic systems, despite their shorter detection range, provide a distinct advantage in terms of cost-effectiveness and passive monitoring capability, making them an ideal complementary solution for low-altitude security.
The microphone sensor-based sound recognition method offers several advantages over the aforementioned recognition technologies. The low cost and ready availability of the equipment make it particularly suitable for large-scale deployment. Anwar et al. [19] proposed a novel machine learning framework designed to recognize UAV sounds in noisy environments, distinguishing them from various other sounds such as birds, aircraft, and thunderstorms. Wang et al. [20] introduced a UAV acoustic detection method grounded in the blind source separation (BSS) framework. Shi et al. [21] employed MFCC alongside Hidden Markov Models to recognize UAVs in real-world noisy settings. Yoo et al. [22] used MFCC as input to a simple neural network model based on the LeNet architecture, achieving classification of over ten different types of UAVs from acoustic data with an accuracy exceeding 97%. Aydın Ilhan et al. [23] proposed a sound processing algorithm utilizing the Light-Weight Convolutional Neural Network (LWCNN) model to differentiate UAV sounds from ambient air noise, animal noises, and vehicle noises within environments characterized by high spatial noise density.
Although these studies have made significant advancements in UAV sound recognition, they may not be applicable to long-distance environments. This paper leverages the unique acoustic characteristics of UAVs to select parameters that remain stable under signal attenuation as the foundation for feature extraction. The features derived from both UAV sounds and environmental noise are fed into a CNN classifier for training, ultimately achieving effective recognition of UAV sound signals in long-distance settings. Furthermore, considering the variations in harmonic characteristics among different types of UAVs, this study enables accurate identification of three distinct types of UAVs within a specified range.
Definition of Long-Distance:
It is important to clarify that ’long-distance’ is a relative term strictly dependent on the UAV’s acoustic power output. Aligning with the European Union Aviation Safety Agency (EASA) drone classification standards [24], the detection range varies significantly between classes.
1. For C2-class equivalent drones (e.g., the DJI Phantom 4 Pro V2.0 used in this study, MTOM < 4 kg), which emit higher sound pressure levels, we define ’long-distance’ as ranges exceeding 100 m (tested up to 120 m).
2. For C1-class equivalent drones (e.g., the DJI Air 2S and Mavic Pro, MTOM < 900 g), which are physically smaller and quieter, the ’long-distance’ threshold is lower. In our experiments, distances of 60–80 m effectively represent the long-range limit where environmental noise begins to dominate the signal.
2. Method
UAV sound is typically categorized into two types: pneumatic sound and mechanical sound. The former arises from the rotor’s interaction with the air and the friction between the UAV body and the surrounding atmosphere; it predominantly occupies the low-frequency range of the sound spectrum and attenuates slowly. In contrast, mechanical sound originates from the motors and transmission components; it generally lies in the high-frequency range of the acoustic signal and attenuates more rapidly. Overall, the acoustic signature produced during UAV hovering manifests as a harmonic series (comprising a fundamental frequency and multiple harmonics), a distinctive property that allows UAV-generated sound to be distinguished from environmental noise.
Figure 1 illustrates spectra for various UAV models including DJI Phantom 4 Pro V2.0, DJI Air 2S, DJI Mavic Pro (Shenzhen Da-Jiang Innovations Science and Technology Co., Ltd., Shenzhen, China) alongside a human voice for comparative analysis.
As Figure 1 shows, the sound signal produced by UAVs can be approximated as a harmonic series. The fundamental frequencies of the DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro are approximately 170 Hz, 200 Hz, and 360 Hz, respectively. In contrast, human vocalizations typically do not exhibit characteristics of a standard harmonic signal. Leveraging this distinctive feature allows for effective differentiation between UAV sounds and environmental noise.
Figure 2 illustrates the feature extraction flowchart for recognizing the DJI Phantom 4 Pro V2.0. In this study, spectral line interference is mitigated by adjusting the number of FFT points and extracting the envelope. Subsequently, frequency values corresponding to sufficiently prominent peaks in the envelope spectrum are recorded. The subset of these frequency values that aligns with the drone’s higher-order harmonic frequencies is retained and compiled into feature vectors.
Figure 3 presents the flowchart for long-range UAV sound recognition. To identify three distinct models (DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro), the three one-dimensional feature vectors are concatenated into a single one-dimensional feature vector. Every ten frames are then aggregated to form a two-dimensional feature matrix sample. These samples are fed into a CNN for training, ultimately achieving sound recognition of drones at long distances.
2.1. Mathematical Modeling of UAV Sound Signal
The acoustic emissions generated by a hovering UAV can be characterized as a signal composed of a harmonic series. In a real-world environment, this harmonic signal can be modeled as:
$$x(t) = \sum_{n=1}^{N} A_n \cos\left(2\pi n f_0 t + \varphi_n\right) + w(t)$$
where $f_0$ represents the fundamental frequency, $\varphi_n$ represents the phase of the $n$th harmonic component, $A_n$ represents the amplitude of the $n$th harmonic component, $N$ is the total number of harmonics, and $w(t)$ is noise.
Because the distance between the UAV and the microphone is not fixed in a real application environment, the amplitude of each harmonic is not fixed either and varies with distance. For UAVs in a hovering or quasi-stationary state, however, the frequency of each harmonic is relatively fixed and does not change significantly with distance. Therefore, the extracted feature information should focus on frequency rather than amplitude. Phase information is likewise explicitly excluded in this study: phase coherence degrades rapidly over long distances due to atmospheric turbulence and multipath effects, whereas the harmonic frequency distribution remains relatively robust against these propagation distortions.
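The argument that distance changes harmonic amplitudes but not harmonic frequencies can be illustrated with a small synthetic check. The sketch below is not the authors' code: it assumes a noise-free harmonic series with an illustrative fundamental of 172 Hz (chosen so that the harmonics match the 344, 516, and 688 Hz lines reported later), hypothetical amplitudes, and a simple gain factor standing in for attenuation.

```python
import math

def harmonic_signal(f0, amps, fs, duration, gain=1.0):
    """Noise-free harmonic series: gain * sum_n A_n * cos(2*pi*n*f0*t)."""
    n_samples = int(round(fs * duration))
    signal = []
    for k in range(n_samples):
        t = k / fs
        value = sum(a * math.cos(2 * math.pi * (n + 1) * f0 * t)
                    for n, a in enumerate(amps))
        signal.append(gain * value)
    return signal

def dft_magnitude(x, freq, fs):
    """Magnitude of the discrete Fourier sum of x at a single frequency in Hz."""
    re = sum(v * math.cos(2 * math.pi * freq * k / fs) for k, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * k / fs) for k, v in enumerate(x))
    return math.hypot(re, im)

def strongest_frequencies(x, fs, candidates, count):
    """Return the `count` candidate frequencies with the largest magnitude."""
    ranked = sorted(candidates, key=lambda f: dft_magnitude(x, f, fs), reverse=True)
    return sorted(ranked[:count])

FS = 2000                      # Hz, illustrative sampling rate
F0 = 172.0                     # Hz, assumed fundamental
AMPS = [1.0, 0.8, 0.6, 0.4]    # hypothetical harmonic amplitudes
CANDIDATES = [float(f) for f in range(100, 900, 4)]  # 4 Hz grid

near = harmonic_signal(F0, AMPS, FS, 0.25, gain=1.0)    # close to the microphone
far = harmonic_signal(F0, AMPS, FS, 0.25, gain=0.01)    # strongly attenuated

peaks_near = strongest_frequencies(near, FS, CANDIDATES, 4)
peaks_far = strongest_frequencies(far, FS, CANDIDATES, 4)
print(peaks_near, peaks_far)
```

Both calls return the same four frequencies even though the second signal is a hundred times weaker, which is the property the feature design relies on. Real recordings add noise, so weak harmonics can still fall below the detection threshold.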
It is important to acknowledge that when the UAV is in motion relative to the microphone, the Doppler effect will induce systematic frequency shifts in both the fundamental frequency and its harmonics. Specifically, the observed frequency increases as the UAV approaches and decreases as it recedes. In this study, our primary focus is on the signal attenuation characteristics of long-distance UAVs in hovering states. Although the proposed method utilizes a predefined harmonic frequency range (as described in Equation (15)) which provides a degree of tolerance for minor frequency fluctuations, significant Doppler shifts caused by high-speed movement could potentially lead to harmonics falling outside the target bins. Therefore, addressing Doppler-induced variability remains a critical direction for future research in dynamic UAV monitoring.
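To give a sense of scale for the Doppler concern, the following sketch applies the standard moving-source formula $f_{obs} = f \cdot c / (c - v)$ to the Phantom 4 Pro V2.0 harmonics quoted later in the paper. The speed of sound (343 m/s) and the radial speed (10 m/s) are illustrative assumptions, not values from the experiments.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def doppler_observed(f_source, radial_speed, c=SPEED_OF_SOUND):
    """Observed frequency for a moving source and a stationary microphone.

    radial_speed > 0 means the source approaches the microphone,
    radial_speed < 0 means it recedes.
    """
    return f_source * c / (c - radial_speed)

harmonics = [344.0, 516.0, 688.0, 860.0]   # Hz, harmonics reported in the text
approaching = [doppler_observed(f, 10.0) for f in harmonics]
receding = [doppler_observed(f, -10.0) for f in harmonics]
shifts = [obs - f for obs, f in zip(approaching, harmonics)]

for f, up, down in zip(harmonics, approaching, receding):
    print(f"{f:6.1f} Hz -> approaching {up:7.2f} Hz, receding {down:7.2f} Hz")
```

The relative shift is identical for every harmonic (roughly 3% at the assumed speed), so the absolute shift grows with harmonic order. Higher harmonics are therefore the first to leave a fixed-width frequency band.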
2.2. Feature Extraction Based on Harmonic Features
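The audio signal is divided into multiple frames, each with a duration of $T$ seconds. The number of samples within a frame, $N_s$, is calculated as follows:
$$N_s = \mathrm{round}(T f_s)$$
In the above equation, $f_s$ is the sampling frequency and $\mathrm{round}(\cdot)$ is a rounding function.
For an audio signal of length $L$ in samples, the total number of frames $M$ can be obtained as follows:
$$M = \left\lfloor \frac{L}{N_s} \right\rfloor$$
In the above equation, $\lfloor \cdot \rfloor$ denotes the floor function.
The signal $x_m(l)$ of the $m$th frame is extracted from the audio signal $x(l)$:
$$x_m(l) = x\big((m-1)N_s + l\big), \quad l = 0, 1, \ldots, N_s - 1, \quad m = 1, 2, \ldots, M$$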
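The framing step can be sketched in a few lines. This is a minimal illustration that assumes non-overlapping frames (as implied by the floor-based frame count) and uses the 48 kHz sampling rate and 0.3 s frame duration stated in the experiments; the function name is hypothetical.

```python
def frame_signal(samples, fs, frame_seconds):
    """Split a signal into non-overlapping frames of round(T * fs) samples.

    Trailing samples that do not fill a whole frame are discarded, which
    corresponds to taking the floor of len(samples) / samples_per_frame.
    """
    samples_per_frame = int(round(frame_seconds * fs))
    n_frames = len(samples) // samples_per_frame
    return [samples[m * samples_per_frame:(m + 1) * samples_per_frame]
            for m in range(n_frames)]

FS = 48_000                      # Hz, sampling rate used in the experiments
signal = [0.0] * FS              # one second of placeholder audio
frames = frame_signal(signal, FS, 0.3)
print(len(frames), len(frames[0]))
```

One second of audio at 48 kHz yields three complete 0.3 s frames of 14,400 samples each; the remaining 4800 samples are dropped.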
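For each frame $x_m(l)$, calculate the power spectral density using the Welch method.
First, the window function $w_h(l)$ is applied to each frame:
$$x_m^{w}(l) = x_m(l)\, w_h(l)$$
The Fast Fourier Transform (FFT) is then applied to the windowed signal to obtain the frequency-domain representation $X_m(k)$:
$$X_m(k) = \sum_{l=0}^{P-1} x_m^{w}(l)\, e^{-j 2\pi k l / P}, \qquad F = \frac{f_s}{P}$$
Here $P$ is the number of FFT points and $F$ is the frequency resolution (the spacing between adjacent frequency bins); when $P$ is reduced, $F$ increases, so the spectrum becomes coarser. Calculate the power spectral density $S_m(k)$ of each frame:
$$S_m(k) = \frac{\left|X_m(k)\right|^2}{f_s \sum_{l} w_h^2(l)}$$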
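The sketch below shows one way to estimate the power spectral density of a frame in the spirit of the Welch method: the frame is cut into segments of $P$ samples, each segment is windowed and transformed, and the periodograms are averaged. The Hann window, the 50% overlap, and the density scaling are assumptions, since the text does not specify them, and a plain DFT loop stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def hann_window(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def welch_psd(frame, fs, n_fft):
    """Average windowed periodograms over 50%-overlapping segments.

    Returns (freqs, psd) for bins 0..n_fft/2. The bin spacing is fs / n_fft,
    so a smaller n_fft gives a coarser spectrum. One-sided doubling is
    omitted because it does not move the peak locations.
    """
    if len(frame) < n_fft:
        raise ValueError("frame must contain at least n_fft samples")
    window = hann_window(n_fft)
    scale = fs * sum(w * w for w in window)
    starts = range(0, len(frame) - n_fft + 1, n_fft // 2)
    n_bins = n_fft // 2 + 1
    psd = [0.0] * n_bins
    for start in starts:
        segment = [frame[start + i] * window[i] for i in range(n_fft)]
        for k in range(n_bins):
            coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n_fft)
                        for i, v in enumerate(segment))
            psd[k] += abs(coeff) ** 2 / scale
    psd = [p / len(starts) for p in psd]
    freqs = [k * fs / n_fft for k in range(n_bins)]
    return freqs, psd

FS = 2000                                   # Hz, small rate to keep the loop fast
tone = [math.cos(2 * math.pi * 344.0 * i / FS) for i in range(750)]
freqs, psd = welch_psd(tone, FS, n_fft=250)
peak_frequency = freqs[psd.index(max(psd))]
print(freqs[1], peak_frequency)
```

With 250 points at 2 kHz the bin spacing is 8 Hz, and the 344 Hz test tone lands exactly on a bin. In practice an FFT routine would replace the inner loop.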
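To obtain a smoother spectrum curve, the power spectral density $S_m(k)$ is smoothed with a low-pass filter to obtain the envelope $E_m(k)$. The low-pass filter is $H(z)$, whose cut-off frequency is $f_c$:
$$H(z) = \frac{\sum_{q=0}^{Q} b_q z^{-q}}{1 + \sum_{q=1}^{Q} a_q z^{-q}}$$
In the above equation, $b_q$ and $a_q$ are the coefficients of the filter and $z$ is the complex frequency variable.
To eliminate phase distortion, a bidirectional (forward-backward) filter function $\mathcal{B}\{\cdot\}$ is used, which applies $H(z)$ once in the forward direction and once in the reverse direction:
$$E_m(k) = \mathcal{B}\{S_m(k)\}$$
Then, maximum-value normalization is performed:
$$\tilde{E}_m(k) = \frac{E_m(k)}{\max_k E_m(k)}$$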
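The benefit of bidirectional filtering can be seen on a toy spectrum. The sketch below uses a first-order recursive low-pass filter as a stand-in, because the filter order, coefficients, and cut-off frequency are not given in the text; the smoothing factor `alpha` is a hypothetical parameter.

```python
def lowpass_forward(values, alpha):
    """First-order recursive low-pass: y[k] = alpha*x[k] + (1-alpha)*y[k-1]."""
    out, prev = [], values[0]
    for v in values:
        prev = alpha * v + (1.0 - alpha) * prev
        out.append(prev)
    return out

def zero_phase_lowpass(values, alpha):
    """Filter forward, then backward, so that the phase lag cancels out."""
    forward = lowpass_forward(values, alpha)
    backward = lowpass_forward(forward[::-1], alpha)
    return backward[::-1]

def max_normalize(values):
    """Scale so the largest value becomes 1 (unchanged if the maximum is not positive)."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else list(values)

# Toy spectrum: a symmetric bump centred on index 24.
spectrum = [0.0] * 20 + [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0] + [0.0] * 20

one_pass = lowpass_forward(spectrum, alpha=0.3)
envelope = max_normalize(zero_phase_lowpass(spectrum, alpha=0.3))

shifted_peak = one_pass.index(max(one_pass))
aligned_peak = envelope.index(max(envelope))
print(shifted_peak, aligned_peak)
```

A single forward pass moves the peak to a later index, which would bias the recorded peak frequency, while the forward-backward pass keeps the peak at its original position.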
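Detect the peaks $p_i$ greater than 0.1 and their corresponding frequencies $f_i$ in the normalized envelope $\tilde{E}_m(k)$. Each $p_i$ is a local maximum; detect the valley values $v_i^{L}$ and $v_i^{R}$ on the left and right of the peak $p_i$.
Filter the peaks by means of their prominence $\Delta_i$, retaining only sufficiently prominent peaks:
$$\Delta_i = p_i - \max\left(v_i^{L}, v_i^{R}\right)$$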
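Map the filtered peak frequencies to the predefined harmonic frequency range matrix $\mathbf{R}_A$ of UAV A, whose $j$th row $[f_j^{\min}, f_j^{\max}]$ bounds the $j$th harmonic band.
Check whether a peak frequency $f_i$ falls into a given frequency range:
$$f_j^{\min} \le f_i \le f_j^{\max}$$
The feature vector of UAV A extracted in each frame is $\mathbf{V}_A$; the above steps are repeated to obtain the feature vector $\mathbf{V}_B$ of UAV B and the feature vector $\mathbf{V}_C$ of UAV C.
Combine $\mathbf{V}_A$, $\mathbf{V}_B$, and $\mathbf{V}_C$ to constitute $\mathbf{V} = [\mathbf{V}_A, \mathbf{V}_B, \mathbf{V}_C]$, where $\mathbf{V} \in \mathbb{R}^{1 \times 22}$.
The feature vectors extracted from 10 consecutive frames are used to form a two-dimensional feature matrix:
$$\mathbf{X} = \left[\mathbf{V}_1^{\top}, \mathbf{V}_2^{\top}, \ldots, \mathbf{V}_{10}^{\top}\right]^{\top}$$
Each CNN training sample is a two-dimensional feature matrix of size $10 \times 22$.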
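The mapping and stacking steps can be sketched as follows. Several details are assumptions made for illustration only: the band centres are taken from the harmonics reported in the experiments, the ±6 Hz tolerance is hypothetical, and each band stores the matched peak frequency (or 0 when no peak falls inside it). With these bands the per-frame vector has 12 entries rather than the 22 used in the paper, whose exact band layout is not given here.

```python
def make_bands(harmonics, tolerance=6.0):
    """Build (low, high) bounds around each harmonic; the tolerance is hypothetical."""
    return [(h - tolerance, h + tolerance) for h in harmonics]

def band_features(peak_freqs, bands):
    """One entry per band: the first peak frequency inside it, else 0.0."""
    feature = []
    for low, high in bands:
        inside = [f for f in peak_freqs if low <= f <= high]
        feature.append(inside[0] if inside else 0.0)
    return feature

# Illustrative band centres taken from the harmonics quoted in the experiments.
BANDS_A = make_bands([344.0, 516.0, 688.0, 860.0])            # Phantom 4 Pro V2.0
BANDS_B = make_bands([460.0, 700.0, 920.0, 1160.0, 1380.0])   # Air 2S
BANDS_C = make_bands([360.0, 720.0, 1080.0])                  # Mavic Pro

def frame_feature(peak_freqs):
    """Concatenate the per-UAV feature vectors into one row."""
    return (band_features(peak_freqs, BANDS_A)
            + band_features(peak_freqs, BANDS_B)
            + band_features(peak_freqs, BANDS_C))

def feature_matrix(peaks_per_frame, frames_per_sample=10):
    """Stack the rows of consecutive frames into one 2-D sample."""
    if len(peaks_per_frame) != frames_per_sample:
        raise ValueError("expected exactly %d frames" % frames_per_sample)
    return [frame_feature(peaks) for peaks in peaks_per_frame]

detected = [[345.0, 690.0, 1000.0]] * 10   # same toy peaks in every frame
sample = feature_matrix(detected)
print(len(sample), len(sample[0]), sample[0][:4])
```

In this toy case only the first and third bands of UAV A are populated, and the 1000 Hz peak is discarded because it matches no harmonic band of any of the three UAVs.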
2.3. CNN Classifier
A CNN is a deep learning model that is especially suitable for processing data with a grid structure, and it is used here to process the feature data. The overall structure of the CNN classifier used in this paper is shown in Table 2.
Convolutional layers are the core components of CNN and are used to extract features from the input data. A convolutional layer slides over the input data through a filter (convolution kernel) to perform a convolution operation and generate a Feature Map. Each convolution kernel is responsible for capturing different features of the image such as edges, textures, etc.
The kernel (filter) of the first convolutional layer has size $K_h \times K_w$, and there are 8 filters in total. The weight matrix of the $k$th filter is $\mathbf{W}^{(k)}$, its bias is $b_k$, and the output of the convolution operation is $Y_{i,j,k}$:
$$Y_{i,j,k} = \sum_{m}\sum_{n} W^{(k)}_{m,n}\, X_{i+m,\,j+n} + b_k$$
where $i$ and $j$ are the position indices of the output matrix and $k$ is the index of the filter; $m$ and $n$ are the row and column indices within the convolution kernel. The input $\mathbf{X}$ is zero-padded so that the output keeps the same spatial dimensions as the input, so $\mathbf{Y} \in \mathbb{R}^{10 \times 22 \times 8}$.
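The effect of zero padding on the output size can be checked with a direct implementation of the convolution sum. The sketch assumes a 3 × 3 kernel purely for illustration (the kernel size is not recoverable from the text here) and, like most deep learning frameworks, computes a cross-correlation without flipping the kernel.

```python
def conv2d_same(x, kernel, bias=0.0):
    """Single-channel 2-D convolution with zero padding ('same' output size).

    The kernel is centred on each output position; positions that fall
    outside the input contribute zero.
    """
    rows, cols = len(x), len(x[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    pad_r, pad_c = k_rows // 2, k_cols // 2
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = bias
            for m in range(k_rows):
                for n in range(k_cols):
                    r, c = i + m - pad_r, j + n - pad_c
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[m][n] * x[r][c]
            out_row.append(acc)
        out.append(out_row)
    return out

feature_matrix = [[1.0] * 22 for _ in range(10)]    # a 10 x 22 input of ones
kernel = [[1.0] * 3 for _ in range(3)]              # assumed 3 x 3 kernel
feature_maps = [conv2d_same(feature_matrix, kernel) for _ in range(8)]  # 8 filters

print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
```

Each of the eight maps keeps the 10 × 22 shape. With an all-ones input and kernel, interior outputs equal 9 while corner outputs equal 4, which shows where the padded zeros enter the sum.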
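Batch normalization is used to normalize the output of the convolutional layers in order to speed up training. The normalized output is $\hat{Y}_{i,j,k}$:
$$\hat{Y}_{i,j,k} = \gamma_k \frac{Y_{i,j,k} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k$$
$$\mu_k = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} Y_{i,j,k}, \qquad \sigma_k^2 = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(Y_{i,j,k} - \mu_k\right)^2$$
where $\gamma_k$ and $\beta_k$ are learnable parameters, $\epsilon$ is a constant that prevents division by zero, $\mu_k$ is the mean of the $k$th channel, $\sigma_k^2$ is the variance of the $k$th channel, and $H$ and $W$ are the spatial dimensions.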
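The normalization can be illustrated on a single channel. The sketch follows the per-channel spatial statistics described above; note that batch normalization layers in common frameworks additionally average over the samples of a mini-batch during training. The default values of `gamma`, `beta`, and `eps` are assumptions.

```python
import math

def normalize_channel(channel, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature map to zero mean and (almost) unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    denom = math.sqrt(variance + eps)
    return [[gamma * (v - mean) / denom + beta for v in row] for row in channel]

channel = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
normalized = normalize_channel(channel)

flat = [v for row in normalized for v in row]
out_mean = sum(flat) / len(flat)
out_var = sum((v - out_mean) ** 2 for v in flat) / len(flat)
print(round(out_mean, 6), round(out_var, 4))
```

The output variance is slightly below one because of the `eps` term in the denominator.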
The ReLU activation function sets negative values to 0:
$$A_{i,j,k} = \max\left(0, \hat{Y}_{i,j,k}\right)$$
The output after activation is $\mathbf{A}$.
The pooling layer halves the output size by extracting the maximum value of each 2 × 2 region of $\mathbf{A}$. The pooling output $U_{i,j,k}$ is:
$$U_{i,j,k} = \max_{m,\,n \in \{0,1\}} A_{2i-m,\,2j-n,\,k}$$
The output dimension is $5 \times 11 \times 8$.
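The shape change caused by activation and pooling can be verified directly. The sketch below is a plain-Python illustration with a made-up 10 × 22 feature map; only the 2 × 2 pooling with stride 2 comes from the text.

```python
def relu(feature_map):
    """Replace negative entries with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Take the maximum of each non-overlapping 2 x 2 block."""
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    return [[max(feature_map[2 * i][2 * j], feature_map[2 * i][2 * j + 1],
                 feature_map[2 * i + 1][2 * j], feature_map[2 * i + 1][2 * j + 1])
             for j in range(cols)]
            for i in range(rows)]

# Made-up 10 x 22 map whose values run from -110 to 109.
feature_map = [[float(i * 22 + j - 110) for j in range(22)] for i in range(10)]
pooled = max_pool_2x2(relu(feature_map))
print(len(pooled), len(pooled[0]))
```

The 10 × 22 map becomes 5 × 11. The top-left block contains only negative values, so after ReLU its pooled value is 0.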
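The second convolutional layer operates similarly to the first, with the same kernel size but with the number of filters increased to 16. Let the input be $\mathbf{U}$ (the pooled output of the first block); then the output $Y^{(2)}_{i,j,k}$ after convolution is given by:
$$Y^{(2)}_{i,j,k} = \sum_{c=1}^{8}\sum_{m}\sum_{n} W^{(2,k)}_{m,n,c}\, U_{i+m,\,j+n,\,c} + b^{(2)}_k$$
The output dimension is $5 \times 11 \times 16$. Batch normalization and ReLU activation are also applied to $\mathbf{Y}^{(2)}$ to generate the output $\mathbf{A}^{(2)}$.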
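In the fully connected layer, the convolutional output is flattened into a vector $\mathbf{f}$ of dimension $D$. The output of the fully connected layer is:
$$\mathbf{z} = \mathbf{W}_{fc}\,\mathbf{f} + \mathbf{b}_{fc}$$
In the above equation, $\mathbf{W}_{fc}$ is the weight matrix and $\mathbf{b}_{fc}$ is the bias vector.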
The Softmax function is used to convert the output of the fully connected layer into a probability distribution:
$$\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
In the above equation, $\hat{y}_i$ is the predicted probability of the $i$th class and $C$ is the number of classes.
The Softmax function ensures that the sum of all output probabilities is 1, that is:
$$\sum_{i=1}^{C} \hat{y}_i = 1$$
The model is trained with the Categorical Cross-Entropy loss function, calculated as:
$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$
where $y_i$ is the true label (one-hot encoded) and $\hat{y}_i$ is the predicted probability. This loss function is chosen for its efficient gradient convergence in multi-class classification tasks where target classes are mutually exclusive.
The $\arg\max$ operator is used to determine the predicted class $\hat{c}$ of an input sample, choosing the class with the highest probability as the final prediction:
$$\hat{c} = \arg\max_{i} \hat{y}_i$$
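The last three steps (Softmax, cross-entropy, and class selection) fit in a short sketch. The logits are made-up numbers, the class order follows the labels used for the confusion matrix, and the small `eps` inside the logarithm is an added numerical safeguard rather than part of the formula.

```python
import math

CLASSES = ["Noise", "DJI Air 2S", "DJI Mavic Pro", "DJI Phantom 4 Pro V2.0"]

def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs, eps=1e-12):
    """Categorical cross-entropy between a one-hot label and probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

logits = [0.2, 2.5, -1.0, 0.7]          # hypothetical fully connected outputs
probs = softmax(logits)
predicted = predict(probs)

loss_if_air2s = cross_entropy([0, 1, 0, 0], probs)
loss_if_mavic = cross_entropy([0, 0, 1, 0], probs)
print(CLASSES[predicted], round(sum(probs), 6))
```

The loss is small when the true class is the one the network favours and large otherwise, which is what drives the weights toward the correct class during training.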
3. Experiment and Analysis
The UAV models utilized in this study include the DJI Phantom 4 Pro V2.0, DJI Mavic Pro, and DJI Air 2S, as shown in Figure 4g, Figure 4h, and Figure 4i, respectively. The microphone sensor employed is the LM1111, with a sampling frequency set at 48 kHz, as shown in Figure 4b. The DJI Phantom 4 Pro V2.0 was positioned at distances of 5 m, 40 m, 50 m, 60 m, and 70 m from the microphone; the DJI Air 2S was positioned at distances of 5 m, 6 m, 7 m, 8 m, 9 m, 10 m, 11 m, 12 m, 30 m, 40 m, and 50 m from the microphone; and the DJI Mavic Pro was positioned at distances of 5 m and 50 m from the microphone. Environmental noise was recorded directly when the drones were not in operation. The experimental scenes are shown in Figure 4a,c–f. The environmental noise recordings lasted 1614 s, while the recordings for the DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro lasted 486 s, 1374 s, and 264 s, respectively. The total duration of audio used for training the CNN was 3738 s.
For the sake of convenience, the harmonic features proposed in this paper are referred to as HF. The HF dimension is 10 × 22, with a feature vector of size 1 × 22 extracted from each frame. The resulting feature matrix consists of 10 feature vectors, and each frame has a duration of 0.3 s.
In the comparative experiments, the MFCC dimension is set at 100 × 22, where a feature vector of size 1 × 22 is also extracted from each frame. This feature matrix comprises 100 feature vectors, with each frame lasting for 25 ms and an overlap of 15 ms between consecutive frames.
In the additional comparative experiments, the Log-Mel Spectrogram dimension is set at 300 × 64, where a feature vector of size 1 × 64 is extracted from each frame. This feature matrix comprises 300 feature vectors, with each frame lasting for 25 ms and an overlap of 15 ms between consecutive frames. Furthermore, a Raw Waveform CNN based on the M5 architecture was implemented as an end-to-end deep learning baseline. Unlike the feature-based approaches, this model accepts raw time-domain waveforms directly as input. The input dimension is set to 14,400 × 1, corresponding to a 0.3-s audio segment at a sampling rate of 48 kHz. To effectively capture the low-frequency acoustic characteristics of UAVs, the first convolutional layer of the M5 model is configured with a large kernel size of 80.
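Because the four input representations use different frame lengths and overlaps, it can help to work out how much audio one sample of each covers. The sketch below only restates the frame settings given above in integer milliseconds; the helper name is hypothetical.

```python
def span_ms(n_frames, frame_ms, overlap_ms=0):
    """Audio covered by n_frames frames with the given frame length and overlap."""
    hop_ms = frame_ms - overlap_ms
    return frame_ms + (n_frames - 1) * hop_ms

hf_ms = span_ms(10, 300)              # HF: 10 frames of 0.3 s, no overlap
mfcc_ms = span_ms(100, 25, 15)        # MFCC: 100 frames, 25 ms, 15 ms overlap
logmel_ms = span_ms(300, 25, 15)      # Log-Mel: 300 frames, same framing
raw_ms = 14_400 * 1000 // 48_000      # raw waveform: 14,400 samples at 48 kHz

print(hf_ms, mfcc_ms, logmel_ms, raw_ms)
```

Under these settings an HF sample and a Log-Mel sample each cover about 3 s of audio, an MFCC sample about 1 s, and a raw-waveform sample 0.3 s. The baselines therefore do not all see the same amount of signal per decision.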
Figure 5 presents the spectrograms of various UAV models at different distances. The energy is concentrated within a specific frequency range. When the DJI Phantom 4 Pro V2.0 hovers at a distance of 120 m from the microphone, high-order harmonics with frequencies of 344 Hz, 516 Hz, 688 Hz, and 860 Hz are still present in the generated sound signal and exhibit minimal variation over time. Similarly, when the DJI Air 2S hovers at a distance of 80 m from the microphone, high-order harmonics with frequencies of 460 Hz, 700 Hz, 920 Hz, 1160 Hz, and 1380 Hz remain evident in the generated sound signal and show little change over time. Furthermore, when the DJI Mavic Pro hovers at a distance of 60 m from the microphone, high-order harmonics with frequencies of 360 Hz, 720 Hz, and 1080 Hz persist in the generated sound signal without significant temporal variation.
As illustrated in Figure 6, the HF-based feature extraction method proposed in this paper effectively captures the harmonic features present in the spectrograms of sound signals from various types of UAVs at different distances. Comparing the 120 m HF data of the DJI Phantom 4 Pro V2.0 with the data at 5 m shows that the harmonics at 344 Hz, 516 Hz, 688 Hz, and 860 Hz are still detectable and can be successfully extracted. Similarly, for the DJI Air 2S at a distance of 80 m compared with 5 m, the harmonics at 460 Hz, 700 Hz, 920 Hz, 1160 Hz, and 1380 Hz remain identifiable and extractable. Furthermore, for the DJI Mavic Pro at a distance of 60 m compared with 5 m, the harmonics at 360 Hz, 720 Hz, and 1080 Hz continue to exist and can be effectively extracted.
Figure 7 presents the confusion matrix for the validation set, where the labels “1,” “2,” “3,” and “4” correspond to “Noise,” “DJI Air 2S,” “DJI Mavic Pro,” and “DJI Phantom 4 Pro V2.0,” respectively. In this experiment, each sample is constructed by aggregating 10 consecutive frames (with a single frame duration of 0.3 s), forming a robust feature matrix. The validation set was generated through random splitting of the dataset. Consequently, the model achieved 100% recognition accuracy on the Mavic Pro in this internal validation phase, demonstrating excellent convergence and feature learning capability. However, it must be noted that this result reflects the model’s performance on data with a similar distribution to the training set. The rigorous evaluation of the model’s generalization ability under severe signal attenuation (long-distance scenarios) is presented separately in Table 3.
The audio recordings of the three distinct UAV models listed in Table 3 were not utilized during the CNN training process. The proposed HF method demonstrated superior performance, reaching a recognition rate of 78.03% for the DJI Phantom 4 Pro V2.0 at 120 m, 96.28% for the DJI Air 2S at 80 m, and 81.82% for the DJI Mavic Pro at 60 m, confirming the feasibility of the scheme for long-distance identification. In contrast, as presented in Table 3, the recognition accuracy of the three comparative baselines (MFCC, Log-Mel Spectrogram, and Raw Waveform CNN with the M5 architecture) proved to be significantly less stable. While the Raw Waveform CNN (M5) achieved a competitive accuracy of 77.97% for the Phantom 4 Pro V2.0, it failed to generalize to the quieter DJI Air 2S and DJI Mavic Pro, dropping to 48.82% and 23.14%, respectively. Similarly, the Log-Mel Spectrogram method yielded inconsistent results, achieving only 48.57% for the Phantom 4 Pro V2.0 and 22.73% for the Mavic Pro. The traditional MFCC method performed worst on the Mavic Pro, with an accuracy of merely 1.47%. This extremely low figure is not an anomaly but reflects a systematic feature collapse in low-SNR conditions. Detailed analysis reveals that 38.24% of the Mavic Pro samples were misclassified as the DJI Phantom 4 Pro V2.0 due to spectral similarity after attenuation. Collectively, these comparisons suggest that incorporating domain-specific physical harmonic features offers a significant advantage in low Signal-to-Noise Ratio (SNR) environments. The proposed HF method effectively mitigates the impact of signal attenuation, demonstrating higher stability for long-distance monitoring than purely data-driven end-to-end models (such as M5) and traditional features.