Research on Sound Recognition of Long-Distance UAV Based on Harmonic Features

Fan, Kuangang; Pan, Wenjie; Zhong, Jilong; Zeng, Zhiyu; Chen, Wenzheng

doi:10.3390/drones10010025


by Kuangang Fan ¹,²,*, Wenjie Pan ³, Jilong Zhong ³, Zhiyu Zeng ³ and Wenzheng Chen

¹ School of Electrical Engineering, Shanghai Dianji University, Shanghai 201306, China

² Jiangsu Golden-Keen Eyes Intelligent and Control Technology Company Limited, Suzhou 215412, China

³ School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(1), 25; https://doi.org/10.3390/drones10010025

Submission received: 26 November 2025 / Revised: 14 December 2025 / Accepted: 31 December 2025 / Published: 1 January 2026


Highlights

What are the main findings?

A feature extraction method based on high-order harmonic features was developed, achieving a recognition accuracy of 78.03% for the DJI Phantom 4 Pro V2.0 at a distance of 120 m.
The proposed method outperforms the traditional Mel-frequency cepstral coefficients (MFCC) approach, demonstrating a 16.03% improvement in accuracy under long-distance conditions.

What are the implications of the main findings?

This study addresses the challenge of signal attenuation in long-distance acoustic monitoring, providing a robust solution for identifying UAVs in complex environments.
Leveraging the low cost and easy deployment of acoustic sensors, this method facilitates large-scale monitoring networks, effectively complementing radar and visual systems for restricted airspace protection.

Abstract

With the extensive application of unmanned aerial vehicles (UAVs) in both military and civilian domains, UAV identification technology has become increasingly important. Among the various recognition methods, sound recognition has attracted considerable attention because of its low cost and easy deployment. However, most existing research focuses on isolating UAV sounds from noise signals in complex environments, and few studies address long-distance UAV sound recognition. To address this gap, this paper proposes a frequency-domain feature extraction method based on harmonic features. By analyzing the harmonic features of UAV sounds, we select stable parameters that are robust against interference as the main features, thereby minimizing information redundancy and feature fluctuation. The experimental results indicate that this method achieves a recognition accuracy of 78.03% for the DJI Phantom 4 Pro V2.0 UAV at a distance of 120 m. To validate the proposed method, comprehensive comparisons against traditional MFCC, Log-Mel Spectrogram, and modern Raw Waveform CNN (M5) baselines demonstrate the superior robustness of the proposed approach. While these comparative methods exhibited significant performance drops in challenging long-distance scenarios (e.g., accuracies falling below 24% for the DJI Mavic Pro), the proposed method maintained consistent identification capabilities, validating its effectiveness in low-signal environments.

Keywords:

UAV sound recognition; long distance; harmonic features

1. Introduction

With the rapid advancement of science and technology, UAV technology has been extensively utilized across various fields [1], gradually becoming an indispensable component of human life and work. However, this swift development also poses numerous challenges and issues, particularly concerning security, privacy protection, and legal regulation. The lightweight and flexible nature of UAVs facilitates their operation without oversight from relevant authorities, which not only heightens the risk of illegal usage but also introduces a range of security threats to society [2,3,4,5].

Unauthorized UAVs entering restricted airspace pose a threat to aviation safety. Such illegal operations are becoming increasingly common because some operators use covert methods to evade supervision. UAVs have also been misused for transporting explosives [6] to carry out terrorist attacks [7], smuggling drugs [8], and other illegal activities, posing significant challenges to social stability and national security.

In light of the aforementioned circumstances, conducting research on UAV identification is of paramount importance. Currently, invasive UAVs can be identified through various methods including visual recognition [9,10], radar detection [11,12], radio frequency analysis [13], and acoustic monitoring [14]. However, the accuracy of visual recognition significantly diminishes in adverse meteorological conditions such as fog, overcast skies, and rain. Additionally, there is a propensity to misidentify objects that resemble UAVs (e.g., birds) as actual UAVs. Radar-based UAV recognition technology relies on measuring the scattering cross-section to identify targets via electromagnetic backscatter. Due to the small size of UAVs and their construction from low-conductivity materials, target echoes are often obscured by strong clutter or noise backgrounds. This results in a low signal-to-clutter ratio, complicating target classification and recognition efforts [15,16]. Furthermore, RF-based UAV identification technology may lose its effectiveness when an invading UAV follows a predetermined route; specifically, this occurs when the airborne transmitter does not emit radio frequency signals during flight [17].

A quantitative comparison of these mainstream detection technologies is presented in Table 1 [18]. While radar and RF systems offer superior detection ranges, they are often constrained by high costs and specific operational limitations (e.g., small RCS targets for radar, radio silence for RF). In contrast, acoustic systems, despite their shorter detection range, provide a distinct advantage in terms of cost-effectiveness and passive monitoring capability, making them an ideal complementary solution for low-altitude security.

The microphone sensor-based sound recognition method offers several advantages over the aforementioned recognition technologies. Its low cost and easy accessibility of equipment make it particularly suitable for large-scale deployment. Anwar et al. [19] proposed a novel machine learning framework designed to recognize UAV sounds in noisy environments, distinguishing them from various other sounds such as birds, aircraft, and thunderstorms. Wang et al. [20] introduced a UAV acoustic detection method grounded in the blind source separation (BSS) framework. Shi et al. [21] employed MFCC alongside Hidden Markov Models to recognize UAVs in real-world noisy settings. Yoo et al. [22] implemented MFCC as input into a simple neural network model based on the LeNet architecture, achieving classification of over ten different types of UAVs from acoustic data with an accuracy exceeding 97%. Aydın Ilhan et al. [23] proposed a sound processing algorithm utilizing the Light-Weight Convolutional Neural Network (LWCNN) model to differentiate UAV from ambient air noise, animal noises, and vehicle noises within environments characterized by high spatial noise density.

Although these studies have made significant advancements in UAV sound recognition, they may not be applicable to long-distance environments. This paper leverages the unique acoustic characteristics of UAVs to select parameters that remain stable under signal attenuation as the foundation for feature extraction. The features derived from both UAV sounds and environmental noise are fed into a CNN classifier for training, ultimately achieving effective recognition of UAV sound signals in long-distance settings. Furthermore, considering the variations in harmonic characteristics among different types of UAVs, this study enables accurate identification of three distinct types of UAVs within a specified range.

Definition of Long-Distance:

It is important to clarify that ‘long-distance’ is a relative term strictly dependent on the UAV’s acoustic power output. Aligning with the European Union Aviation Safety Agency (EASA) drone classification standards [24], the detection range varies significantly between classes.

1. For C2-class equivalent drones (e.g., the DJI Phantom 4 Pro V2.0 used in this study, MTOM < 4 kg), which emit higher sound pressure levels, we define ‘long-distance’ as ranges exceeding 100 m (tested up to 120 m).
2. For C1-class equivalent drones (e.g., the DJI Air 2S and Mavic Pro, MTOM < 900 g), which are physically smaller and quieter, the ‘long-distance’ threshold is lower. In our experiments, distances of 60–80 m effectively represent the long-range limit where environmental noise begins to dominate the signal.

2. Method

UAV sound is typically categorized into two types: pneumatic sound and mechanical sound. The former arises from the rotor’s interaction with the air and the friction between the UAV body and the surrounding atmosphere, which predominantly occupies the low-frequency range of the sound spectrum and exhibits slow attenuation. In contrast, mechanical sound originates from motor operations and transmission components, generally found in the high-frequency range of the acoustic signal, characterized by a more rapid rate of attenuation. Overall, the acoustic signature produced during UAV hovering manifests as a harmonic series (comprising a fundamental frequency and multiple harmonics) that can be effectively distinguished from environmental noise by leveraging this distinctive property of UAV-generated sounds. Figure 1 illustrates spectra for various UAV models including DJI Phantom 4 Pro V2.0, DJI Air 2S, DJI Mavic Pro (Shenzhen Da-Jiang Innovations Science and Technology Co., Ltd., Shenzhen, China) alongside a human voice for comparative analysis.

In Figure 1, the sound signal produced by UAVs can be approximated as a harmonic series. The fundamental frequencies of the DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro are approximately 170 Hz, 200 Hz, and 360 Hz, respectively. In contrast, human vocalizations typically do not exhibit characteristics of a standard harmonic signal. Leveraging this distinctive feature allows for effective differentiation between UAV sounds and environmental noise.

Figure 2 illustrates the feature extraction flowchart for recognizing the DJI Phantom 4 Pro V2.0. In this study, spectral line interference is mitigated by adjusting the number of FFT points and extracting the envelope. Subsequently, frequency values corresponding to sufficiently prominent peaks in the envelope spectrum are recorded. The subset of these frequency values that aligns with the drone’s high harmonic frequencies is retained and compiled into feature vectors.

Figure 3 presents the flowchart for long-range UAV sound recognition. To identify three distinct models—DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro—three one-dimensional feature vectors are combined into a single comprehensive one-dimensional feature vector. Every ten frames are then aggregated to form a two-dimensional feature matrix sample. These samples are inputted into a CNN for training purposes, ultimately achieving sound recognition of drones at long distances.

2.1. Mathematical Modeling of UAV Sound Signal

The acoustic emissions generated by a hovering UAV can be characterized as a harmonic series. In a real-world environment, this harmonic signal can be modeled as:

$$x(t) = \sum_{n=1}^{N} A_{n} \sin(2\pi n f_{1} t + \varphi_{n}) + n(t) \tag{1}$$

where $f_{1}$ is the fundamental frequency, $\varphi_{n}$ and $A_{n}$ are the phase and amplitude of the $n$th harmonic component (with $n = 1$ being the fundamental), $N$ is the total number of harmonics, and $n(t)$ is noise.

In practical applications, the distance between the UAV and the microphone is not fixed, so the amplitude of each harmonic varies with distance. For UAVs in a hovering or quasi-stationary state, however, the frequency of each harmonic is relatively fixed and does not change significantly with distance. Therefore, the extracted features should focus on frequency rather than amplitude. Phase information is likewise excluded in this study: phase coherence degrades rapidly over long distances due to atmospheric turbulence and multipath effects, whereas the harmonic frequency distribution remains relatively robust against these propagation distortions.
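The argument for using frequency rather than amplitude can be illustrated with a small sketch of Equation (1). It synthesizes the same harmonic series at two amplitude levels and checks that the strongest spectral line stays at the same frequency. The function names, the 2 kHz sampling rate, and the amplitude values are illustrative assumptions rather than values from the paper, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math
import random


def harmonic_signal(f1, amps, fs, duration, noise_std=0.0, seed=0):
    """Sample x(t) = sum_n A_n sin(2*pi*n*f1*t + phi_n) + noise (Equation (1))."""
    rng = random.Random(seed)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in amps]
    samples = []
    for i in range(round(duration * fs)):
        t = i / fs
        value = sum(a * math.sin(2.0 * math.pi * (n + 1) * f1 * t + p)
                    for n, (a, p) in enumerate(zip(amps, phases)))
        samples.append(value + rng.gauss(0.0, noise_std))
    return samples


def dominant_frequency(x, fs):
    """Frequency in Hz of the largest DFT magnitude bin, excluding DC."""
    n = len(x)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * fs / n


# Hypothetical "near" and "far" recordings: same harmonics, 100x weaker amplitudes.
near = harmonic_signal(170.0, [1.0, 0.5, 0.25], fs=2000, duration=0.1)
far = harmonic_signal(170.0, [0.01, 0.005, 0.0025], fs=2000, duration=0.1)
same_peak = dominant_frequency(near, 2000) == dominant_frequency(far, 2000)
```

In this idealized, noise-free case the attenuation changes only the spectral magnitudes, which is the property the feature design relies on.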

It is important to acknowledge that when the UAV is in motion relative to the microphone, the Doppler effect will induce systematic frequency shifts in both the fundamental frequency and its harmonics. Specifically, the observed frequency increases as the UAV approaches and decreases as it recedes. In this study, our primary focus is on the signal attenuation characteristics of long-distance UAVs in hovering states. Although the proposed method utilizes a predefined harmonic frequency range (as described in Equation (15)) which provides a degree of tolerance for minor frequency fluctuations, significant Doppler shifts caused by high-speed movement could potentially lead to harmonics falling outside the target bins. Therefore, addressing Doppler-induced variability remains a critical direction for future research in dynamic UAV monitoring.
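To give a feel for the scale of this effect, the sketch below applies the standard Doppler formula for a moving source and a stationary microphone to the 860 Hz harmonic of the DJI Phantom 4 Pro V2.0. The speed of sound (343 m/s), the radial speeds, and the ±10 Hz bin width are assumptions chosen for illustration; the paper does not state its bin widths.

```python
def doppler_shifted(f_source, radial_speed, c=343.0):
    """Observed frequency for a source moving toward (+) or away from (-) a fixed microphone."""
    return f_source * c / (c - radial_speed)


def in_bin(frequency, low, high):
    """True if the frequency falls inside the closed range [low, high]."""
    return low <= frequency <= high


# Hypothetical harmonic bin of 860 Hz +/- 10 Hz.
LOW, HIGH = 850.0, 870.0
slow = doppler_shifted(860.0, 2.0)    # approaching at 2 m/s: roughly 865 Hz
fast = doppler_shifted(860.0, 10.0)   # approaching at 10 m/s: roughly 886 Hz
slow_ok = in_bin(slow, LOW, HIGH)
fast_ok = in_bin(fast, LOW, HIGH)
```

Under these assumptions a slow drift stays within the bin while a 10 m/s approach does not, and the shift grows in proportion to the harmonic frequency, so higher harmonics leave their bins first.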

2.2. Feature Extraction Based on Harmonic Features

The audio signal is divided into multiple frames, each of duration $T_{f}$ seconds. The number of samples per frame, $N_{f}$, is calculated as follows:

$$N_{f} = \mathrm{round}(T_{f} \times f_{s}) \tag{2}$$

where $f_{s}$ is the sampling frequency and $\mathrm{round}(\cdot)$ is the rounding function.

For an audio signal of length $L$ samples, the total number of frames $M$ is obtained as follows:

$$M = \left\lfloor \frac{L}{N_{f}} \right\rfloor \tag{3}$$

where $\lfloor \cdot \rfloor$ denotes the floor function.

The signal $x_{n}(t)$ of each frame is extracted from the audio signal:

$$x_{n}(t) = x[(n - 1) \times N_{f} + 1 : n \times N_{f}], \quad n = 1, 2, \dots, M \tag{4}$$
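The framing step in Equations (2) to (4) can be sketched as follows. The function name is hypothetical, and the slices are 0-based, whereas Equation (4) uses 1-based indexing; trailing samples that do not fill a whole frame are dropped, as implied by the floor in Equation (3).

```python
def split_frames(x, frame_seconds, fs):
    """Split a signal into non-overlapping frames following Equations (2)-(4)."""
    n_f = round(frame_seconds * fs)   # Equation (2): samples per frame
    m = len(x) // n_f                 # Equation (3): number of whole frames
    # Equation (4), written with 0-based Python slices.
    return [x[i * n_f:(i + 1) * n_f] for i in range(m)]


# With the paper's settings (0.3 s frames at 48 kHz) each frame holds 14,400 samples.
samples_per_frame = round(0.3 * 48000)
demo_frames = split_frames(list(range(10)), frame_seconds=0.3, fs=10)
```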

For each frame $x_{n}(t)$, the power spectral density is calculated using the Welch method.

First, a window function $\omega(t)$ is applied to each frame:

$$x_{n}^{'}(t) = x_{n}(t) \cdot \omega(t) \tag{5}$$

The Fast Fourier Transform (FFT) is then applied to the windowed signal to obtain the frequency-domain representation $x_{n}(f)$:

$$x_{n}(f) = \sum_{k=0}^{P-1} x_{n}^{'}(k) \, e^{-\frac{j 2 \pi f k}{P}} \tag{6}$$

$$F = \frac{f_{s}}{P} \tag{7}$$

Here $k$ is the sample index within the windowed frame, $P$ is the number of FFT points, and $F$ is the frequency resolution (the spacing between adjacent frequency bins); when $P$ is reduced, $F$ increases, i.e., the frequency bins become wider. The power spectral density $P_{xx}^{n}(f)$ of each frame is then calculated:

$$P_{xx}^{n}(f) = \frac{|x_{n}(f)|^{2}}{F} \tag{8}$$
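A minimal sketch of Equations (5) to (8) for a single frame is given below. It follows the equations as written, i.e., one windowed transform per frame, and does not reproduce the segment averaging of a full Welch estimator. The Hann window is an assumption (the paper does not name the window), the function name is hypothetical, and a naive DFT stands in for the FFT so the example needs no external libraries.

```python
import cmath
import math


def frame_psd(frame, fs, n_fft):
    """Return (frequencies, PSD) of one frame following Equations (5)-(8)."""
    n = len(frame)
    if n_fft < n:
        raise ValueError("n_fft must be at least the frame length")
    # Equation (5): apply a window (Hann, chosen here as an assumption).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [s * w for s, w in zip(frame, window)]
    windowed += [0.0] * (n_fft - n)            # zero-pad up to P points
    resolution = fs / n_fft                    # Equation (7): F = fs / P
    freqs, psd = [], []
    for k in range(n_fft // 2 + 1):            # non-negative frequencies only
        # Equation (6): discrete Fourier transform at bin k.
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i in range(n_fft))
        freqs.append(k * resolution)
        psd.append(abs(acc) ** 2 / resolution)  # Equation (8)
    return freqs, psd


# Toy frame: a 100 Hz tone sampled at 800 Hz (64 samples, 12.5 Hz bins).
tone = [math.sin(2.0 * math.pi * 100.0 * i / 800.0) for i in range(64)]
freqs, psd = frame_psd(tone, fs=800, n_fft=64)
peak_frequency = freqs[psd.index(max(psd))]
```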

To obtain a smoother spectrum curve, the power spectral density $P_{xx}(f)$ is smoothed using a low-pass filter to obtain the envelope $E(f)$. The low-pass filter $H(f)$, with cut-off frequency $f_{c}$, is:

$$H(f) = \frac{b_{0} + b_{1} z^{-1} + \dots + b_{M} z^{-M}}{1 + a_{1} z^{-1} + \dots + a_{N} z^{-N}} \tag{9}$$

where $b_{i}$ and $a_{i}$ are the filter coefficients and $z$ is the complex frequency variable. In Equation (9), $M$ and $N$ denote the filter orders, not the frame count and harmonic count used earlier.

To eliminate phase distortion, the bidirectional filter function $\mathrm{filtfilt}(\cdot)$ is used:

$$E(f) = \mathrm{filtfilt}(b, a, P_{xx}(f)) \tag{10}$$

Then, the envelope is normalized by its maximum value:

$$E^{'}(f) = \frac{E(f)}{E(f)_{max}} \tag{11}$$
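The idea behind Equations (9) to (11) can be sketched with a forward-backward pass of a simple first-order low-pass filter. This is a simplified stand-in for a `filtfilt`-style routine (it omits the edge padding such routines usually apply), and the filter order and the smoothing factor `alpha` are assumptions, since the paper does not give the coefficients $b$ and $a$.

```python
def lowpass(x, alpha):
    """First-order IIR low-pass: y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]."""
    out, prev = [], x[0]
    for value in x:
        prev = alpha * value + (1.0 - alpha) * prev
        out.append(prev)
    return out


def zero_phase_envelope(psd, alpha=0.3):
    """Smooth forward then backward (cancelling the phase lag) and normalize to a maximum of 1."""
    forward = lowpass(psd, alpha)
    smoothed = lowpass(forward[::-1], alpha)[::-1]   # Equation (10), simplified
    peak = max(smoothed)
    return [value / peak for value in smoothed]      # Equation (11)


# A single spectral line at index 3: after smoothing, the peak stays at index 3.
envelope = zero_phase_envelope([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
peak_index = envelope.index(max(envelope))
```

A single forward pass would shift the smoothed peak toward higher frequencies; running the same filter backward cancels that shift, which matters here because the features are the peak positions themselves.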

Peaks $p_{i}$ greater than 0.1, together with their corresponding frequencies $f_{i}$, are detected in $E^{'}(f)$:

$$p_{i} = E^{'}(f_{i}) \tag{12}$$

where $E^{'}(f_{i})$ is a local maximum. The valley values $v_{i-1}$ and $v_{i+1}$ on either side of the peak $p_{i}$ are then detected:

$$v_{i-1} = \min_{f \text{ to the left of } f_{i}} E^{'}(f), \quad v_{i+1} = \min_{f \text{ to the right of } f_{i}} E^{'}(f) \tag{13}$$

The peaks are filtered using Equation (14):

$$p_{i} \geq 3 \times \frac{v_{i-1} + v_{i+1}}{2} \tag{14}$$
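The peak selection in Equations (12) to (14) can be sketched as below. Equation (13) does not say how far to the left and right the minimum is taken; this sketch assumes the nearest valley on each side, found by walking downhill from the peak. The function name and the toy envelope are illustrative.

```python
def prominent_peaks(envelope, freqs, min_height=0.1, ratio=3.0):
    """Return frequencies of local maxima that pass the height and valley tests."""
    kept = []
    last = len(envelope) - 1
    for i in range(1, last):
        is_peak = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        if not is_peak or envelope[i] <= min_height:   # Equation (12) with the 0.1 threshold
            continue
        left = i
        while left > 0 and envelope[left - 1] <= envelope[left]:
            left -= 1                                  # nearest valley on the left (assumed)
        right = i
        while right < last and envelope[right + 1] <= envelope[right]:
            right += 1                                 # nearest valley on the right (assumed)
        valley_mean = (envelope[left] + envelope[right]) / 2.0   # Equation (13)
        if envelope[i] >= ratio * valley_mean:         # Equation (14)
            kept.append(freqs[i])
    return kept


# Toy normalized envelope on a 10 Hz grid: one sharp peak and two weak bumps.
toy_env = [0.0, 0.05, 1.0, 0.05, 0.2, 0.3, 0.2, 0.25, 0.08, 0.02]
toy_freqs = [10 * i for i in range(len(toy_env))]
selected = prominent_peaks(toy_env, toy_freqs)
```

In the toy envelope only the sharp peak at 20 Hz survives; the two bumps exceed the 0.1 threshold but are less than three times their mean valley level, so Equation (14) rejects them.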

Each filtered peak frequency $f_{i}$ is mapped to the predefined harmonic frequency range matrix $R_{j} = [f_{low}^{j}, f_{high}^{j}]$, $j = 1, 2, \dots, u_{1}$, of UAV A by checking whether the peak frequency falls into each frequency range:

$$y_{j} = \begin{cases} f_{i} & \text{if } f_{low}^{j} \leq f_{i} \leq f_{high}^{j} \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

The feature vector of UAV A extracted from each frame is $[y_{1}, y_{2}, \dots, y_{u_{1}}]$. The above steps are repeated to obtain the feature vector of UAV B, $[y_{1}, y_{2}, \dots, y_{u_{2}}]$, and the feature vector of UAV C, $[y_{1}, y_{2}, \dots, y_{u_{3}}]$.

These three vectors are concatenated to form $[y_{1}, y_{2}, \dots, y_{u}]$, where $u = u_{1} + u_{2} + u_{3}$.
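Equation (15) and the concatenation step can be sketched as follows. The range centers reuse harmonic frequencies reported in this paper, but the ±10 Hz widths, the function name, and the rule of keeping the first peak when several fall into one range are assumptions made for illustration.

```python
def harmonic_features(peak_freqs, ranges):
    """Equation (15): for each range, keep a matching peak frequency or 0."""
    features = []
    for low, high in ranges:
        hits = [f for f in peak_freqs if low <= f <= high]
        features.append(hits[0] if hits else 0.0)   # first hit kept (assumption)
    return features


# Hypothetical +/-10 Hz ranges around harmonics reported in the paper.
RANGES_PHANTOM = [(334, 354), (506, 526), (678, 698), (850, 870)]
RANGES_MAVIC = [(350, 370), (710, 730), (1070, 1090)]

peaks = [345.0, 690.0, 1000.0]
frame_vector = (harmonic_features(peaks, RANGES_PHANTOM)
                + harmonic_features(peaks, RANGES_MAVIC))   # concatenation of per-UAV vectors
```

Each frame therefore yields a fixed-length vector regardless of how many peaks were detected, with zeros marking harmonics that were not found.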

The feature vectors extracted from 10 consecutive frames are used to form a two-dimensional feature matrix:

$$\begin{bmatrix} y_{1,1} & y_{1,2} & \dots & y_{1,u-1} & y_{1,u} \\ y_{2,1} & y_{2,2} & \dots & y_{2,u-1} & y_{2,u} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ y_{10,1} & y_{10,2} & \dots & y_{10,u-1} & y_{10,u} \end{bmatrix} \tag{16}$$

Each training sample for the CNN is therefore a two-dimensional feature matrix of size $10 \times u$.
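Grouping per-frame vectors into samples as in Equation (16) can be sketched as below. The sketch assumes non-overlapping groups of 10 frames and discards any leftover frames; the paper does not say whether consecutive samples overlap, and the function name is hypothetical.

```python
def stack_samples(frame_vectors, frames_per_sample=10):
    """Group consecutive per-frame vectors into 2D samples of shape (10, u)."""
    count = len(frame_vectors) // frames_per_sample
    return [frame_vectors[i * frames_per_sample:(i + 1) * frames_per_sample]
            for i in range(count)]


# 25 frames with u = 22 features each give two complete 10 x 22 samples.
vectors = [[0.0] * 22 for _ in range(25)]
samples = stack_samples(vectors)
```

With 0.3 s frames, each sample built this way covers 3 s of audio.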

2.3. CNN Classifier

A CNN is a deep learning model that is especially suitable for processing grid-structured data, and it is used here to process the feature matrices. The overall structure of the CNN classifier used in this paper is shown in Table 2.

Convolutional layers are the core components of CNN and are used to extract features from the input data. A convolutional layer slides over the input data through a filter (convolution kernel) to perform a convolution operation and generate a Feature Map. Each convolution kernel is responsible for capturing different features of the image such as edges, textures, etc.

The convolution kernel (filter) of the first convolutional layer has size $3 \times 3$, and there are 8 filters in total. The weight matrix of the $k$th filter is $W_{k}^{1} \in \mathbb{R}^{3 \times 3}$, its bias is $b_{k}^{1} \in \mathbb{R}$, and the output of the convolution operation, $Y^{1}$, is:

$$Y_{i,j,k}^{1} = \sum_{m=1}^{3} \sum_{n=1}^{3} X_{i+m-1, j+n-1} \cdot W_{m,n,k}^{1} + b_{k}^{1} \tag{17}$$

where $i$ and $j$ are the position indices of the output matrix, $k$ is the index of the filter, and $m$ and $n$ are the row and column indices within the convolution kernel. The output dimension remains the same as that of the input (zero padding is applied), so $Y^{1} \in \mathbb{R}^{10 \times u \times 8}$.
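Equation (17) with zero padding can be sketched in plain Python as follows. The function name and the all-ones test kernel are illustrative assumptions; a real implementation would use a deep learning framework, but the loop form makes the index arithmetic explicit.

```python
def conv2d_same(x, kernels, biases):
    """Apply 3x3 filters with zero padding (Equation (17)); returns one map per filter."""
    height, width = len(x), len(x[0])
    # Surround the input with a one-element border of zeros so the output keeps its size.
    padded = ([[0.0] * (width + 2)]
              + [[0.0] + list(row) + [0.0] for row in x]
              + [[0.0] * (width + 2)])
    maps = []
    for kernel, bias in zip(kernels, biases):
        fmap = [[sum(padded[i + m][j + n] * kernel[m][n]
                     for m in range(3) for n in range(3)) + bias
                 for j in range(width)]
                for i in range(height)]
        maps.append(fmap)
    return maps


# A 10 x 22 input of ones and 8 identical all-ones kernels (illustrative values).
ones_input = [[1.0] * 22 for _ in range(10)]
ones_kernel = [[1.0] * 3 for _ in range(3)]
feature_maps = conv2d_same(ones_input, [ones_kernel] * 8, [0.0] * 8)
```

With this input, interior outputs equal 9 and corner outputs equal 4, because five of the nine kernel taps at a corner fall on the zero padding.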

Batch normalization is applied to the output of the convolutional layer to speed up training. The normalized output $Z^{1}$ is given by:

$$\mu_{k} = \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} Y_{i,j,k}^{1}}{H W} \tag{18}$$

$$\sigma_{k}^{2} = \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} (Y_{i,j,k}^{1} - \mu_{k})^{2}}{H W} \tag{19}$$

$$\hat{Y}_{i,j,k}^{1} = \frac{Y_{i,j,k}^{1} - \mu_{k}}{\sqrt{\sigma_{k}^{2} + \epsilon}} \tag{20}$$

$$Z_{i,j,k}^{1} = \gamma_{k} \hat{Y}_{i,j,k}^{1} + \beta_{k} \tag{21}$$

where $\gamma_{k}$ and $\beta_{k}$ are learnable parameters, $\epsilon$ is a constant that prevents division by zero, $\mu_{k}$ is the mean of the $k$th channel, $\sigma_{k}^{2}$ is the variance of the $k$th channel, and $H$ and $W$ are the spatial dimensions.
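Equations (18) to (21) can be sketched for a single channel as below. The sketch follows the equations as written, computing the statistics over the spatial positions of one feature map; batch normalization layers in common frameworks also pool these statistics across the samples of a mini-batch. The function name and default parameter values are assumptions.

```python
import math


def normalize_channel(fmap, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one H x W feature map following Equations (18)-(21)."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)                                # Equation (18)
    variance = sum((v - mean) ** 2 for v in values) / len(values)   # Equation (19)
    scale = math.sqrt(variance + eps)
    # Equations (20) and (21): standardize, then apply the learnable scale and shift.
    return [[gamma * (v - mean) / scale + beta for v in row] for row in fmap]


normalized = normalize_channel([[1.0, 2.0], [3.0, 4.0]])
```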

The ReLU activation function sets negative values to 0:

$$A_{i,j,k}^{1} = \max(0, Z_{i,j,k}^{1}) \tag{22}$$

The output after activation is $A^{1} \in \mathbb{R}^{10 \times u \times 8}$.

The pooling layer halves the output size by extracting the maximum value of each 2 × 2 region of $A^{1}$. The pooling output $P^{1}$ is:

$$P_{i,j,k}^{1} = \max\{A_{2i-1, 2j-1, k}^{1}, A_{2i-1, 2j, k}^{1}, A_{2i, 2j-1, k}^{1}, A_{2i, 2j, k}^{1}\} \tag{23}$$

The output dimension is $P^{1} \in \mathbb{R}^{5 \times u/2 \times 8}$.
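Equations (22) and (23) can be sketched for one channel as follows. The function names are hypothetical, and the indices are 0-based, so the 2 × 2 block for output position (i, j) starts at input position (2i, 2j).

```python
def relu(fmap):
    """Equation (22): replace negative values with 0."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Equation (23): maximum over each non-overlapping 2 x 2 block."""
    rows, cols = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
                 fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
             for j in range(cols)]
            for i in range(rows)]


pooled = max_pool_2x2(relu([[-1.0, 2.0], [3.0, -4.0]]))
pooled_shape_demo = max_pool_2x2([[0.0] * 22 for _ in range(10)])   # 10 x 22 -> 5 x 11
```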

The second convolutional layer operates similarly to the first, with the kernel size still being $3 \times 3$ but the number of filters increased to 16. With $P^{1}$ as the input, the output $Y^{2}$ after convolution is given by:

$$Y_{i,j,k}^{2} = \sum_{c=1}^{8} \sum_{m=1}^{3} \sum_{n=1}^{3} P_{i+m-1, j+n-1, c}^{1} \cdot W_{m,n,c,k}^{2} + b_{k}^{2} \tag{24}$$

where $c$ indexes the 8 input channels of $P^{1}$ and $k$ indexes the 16 filters. The output dimension is $Y^{2} \in \mathbb{R}^{5 \times u/2 \times 16}$. Batch normalization and ReLU activation are also applied to $Y^{2}$ to generate the output $A^{2} \in \mathbb{R}^{5 \times u/2 \times 16}$.

In the fully connected layer, the convolutional output is flattened into a vector $\mathrm{vec}(A^{2})$ of dimension $5 \times u/2 \times 16 = 40u$. The output of the fully connected layer is:

$$F = W^{4} \cdot \mathrm{vec}(A^{2}) + b^{4} \tag{25}$$

where $W^{4} \in \mathbb{R}^{4 \times 40u}$ is the weight matrix and $b^{4} \in \mathbb{R}^{4 \times 1}$ is the bias vector.
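The dimensions stated in this section can be traced with a short sketch. It assumes the layer order described in the text (convolution, pooling, second convolution, flatten) and the function name is hypothetical; for the 22-dimensional feature vector used in the experiments, the flattened length is 880, which matches $40u$.

```python
def cnn_shapes(u, frames=10):
    """Trace the tensor shapes described in Section 2.3 for a frames x u input."""
    conv1 = (frames, u, 8)                    # zero-padded 3x3 convolution, 8 filters
    pool1 = (frames // 2, u // 2, 8)          # 2x2 max pooling halves both spatial sizes
    conv2 = (pool1[0], pool1[1], 16)          # zero-padded 3x3 convolution, 16 filters
    flat = conv2[0] * conv2[1] * conv2[2]     # length of vec(A^2)
    fc_weights = (4, flat)                    # shape of W^4 for the 4 output classes
    return conv1, pool1, conv2, flat, fc_weights


shapes = cnn_shapes(22)
```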

The Softmax function is used to convert the output of the fully connected layer into a probability distribution:

$$\hat{y}_{i} = \frac{e^{F_{i}}}{\sum_{j=1}^{C} e^{F_{j}}} \tag{26}$$

where $\hat{y}_{i}$ is the predicted probability of the $i$th class and $C = 4$ is the number of classes.

The Softmax function ensures that the sum of all output probabilities is 1, that is:

$$\sum_{i=1}^{C} \hat{y}_{i} = 1 \tag{27}$$

The model is trained with the Categorical Cross-Entropy loss function, calculated as:

$$L = - \sum_{i=1}^{C} t_{i} \log(\hat{y}_{i}) \tag{28}$$

where $t_{i}$ is the true label (one-hot encoded) and $\hat{y}_{i}$ is the predicted probability. This loss function is chosen for its efficient gradient convergence in multi-class classification tasks where target classes are mutually exclusive.

$\hat{y}_{i}$ is used to determine the predicted class of an input sample, usually by choosing the class with the highest probability as the final prediction:

$$\text{Final Forecast Category} = \arg\max_{i} \hat{y}_{i} \tag{29}$$
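Equations (26) to (29) can be sketched as below. The logit values are made up for illustration, the function names are hypothetical, and subtracting the maximum logit before exponentiation is a common numerical-stability step that does not change the result of Equation (26).

```python
import math


def softmax(logits):
    """Equation (26), computed with the maximum subtracted for numerical stability."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Equation (28) for a one-hot label: only the true class term is non-zero."""
    return -math.log(probs[true_index])


def predict(probs):
    """Equation (29): index of the most probable class (0-based)."""
    return max(range(len(probs)), key=probs.__getitem__)


probs = softmax([1.0, 2.0, 0.5, 3.0])   # hypothetical outputs F for the 4 classes
predicted = predict(probs)
```

Because the label is one-hot, the loss reduces to the negative log-probability assigned to the true class, so it is small when the prediction is confident and correct.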

3. Experiment and Analysis

The UAV models utilized in this study include the DJI Phantom 4 Pro V2.0, DJI Mavic Pro and DJI Air 2S, as shown in Figure 4g, Figure 4h and Figure 4i respectively. The microphone sensor employed is the LM1111, with a sampling frequency set at 48 kHz, as shown in Figure 4b. The DJI Phantom 4 Pro V2.0 was positioned at distances of 5 m, 40 m, 50 m, 60 m, and 70 m from the microphone; the DJI Air 2S was positioned at distances of 5 m, 6 m, 7 m, 8 m, 9 m, 10 m, 11 m, 12 m, 30 m, 40 m and 50 m from the microphone, while the DJI Mavic Pro was stationed at distances of both 5 m and 50 m from the microphone. Environmental noise was recorded directly when the drones were not in operation. The experimental scenes are shown in Figure 4a,c–f. The duration of environmental noise audio amounted to 1614 s; conversely, the audio recordings for the DJI Phantom 4 Pro V2.0, DJI Air 2S, and DJI Mavic Pro lasted for 486 s, 1374 s, and 264 s, respectively. The total duration of audio used for training the CNN was 3738 s.

For the sake of convenience, the harmonic features proposed in this paper are referred to as HF. The HF dimension is 10 × 22, with a feature vector of size 1 × 22 extracted from each frame. The resulting feature matrix consists of 10 feature vectors, and each frame has a duration of 0.3 s.

In the comparative experiments, the MFCC dimension is set at 100 × 22, where a feature vector of size 1 × 22 is also extracted from each frame. This feature matrix comprises 100 feature vectors, with each frame lasting for 25 ms and an overlap of 15 ms between consecutive frames.

In the additional comparative experiments, the Log-Mel Spectrogram dimension is set at 300 × 64, where a feature vector of size 1 × 64 is extracted from each frame. This feature matrix comprises 300 feature vectors, with each frame lasting for 25 ms and an overlap of 15 ms between consecutive frames. Furthermore, a Raw Waveform CNN based on the M5 architecture was implemented as an end-to-end deep learning baseline. Unlike the feature-based approaches, this model accepts raw time-domain waveforms directly as input. The input dimension is set to 14,400 × 1, corresponding to a 0.3-s audio segment at a sampling rate of 48 kHz. To effectively capture the low-frequency acoustic characteristics of UAVs, the first convolutional layer of the M5 model is configured with a large kernel size of 80.
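The time span covered by one sample of each feature type can be checked with a few lines of arithmetic. The sketch assumes conventional sliding-window framing, where the hop equals the frame length minus the overlap; the function name is hypothetical and the values are integer milliseconds taken from the settings above.

```python
def span_ms(n_frames, frame_ms, overlap_ms=0):
    """Total duration covered by n_frames with the given frame length and overlap."""
    hop_ms = frame_ms - overlap_ms
    return frame_ms + (n_frames - 1) * hop_ms


hf_span = span_ms(10, 300)            # HF: 10 non-overlapping 0.3 s frames
mfcc_span = span_ms(100, 25, 15)      # MFCC: 100 frames, 25 ms long, 15 ms overlap
logmel_span = span_ms(300, 25, 15)    # Log-Mel: 300 frames, 25 ms long, 15 ms overlap
raw_samples = round(0.3 * 48000)      # Raw Waveform CNN input length at 48 kHz
```

Under these assumptions an HF sample and a Log-Mel sample each cover about 3 s of audio, an MFCC sample covers about 1 s, and the raw-waveform input covers 0.3 s, so the baselines do not all see the same amount of signal per decision.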

Figure 5 presents the spectrograms of various UAV models at different distances. The energy is concentrated within a specific frequency range. When the DJI Phantom 4 Pro V2.0 hovers at a distance of 120 m from the microphone, high-order harmonics with frequencies of 344 Hz, 516 Hz, 688 Hz, and 860 Hz are still present in the generated sound signal and exhibit minimal variation over time. Similarly, when the DJI Air 2S hovers at a distance of 80 m from the microphone, high-order harmonics with frequencies of 460 Hz, 700 Hz, 920 Hz, 1160 Hz, and 1380 Hz remain evident in the generated sound signal and show little change over time. Furthermore, when the DJI Mavic Pro hovers at a distance of 60 m from the microphone, high-order harmonics with frequencies of 360 Hz, 720 Hz, and 1080 Hz persist in the generated sound signal without significant temporal variation.

As illustrated in Figure 6, the HF-based feature extraction method proposed in this paper effectively captures the harmonic features present in the spectrograms of sound signals from various types of UAVs at different distances. Comparing the HF data from the DJI Phantom 4 Pro V2.0 at 120 m with that at 5 m, the harmonics at 344 Hz, 516 Hz, 688 Hz, and 860 Hz are still detectable and can be successfully extracted. Similarly, for the DJI Air 2S at a distance of 80 m compared with 5 m, the harmonics at 460 Hz, 700 Hz, 920 Hz, 1160 Hz, and 1380 Hz remain identifiable and extractable. Likewise, for the DJI Mavic Pro at 60 m compared with 5 m, the harmonics at 360 Hz, 720 Hz, and 1080 Hz persist and can be effectively extracted.

Figure 7 presents the confusion matrix for the validation set, where the labels “1,” “2,” “3,” and “4” correspond to “Noise,” “DJI Air 2S,” “DJI Mavic Pro,” and “DJI Phantom 4 Pro V2.0,” respectively. In this experiment, each sample is constructed by aggregating 10 consecutive frames (with a single frame duration of 0.3 s), forming a robust feature matrix. The validation set was generated through random splitting of the dataset. Consequently, the model achieved 100% recognition accuracy on the Mavic Pro in this internal validation phase, demonstrating excellent convergence and feature learning capability. However, it must be noted that this result reflects the model’s performance on data with a similar distribution to the training set. The rigorous evaluation of the model’s generalization ability under severe signal attenuation (long-distance scenarios) is presented separately in Table 3.

The audio recordings of the three distinct UAV models listed in Table 3 were not utilized during the CNN training process. The proposed HF method demonstrated superior performance, reaching a recognition rate of 78.03% for the DJI Phantom 4 Pro V2.0 at 120 m, 96.28% for the DJI Air 2S at 80 m, and 81.82% for the DJI Mavic Pro at 60 m, confirming the feasibility of the scheme for long-distance identification. In contrast, as presented in Table 3, the recognition accuracy of the three comparative baselines—MFCC, Log-Mel Spectrogram, and Raw Waveform CNN (M5 architecture)—proved to be significantly less stable. While the modern Raw Waveform CNN (M5) achieved a competitive accuracy of 77.97% for the Phantom 4 Pro V2.0, it failed to generalize to the quieter DJI Air 2S and DJI Mavic Pro, dropping to 48.82% and 23.14%, respectively. Similarly, the Log-Mel Spectrogram method yielded inconsistent results, achieving only 48.57% for the Phantom 4 Pro V2.0 and 22.73% for the Mavic Pro. The traditional MFCC method performed poorest on the Mavic Pro with an accuracy of merely 1.47%. This extremely low figure is not an anomaly but reflects a systematic feature collapse in low-SNR conditions. Detailed analysis reveals that 38.24% of the Mavic Pro samples were misclassified as the DJI Phantom 4 Pro V2.0 due to spectral similarity after attenuation. Collectively, these comparisons suggest that incorporating domain-specific physical harmonic features offers a significant advantage in low Signal-to-Noise Ratio (SNR) environments. The proposed HF method effectively mitigates the impact of signal attenuation, demonstrating higher stability for long-distance monitoring compared to purely data-driven end-to-end models (like M5) and traditional features.

4. Discussion

(1) The proposed method relies on predefined harmonic frequency bins and is therefore sensitive to variations in the acoustic signature. This assumption is supported by the time-frequency analysis presented in Figure 5 and Figure 6. The spectrograms demonstrate that while the signal amplitude attenuates significantly at 120 m, the harmonic peaks remain centered at the same frequency bins as those observed at close range (5 m). This spectral consistency confirms that harmonic frequency distribution is invariant to distance attenuation within the tested range. Physical changes to the UAV, such as added payloads or different propellers, may cause harmonic shifts that exceed the target windows, requiring parameter recalibration and model retraining. Consequently, this approach is designed for recognizing known UAV models rather than arbitrary or unknown targets. Generalization to new models necessitates prior analysis of their harmonic structures to define appropriate frequency parameters.
(2) This study is also constrained by the dataset size and the use of a single physical unit per UAV model at specific distances. While the long-duration recordings employed in this study capture some temporal fluctuations in environmental noise, future research should validate these findings on a larger, more diverse dataset. This includes testing across varying distances and weather conditions to fully assess robustness against environmental variation and SNR variability.

5. Conclusions

In order to address the issue of low accuracy in long-distance UAV sound recognition, this paper proposes a method for long-distance UAV sound recognition based on HF. The proposed method performs harmonic analysis on the UAV’s own sound signal and selects parameters that are less susceptible to distance-related factors as the primary components for feature extraction. This approach aims to minimize information redundancy and mitigate significant fluctuations in the feature vector. In the frequency domain, interference from spectral lines near high-order harmonic peaks is suppressed. Initially, prominent peaks are identified and reserved; subsequently, peaks close to the UAV’s high-order harmonic frequencies are selected. This process results in the formation of a feature vector. A two-dimensional feature matrix is constructed using feature vectors from 10 consecutive frames, with CNN chosen as the classifier. The experimental results demonstrate that the method presented in this study can accurately identify the DJI Phantom 4 Pro V2.0 UAV with an accuracy rate of 78.03% at a distance of 120 m. Furthermore, comparative analyses involving MFCC, Log-Mel Spectrogram, and a modern Raw Waveform CNN (M5 architecture) provide additional context to the performance of the proposed approach. While the data-driven baselines demonstrated effectiveness in certain conditions, they also exhibited sensitivity to the low Signal-to-Noise Ratio (SNR) inherent in long-distance scenarios (e.g., the Raw Waveform CNN recording 23.14% for the DJI Mavic Pro). In comparison, the proposed method maintained relatively stable recognition rates across different UAV models. These findings suggest that integrating physical harmonic priors serves as a beneficial strategy for enhancing the reliability of UAV sound recognition in complex, long-distance environments.

Author Contributions

Conceptualization: K.F. and W.P. Methodology: K.F. and J.Z. Software: K.F. and Z.Z. Validation: K.F., W.P. and W.C. Writing—original draft preparation: K.F. and W.P. Writing—review and editing: J.Z., Z.Z. and W.C. Funding acquisition: K.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [No. 62363014, No. 61763018]; the Key Plan Project of Science and Technology of Ganzhou [GZ2024ZDZ 008]; and the Key Laboratory of Low Dimensional Quantum Materials and Sensor Devices of Jiangxi Education Institutes [No. GanJiaoKeZi-20241301].

Data Availability Statement

The data are available upon request.

Conflicts of Interest

Author Kuangang Fan is employed by the company Jiangsu Golden-Keen Eyes Intelligent and Control Technology Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Spectra of UAV and human voice. (a) Spectrum of DJI Phantom 4 Pro V2.0; (b) Spectrum of DJI Air 2S; (c) Spectrum of DJI Mavic Pro; (d) Spectrum of human voice.

Figure 2. Flowchart of Feature Extraction (Taking DJI Phantom 4 Pro V2.0 as an Example).

Figure 3. Flowchart of Feature Extraction (Taking Three Typical UAVs as Examples).

Figure 4. Experimental scene diagram. (a) Data acquisition system and laptop; (b) Close-up of the audio interface; (c) Overview of the outdoor experimental environment; (d) Side view of the equipment setup; (e) Front view of the equipment setup; (f) Flight scenario on a paved road; (g) The white target drone (DJI Phantom 4 Pro V2.0); (h) The red target drone (DJI Mavic Pro); (i) The grey target drone (DJI Air 2S).

Figure 5. Spectrograms of UAV sounds. (a) Spectrogram of DJI Phantom 4 Pro V2.0 at 5 m; (b) Spectrogram of DJI Phantom 4 Pro V2.0 at 120 m; (c) Spectrogram of DJI Air 2S at 5 m; (d) Spectrogram of DJI Air 2S at 80 m; (e) Spectrogram of DJI Mavic Pro at 5 m; (f) Spectrogram of DJI Mavic Pro at 60 m.

Figure 6. HF of UAV. (a) HF of DJI Phantom 4 Pro V2.0 at 5 m; (b) HF of DJI Phantom 4 Pro V2.0 at 120 m; (c) HF of DJI Air 2S at 5 m; (d) HF of DJI Air 2S at 80 m; (e) HF of DJI Mavic Pro at 5 m; (f) HF of DJI Mavic Pro at 60 m.

Figure 7. Confusion matrix diagram of the validation set. Note: This matrix reflects the model’s convergence on randomly sampled validation data; for the model’s performance in actual long-distance scenarios, please refer to Table 3.

Table 1. Comparison of mainstream UAV detection technologies.

UAV Feature	Device	Advantages	Disadvantages	Range
Infrared	Infrared Camera	Unaffected by weather; Long detection distance.	Low recognition accuracy.	<1 km
Optical	Optical Camera	Low cost; Miniaturizable; High recognition accuracy.	Susceptible to weather; Easily occluded.	<2 km
RF	USRP	Unaffected by obstacles; Identification of drone controller.	Cannot detect autonomous drones; High cost.	<5 km
Radar	Radar	Unaffected by weather; Long detection distance.	High cost; Easily occluded; Equipment restricted.	3–20 km
Acoustic	Acoustic Array	Low cost; Miniaturizable; Easy to combine with other methods.	Short detection range; Susceptible to noise interference.	<0.2 km

Table 2. CNN classifier structure.

Layer	Type	Output Dimension
Level 1	Input	$10 \times 22 \times 1$
Level 2	2D convolution	$10 \times 22 \times 8$
Level 3	Batch normalization	$10 \times 22 \times 8$
Level 4	ReLU	$10 \times 22 \times 8$
Level 5	2D Max pooling	$5 \times 11 \times 8$
Level 6	2D convolution	$5 \times 11 \times 16$
Level 7	Batch normalization	$5 \times 11 \times 16$
Level 8	ReLU	$5 \times 11 \times 16$
Level 9	Fully connected	$1 \times 1 \times 4$
Level 10	Softmax	$1 \times 1 \times 4$
Level 11	Categorical output	$1 \times 1 \times 4$
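
The dimensions in Table 2 can be checked with a short shape trace. The sketch below assumes stride-1 convolutions with "same" padding and a 2 × 2 max pooling with stride 2, which are the simplest settings that reproduce the listed sizes. The table does not state kernel sizes, strides, or padding, so these are assumptions, and the helper names are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def conv_same(shape: Shape, out_channels: int) -> Shape:
    """Stride-1 convolution with 'same' padding: spatial size kept, channels replaced."""
    h, w, _ = shape
    return (h, w, out_channels)


def max_pool(shape: Shape, size: int = 2, stride: int = 2) -> Shape:
    """Unpadded max pooling: each spatial dimension shrinks, channels are kept."""
    h, w, c = shape
    return ((h - size) // stride + 1, (w - size) // stride + 1, c)


def fully_connected(shape: Shape, n_classes: int) -> Tuple[Shape, int]:
    """Flatten the input and map it to n_classes outputs; also return weights + biases."""
    h, w, c = shape
    n_inputs = h * w * c
    return (1, 1, n_classes), n_inputs * n_classes + n_classes


def trace_shapes(input_shape: Shape = (10, 22, 1), n_classes: int = 4):
    """Walk through the layer sequence of Table 2 and record each output shape."""
    x = input_shape
    rows: List[Tuple[str, Shape]] = [("Input", x)]
    for name, step in (
        ("2D convolution", lambda s: conv_same(s, 8)),
        ("Batch normalization", lambda s: s),  # shape-preserving
        ("ReLU", lambda s: s),  # shape-preserving
        ("2D Max pooling", max_pool),
        ("2D convolution", lambda s: conv_same(s, 16)),
        ("Batch normalization", lambda s: s),
        ("ReLU", lambda s: s),
    ):
        x = step(x)
        rows.append((name, x))
    out, fc_params = fully_connected(x, n_classes)
    rows += [("Fully connected", out), ("Softmax", out), ("Categorical output", out)]
    return rows, fc_params


rows, fc_params = trace_shapes()
for level, (layer_type, shape) in enumerate(rows, start=1):
    print(f"Level {level}\t{layer_type}\t{shape[0]} x {shape[1]} x {shape[2]}")
```

Under these assumptions, the fully connected layer receives 5 × 11 × 16 = 880 values and maps them to the four output classes.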

Table 3. Comparison of recognition accuracy among Proposed HF, MFCC, Log-Mel Spectrogram, and Raw Waveform CNN.

UAV Model	Distance	HF Accuracy	MFCC Accuracy	Log-Mel Spectrogram Accuracy	Raw Waveform CNN (M5) Accuracy
DJI Phantom 4 Pro V2.0	120 m	78.03%	62.00%	48.57%	77.97%
DJI Air 2S	80 m	96.28%	75.89%	88.24%	48.82%
DJI Mavic Pro	60 m	81.82%	1.47%	22.73%	23.14%
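
The margins implied by Table 3 can be recomputed directly from the reported accuracies. The snippet below is only an arithmetic check on the table values: it gives the percentage-point gap between HF and each baseline, and the worst-case accuracy of each method across the three UAV models. The dictionary layout and function names are illustrative choices, not part of the study.

```python
# Accuracies (%) copied from Table 3.
TABLE3 = {
    "DJI Phantom 4 Pro V2.0 (120 m)": {
        "HF": 78.03, "MFCC": 62.00, "Log-Mel": 48.57, "Raw CNN (M5)": 77.97,
    },
    "DJI Air 2S (80 m)": {
        "HF": 96.28, "MFCC": 75.89, "Log-Mel": 88.24, "Raw CNN (M5)": 48.82,
    },
    "DJI Mavic Pro (60 m)": {
        "HF": 81.82, "MFCC": 1.47, "Log-Mel": 22.73, "Raw CNN (M5)": 23.14,
    },
}


def margins_over_baselines(row):
    """Percentage-point difference between HF and every other method in one row."""
    return {
        method: round(row["HF"] - accuracy, 2)
        for method, accuracy in row.items()
        if method != "HF"
    }


def worst_case(table):
    """Lowest accuracy each method reaches across all UAV models."""
    methods = next(iter(table.values())).keys()
    return {m: min(row[m] for row in table.values()) for m in methods}


for model, row in TABLE3.items():
    print(model, margins_over_baselines(row))
print("worst case:", worst_case(TABLE3))
```

The gap over MFCC for the DJI Phantom 4 Pro V2.0 at 120 m is 78.03 − 62.00 = 16.03 percentage points, which matches the improvement quoted in the Highlights. The worst-case view shows that HF is the only method whose accuracy stays above 78% for all three models.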

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
