1. Introduction
Unmanned aerial vehicles (UAVs) have become integral instruments in modern urban management, driven by the rapid expansion of the low-altitude economy. These technologies facilitate diverse applications, including logistics and distribution, environmental monitoring, and urban planning [1]. In the context of urban product supply chains and transportation management, UAVs offer innovative solutions for addressing the challenges associated with the “last mile” delivery segment [2,3,4,5]. Nevertheless, the widespread adoption of UAVs has been accompanied by a significant escalation in security concerns. Unauthorized drone activities, for example, pose substantial risks to public safety by potentially disrupting civil aviation pathways or being exploited for illicit purposes such as unauthorized surveillance and the conveyance of hazardous materials [6]. Consequently, the management of low-altitude airspace and the assurance of urban safety are becoming increasingly reliant on the advancement of reliable and efficient technologies for drone detection and identification [7].
Currently, various UAV detection techniques have been developed, predominantly categorized into active methods such as radar detection, and passive methods including radio frequency (RF) detection [8,9,10] and image recognition [11,12]. By emitting electromagnetic waves and analyzing the reflected signals, radar detection offers significant advantages in terms of long-range detection capability and high sensitivity to moving targets. However, radar systems designed for aircraft detection face significant challenges in identifying drones, as these systems typically employ techniques to suppress unwanted clutter echoes originating from small, slow-moving, low-altitude flying objects—precisely the characteristics that distinguish drones. Additionally, the radar cross section (RCS) of a medium-sized consumer drone closely resembles that of a bird, which can lead to the misclassification of targets [13,14]. RF detection, which involves passively monitoring communication links between drones and their controllers to detect and locate targets, also presents inherent limitations. Notably, this approach is ineffective against drones operating autonomously or utilizing encrypted communications, and furthermore, RF signal decoding is prohibited in numerous countries.
Visual recognition demonstrates superior detection capabilities for low, slow, and small (LSS) UAV targets. Owing to its effectiveness, cost-efficiency, and precision, visual recognition has garnered significant interest from researchers in recent years for applications including the localization, tracking, and early warning of UAV events. To enhance the network’s focus on salient features while suppressing irrelevant information, C. Wang et al. [15] implemented a lightweight UAV cluster identification method incorporating an attention mechanism. Xiao et al. [16] proposed the ST-Trans framework as a robust end-to-end solution for detecting small infrared targets within complex background environments, and extensive ablation studies conducted on the SIRSTD dataset substantiated the effectiveness of the ST-Trans approach in improving detection performance. He et al. [17] addressed the challenge of UAV detection under adverse weather conditions, such as dense fog, by developing a specialized dataset tailored for foggy environments to enhance recognition accuracy and efficiency. Additionally, they proposed an advanced image recognition method that integrates a content-aware dynamic sparse attention module with a dual-routing mechanism, enabling the flexible allocation of computational resources. Although vision-based drone tracking and recognition methods have significantly enhanced detection accuracy, their effectiveness remains constrained under challenging conditions such as fog, snow, rain, or nighttime, due to the inherent limitations of camera systems.
The identification and tracking of drones through the utilization of acoustic sensors and source localization can produce favorable results. To address the challenge of separating mixed signals amid interfering sources, Zahoor Uddin et al. [18] proposed a multi-UAV detection method utilizing acoustic signals. Additionally, they developed a Time-Varying Drone Detection Technique (TVDDT) that leverages variations in the mixing coefficients to perform classification and tracking based on a minimum variation criterion within the channel. The proposed approach was assessed through simulation studies, demonstrating highly favorable performance outcomes. Lin Shi et al. [19] proposed a methodology for UAV identification predicated on propeller noise, employing Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and Hidden Markov Models (HMM) for classification. The features extracted from multiple training datasets were utilized to train the HMM-based classifier. Empirical findings suggest that this approach achieves a relatively high identification accuracy even in noisy environments. Furthermore, microphone arrays demonstrate significant efficacy for both detection and localization tasks due to their superior capabilities in spatial localization. Guo et al. [20] conducted a study on adaptive beamforming techniques applied in both ideal and non-ideal conditions. Initially, beamforming was utilized to localize the sound source, with the acoustic signals captured via microphone arrays. Subsequently, the recorded sounds were classified employing HMM, followed by tracking through adaptive beamforming methods.
Blanchard et al. [21] developed an acoustic detection array comprising a limited number of microphones tuned to specific UAV frequency bands, enabling comprehensive acoustic field reconstruction in all spatial directions for the purpose of localizing and characterizing sound sources. Nevertheless, in complex urban settings, conventional approaches face significant challenges in accurately identifying and localizing a dynamically varying noise source due to the prevalence of reflection, diffraction, and multipath propagation effects, particularly in low-altitude noise scenarios such as UAV flights. Furthermore, the increasing diversity of UAV flight activities necessitates monitoring systems with enhanced real-time processing and multi-target recognition capabilities, thereby further complicating the task of noise source identification.
Among the aforementioned technical approaches, relying solely on a single detection method—such as image recognition or acoustic signal analysis—often encounters significant limitations in practical applications. These limitations include constraints imposed by the operational environment, inadequate real-time responsiveness, and limited robustness against interference. In urban settings, static occlusions—such as buildings and trees—as well as dynamic occlusions caused by the targets themselves, substantially impair the accuracy of visual detection. Additionally, nighttime and low-light conditions result in considerable degradation of detection performance due to diminished image quality. These challenges are particularly pronounced in complex scenarios such as urban low-altitude flight, significantly undermining the robustness and practical applicability of UAV monitoring systems. Consequently, the integrated processing of image recognition and acoustic localization through multimodal fusion techniques, which leverage the complementary and redundant nature of these data sources, can markedly enhance target recognition accuracy, response time, and environmental adaptability. In this context, the acoustic modality offers valuable compensatory information when visual data are limited or compromised.
Numerous studies have investigated audio-visual fusion methodologies for the real-time detection of drones. Fredrik et al. proposed a multi-modal framework that primarily employs thermal infrared and visible-light cameras, leveraging the YOLOv2 algorithm for the detection of drones, aircraft, and birds. Their approach also incorporated an acoustic component, utilizing a microphone in conjunction with a Long Short-Term Memory (LSTM) classifier to recognize acoustic signatures. To reduce false positive detections associated with manned aircraft, the system integrated Automatic Dependent Surveillance-Broadcast (ADS-B) data. The study identified that the principal sources of misclassification originated from insects and cloud formations, indicating that a single-microphone audio classification module exhibits limited efficacy. Moreover, the incorporation of heterogeneous sensors alongside the simultaneous operation of multiple detection models substantially elevates the overall complexity of the system [22]. Liu et al. presented an audio-assisted camera array system for drone detection, consisting of 30 high-definition cameras and 3 microphones mounted on a hemispherical frame. The system fuses HOG visual features with MFCC audio features using SVM classifiers to achieve real-time drone detection with 360-degree sky coverage [12]. Experimental results demonstrate a significant improvement in detection accuracy compared to vision-only methods, with positive sample precision increasing from 79.5% to 95.74%. However, this large-scale system requires 8 workstations for data processing and control.
Detection techniques such as radar and RF systems experience diminished stability at night, in adverse weather, and under occlusion. In contrast, passive sensing modalities, such as acoustic and visual methods, generally demonstrate greater robustness under these challenging circumstances. Nonetheless, each individual modality possesses intrinsic limitations, underscoring the critical role of multimodal fusion approaches in achieving dependable UAV detection within complex environments.
Consequently, this study introduces a novel approach for UAV detection that mitigates the shortcomings inherent in single-modal detection systems within complex urban settings. Conventional visual-based detection techniques frequently encounter difficulties under adverse conditions, including visual occlusion, low-light environments, and intricate backgrounds characterized by multiple sources of interference. To address these challenges, the research develops a multimodal fusion framework that integrates acoustic sensing with computer vision technologies. The proposed methodology not only improves the reliability and accuracy of UAV detection but also enhances the operational range of monitoring systems in practical scenarios, where environmental factors often impair the effectiveness of single-sensor approaches.
2. Materials and Methods
To improve the precision and reliability of UAV detection and localization systems operating in complex environments, this study proposes a multimodal monitoring framework that integrates acoustic beamforming imaging with visual-based recognition. This method leverages the complementary strengths of acoustic and visual sensors to enable efficient detection, classification, and localization of low-altitude UAV targets. Specifically, an array of acoustic sensors acquires multichannel sound signals from the surrounding environment. By employing the Capon beamforming algorithm, the system estimates the spatial spectrum through iterative calculations across grid points, where spectral peaks correspond to the estimated direction of arrival of the sound source. For the continuous narrowband acoustic signals generated by UAVs during flight, this method enhances their spatial features even under strong interference, thereby enabling accurate localization. Simultaneously, a visible-light camera captures aerial imagery synchronously. Using the YOLOv11 deep learning-based object detection algorithm, UAV targets are swiftly detected and classified. YOLOv11 excels at detecting small objects and processing data in real time, accurately segmenting UAV contours, and generating precise bounding boxes even under complex visual backgrounds.
In the fusion stage, this study adopts a spatial-domain alignment mechanism to align the detection outputs from the two sensor modalities. Using the camera’s imaging perspective parameters, the estimated acoustic source directions are projected onto the image plane, allowing the consistency between acoustic localization and visual detection results to be verified. This alignment significantly enhances target recognition accuracy and reduces the false alarm rate. For targets that generate a distinct acoustic signal yet appear visually indistinct or blurred, acoustic localization offers essential preliminary directional information. Conversely, when acoustic localization accuracy declines as a result of environmental noise, visual detection provides complementary information. The integration of these multimodal inputs enhances spatial awareness for UAV detection and improves the system’s resilience in challenging acoustic and visual environments. Consequently, the approach presents a robust technical framework for accurate UAV monitoring and control within urban airspace.
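To make the spatial alignment step concrete, the following minimal Python sketch (not the authors’ implementation) shows how an estimated acoustic direction of arrival could be projected onto the image plane with a pinhole camera model and compared against the center of a YOLO bounding box; the intrinsic parameters, coordinate conventions, and function names are illustrative assumptions, and extrinsic calibration between the array and the camera is omitted.

```python
import numpy as np

def doa_to_pixel(azimuth_deg, elevation_deg, fx, fy, cx, cy):
    """Project a DOA (azimuth/elevation in the camera frame) onto the image plane
    using a pinhole model. Assumes the acoustic array and camera are co-located
    and axis-aligned (extrinsic calibration omitted for brevity)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Unit ray in camera coordinates: x to the right, y downward, z forward.
    ray = np.array([np.cos(el) * np.sin(az), -np.sin(el), np.cos(el) * np.cos(az)])
    if ray[2] <= 0:                      # source behind the camera: no valid projection
        return None
    return fx * ray[0] / ray[2] + cx, fy * ray[1] / ray[2] + cy

def angular_deviation_deg(uv_a, uv_b, fx, fy, cx, cy):
    """Angle between the viewing rays through two pixel locations, e.g. the
    projected acoustic direction and the center of a detected bounding box."""
    def ray(uv):
        r = np.array([(uv[0] - cx) / fx, (uv[1] - cy) / fy, 1.0])
        return r / np.linalg.norm(r)
    cos_ang = np.clip(np.dot(ray(uv_a), ray(uv_b)), -1.0, 1.0)
    return np.degrees(np.arccos(cos_ang))

# Consistency check used in the fusion stage (Section 2.3): e.g. accept the
# pairing if angular_deviation_deg(...) <= 5 degrees.
```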
2.1. YOLOv11-Based UAV Target Detection Algorithm
YOLOv11, released by Ultralytics in September 2024 [23], represents a significant advancement in real-time object detection technology. It achieves superior performance across a wide range of computer vision tasks, including object detection, instance segmentation, pose estimation, tracking, and classification [24]. By incorporating the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling-Fast), and C2PSA (Convolutional Block with Parallel Spatial Attention) modules, YOLOv11 substantially enhances the accuracy and efficiency of feature extraction. Furthermore, it adopts an improved loss function, center-based point prediction and refined decoding methods, together with advanced data augmentation strategies, leading to markedly better generalization ability and robustness, thereby extending its applicability to diverse and challenging scenarios.
2.1.1. C3k2 Module
The C3k2 module is a lightweight convolutional block first introduced in YOLOv11 to enhance feature extraction efficiency. It is derived from the CSP (Cross Stage Partial) architecture, where the feature map is split into two parts: one part undergoes a series of convolutional operations, while the other is directly passed through to the output stage. This partial computation strategy reduces redundancy, improves gradient flow, and lowers computational cost. Compared with the standard C3 block, C3k2 replaces the conventional 3 × 3 convolution kernels with 2 × 2 kernels, enabling finer feature granularity, faster inference speed and reduced parameter count while maintaining competitive representational capacity. This design makes C3k2 particularly suitable for real-time object detection tasks where both accuracy and efficiency are critical.
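As an illustration of the cross-stage-partial idea described above, the following PyTorch sketch implements a generic CSP-style block in which half of the channels pass through a small convolutional stack while the other half bypass it. This is a simplified stand-in, not the Ultralytics C3k2 implementation; the layer names, kernel size, and channel split are illustrative.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit used in YOLO-style backbones."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPPartialBlock(nn.Module):
    """Cross Stage Partial pattern: after a 1x1 projection, half of the channels
    are transformed by n stacked convolutions, the other half are passed through
    unchanged, and the two paths are concatenated and fused."""
    def __init__(self, c_in, c_out, n=2, k=3):
        super().__init__()
        self.proj = ConvBNAct(c_in, c_out, 1)
        self.blocks = nn.Sequential(*[ConvBNAct(c_out // 2, c_out // 2, k) for _ in range(n)])
        self.fuse = ConvBNAct(c_out, c_out, 1)

    def forward(self, x):
        a, b = self.proj(x).chunk(2, dim=1)   # partial computation: only `a` is transformed
        return self.fuse(torch.cat((self.blocks(a), b), dim=1))
```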
2.1.2. C2PSA Module
The C2PSA module is an innovative attention mechanism first introduced in YOLOv11 to improve spatial feature representation [25]. Unlike traditional convolution blocks that process all spatial information uniformly, C2PSA employs two parallel branches: one branch focuses on local detail extraction through convolution operations, while the other applies a Parallel Spatial Attention (PSA) mechanism to selectively emphasize informative spatial regions and suppress irrelevant background. This dual-branch design allows the model to capture both fine-grained local patterns and broader contextual dependencies. By integrating spatial attention directly into the convolutional block, C2PSA enhances object localization accuracy, especially for small or densely packed objects, while maintaining computational efficiency. This innovation expands the model’s capability to adaptively focus on key features across diverse computer vision tasks.
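The dual-branch idea can be illustrated with a compact PyTorch sketch: a convolutional branch preserves local detail while a parallel branch produces a spatial attention map that re-weights informative regions. This is a schematic of the parallel-spatial-attention concept under simplified assumptions, not the actual C2PSA module from Ultralytics.

```python
import torch
import torch.nn as nn

class ParallelSpatialAttention(nn.Module):
    """Two parallel branches: local convolutional detail plus a spatial attention
    map (one value per pixel, squashed to (0, 1)) that suppresses background."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.attention = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Sum of the detail branch and the attention-reweighted input.
        return self.local(x) + x * self.attention(x)
```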
2.1.3. Evaluation Metrics
- (1)
Precision
Precision measures the proportion of all predicted “UAV” detections that are true UAV targets. It is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ (True Positive) is the number of samples correctly predicted as positive and truly positive; $FP$ (False Positive) is the number of samples predicted as positive but actually negative.
- (2)
Recall
Recall measures the model’s ability to identify all actual “UAV” targets. It is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ (False Negative) is the number of samples predicted as negative but actually positive.
- (3)
F1 Score
$F_1$ comprehensively evaluates both precision and recall, reflecting the model’s overall balance between positive and negative samples in the UAV detection task. It is defined as the harmonic mean of precision and recall:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
- (4)
mAP (mean Average Precision)
mAP is a comprehensive evaluation metric in object detection tasks, measuring the model’s overall detection performance across different categories and confidence thresholds. $mAP_{0.5}$ denotes the average precision (AP) across all categories when the Intersection over Union (IoU) threshold between predicted and ground-truth bounding boxes is set to 0.5:
$$mAP_{0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where $AP$ (Average Precision) is obtained as the area under the precision–recall curve, with values closer to 1 indicating better performance. $mAP_{0.5:0.95}$ refers to the mean $AP$ averaged over multiple IoU thresholds, ranging from 0.5 to 0.95 in steps of 0.05, representing a more stringent evaluation metric:
$$mAP_{0.5:0.95} = \frac{1}{10}\sum_{t \in \{0.5,\,0.55,\,\ldots,\,0.95\}} mAP_t$$
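For reference, the following Python sketch computes these metrics from raw detection counts and from a precision-recall curve using the standard all-point interpolation; it is a minimal illustration of the definitions above rather than the evaluation code actually used (which would typically be the Ultralytics/COCO tooling).

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, as defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).
    `recalls` must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically decreasing, then integrate.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```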
2.2. Principles of Capon-Based Microphone-Array Imaging
In practical UAV application scenarios, a large amount of background noise interference is present, and these factors may compromise the accurate identification and localization of the target sound source. As a small and highly mobile sound source, a UAV emits signals with a relatively dispersed spectrum and weak intensity, so traditional energy spectrum estimation struggles to accurately locate the UAV sound source in a high-background-noise environment. Therefore, the Capon-based beamforming algorithm is used in this paper [26,27]. This algorithm, as a typical high-resolution spatial spectrum estimation method, can minimize the interference and noise power from other directions while maintaining a distortion-free response in the target direction, and thus has significant advantages in terms of noise suppression and spatial resolution. Compared with delay-and-sum or conventional beamforming methods, the Capon algorithm can effectively suppress sidelobe leakage and improve the ability to resolve closely spaced sources, which makes it especially suitable for accurate acoustic source localization in the complex multipath environments encountered in UAV monitoring. Based on this, this paper adopts the Capon algorithm to analyze the spatial spectrum of the multi-channel data collected by the acoustic sensor array and extracts the direction-of-arrival (DOA) information of the UAV accordingly, thereby realizing high-precision acoustic localization.
In signal processing for UAV detection, the distance $r$ from the sound source to the center of the array satisfies
$$r > \frac{2D^{2}}{\lambda}$$
where $D$ is the maximum aperture of the array and $\lambda$ is the acoustic wavelength.
Thus, the source is in the far-field region of the array and the acoustic field can be treated as a plane wave, so the signal received at each array element differs only by a phase shift related to the propagation direction, and its amplitude variation can be ignored. Therefore, the received signal of the $m$-th sensor of the array for a narrowband plane wave from direction $\theta$ can be expressed as
$$x_m(t) = a_m(\theta)\, s(t) + n_m(t)$$
Combining the signals from all array elements into vector form, the array received-signal model is obtained as
$$\mathbf{x}(t) = \mathbf{a}(\theta)\, s(t) + \mathbf{n}(t)$$
where $\mathbf{a}(\theta)$ is the normalized array manifold (steering) vector for direction $\theta$; $s(t)$ is the incident signal from direction $\theta$; and $\mathbf{n}(t)$ denotes background noise and other interfering signals.
The Capon algorithm suppresses interference and noise by constructing optimized weight vectors to minimize the output power in the non-target direction while ensuring a distortion-free response in the target direction, thus achieving a high-resolution spatial spectral estimation of the sound source.
First, the covariance matrix of the received data is constructed:
$$\mathbf{R}_x = E\!\left[\mathbf{x}(t)\,\mathbf{x}^{H}(t)\right] \approx \frac{1}{N}\sum_{t=1}^{N}\mathbf{x}(t)\,\mathbf{x}^{H}(t)$$
The optimal beamformer weight vector $\mathbf{w}(\theta)$ for direction $\theta$ is constructed by satisfying the unit-response constraint $\mathbf{w}^{H}\mathbf{a}(\theta) = 1$ while minimizing the output power:
$$\min_{\mathbf{w}}\ \mathbf{w}^{H}\mathbf{R}_x\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{a}(\theta) = 1$$
The analytical solution of this optimization problem is
$$\mathbf{w}(\theta) = \frac{\mathbf{R}_x^{-1}\mathbf{a}(\theta)}{\mathbf{a}^{H}(\theta)\,\mathbf{R}_x^{-1}\,\mathbf{a}(\theta)}$$
The corresponding spatial spectrum estimation expression is
$$P_{\mathrm{Capon}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\,\mathbf{R}_x^{-1}\,\mathbf{a}(\theta)}$$
where $P_{\mathrm{Capon}}(\theta)$ is the spatial spectral power in direction $\theta$, and $\mathbf{R}_x$ is the data covariance matrix.
By scanning over the candidate directions and computing $P_{\mathrm{Capon}}(\theta)$ for each angle, a spatial power spectrum is generated. The direction corresponding to the spectral peak is then identified as the estimated direction of the acoustic source.
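A compact NumPy sketch of the Capon estimator defined by the equations above is given below. The steering vectors depend on the actual array geometry and are left as an input here, and the diagonal-loading term is an illustrative regularization for numerical stability rather than part of the derivation.

```python
import numpy as np

def capon_spectrum(snapshots, steering_vectors, diagonal_loading=1e-3):
    """Capon (MVDR) spatial spectrum.

    snapshots:        (M, N) array of N time snapshots from M microphones.
    steering_vectors: (M, K) array, one normalized manifold vector per candidate direction.
    Returns P(theta_k) = 1 / (a^H R^-1 a) for each of the K candidate directions.
    """
    M, N = snapshots.shape
    R = snapshots @ snapshots.conj().T / N                      # sample covariance matrix
    R = R + diagonal_loading * np.trace(R).real / M * np.eye(M)  # regularize for invertibility
    R_inv = np.linalg.inv(R)
    denom = np.einsum('mk,mn,nk->k', steering_vectors.conj(), R_inv, steering_vectors)
    return 1.0 / denom.real

# The DOA estimate is the candidate direction with the highest Capon power:
# doa_index = np.argmax(capon_spectrum(x, A))
```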
2.3. Fusion Strategies
To achieve synergistic recognition by combining acoustic source imaging and image recognition, this section implements a spatial fusion recognition framework, as depicted in Figure 1. This framework incorporates multi-dimensional conditional criteria during the decision-making phase to enable the precise identification of UAV targets. The proposed mechanism is engineered to extract valid targets from multimodal redundant data by jointly assessing the spatial overlap and the confidence levels derived from both the acoustic and image modalities. This integrative evaluation substantially enhances recognition accuracy and robustness. The fusion recognition decision is performed on a per-frame or per-time-window basis, with the decision criteria encompassing the following elements:
- (1)
Independent Confidence Evaluation
Image Confidence Threshold: If the YOLOv11 model identifies a UAV target within an image and the confidence score of the bounding box, denoted as Pimg, is greater than or equal to 0.4, the image modality is considered to contain a valid candidate target;
Acoustic Intensity Threshold: A definitive localization response is considered to exist for the acoustic modality if the peak main-lobe energy Sacoustic in some direction of the Capon imaging map exceeds a threshold (e.g., 45 dB) and the peak differs from the surrounding sidelobes by a set dynamic range (e.g., ≥3 dB).
- (2)
Spatial Consistency Constraint (Region Overlap Criterion)
The bounding box obtained from image recognition is projected onto the spatial coordinate system corresponding to the acoustic source map. The spatial angular offset or projection overlap between the center of the detected image target and the peak direction of the main lobe in the acoustic map is then computed. If the spatial overlap between the two satisfies any of the following criteria, the target is classified as spatially consistent:
Region IoU ≥ 0.3;
Angular deviation between the image target center and acoustic direction ≤ 5°.
- (3)
Redundant Compensation Strategy:
In the following scenarios, target confirmation can still be initiated when only a single modality criterion is satisfied, provided that a higher confidence threshold is met:
Vision-Dominated Compensation: If image recognition yields a high-confidence target (Pimg ≥ 0.6) but there is no significant main lobe in the acoustic source map, and the target does not lie in a blind zone of the image or a strongly occluded area, then target confirmation is triggered through the visual channel.
Acoustic-Dominated Compensation: In instances where Capon beamforming imaging exhibits a pronounced main-lobe response and satisfies the criteria for dynamic range and directional resolution, yet the image recognition confidence score falls below 0.2 or the target (such as a distant or small object) remains undetected, a candidate target alert or assisted confirmation is initiated based on the directional information derived from the acoustic source. Overall, this approach integrates a trained YOLOv11 model with real-time acoustic source localization from the microphone array, enabling UAV targets to be identified and localized concurrently; the resulting per-frame decision logic is sketched below.
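The following schematic Python sketch summarizes how the independent confidence thresholds, the spatial consistency constraint, and the redundant compensation rules combine in a single per-frame decision. The argument names and return convention are illustrative assumptions, not the deployed implementation.

```python
def fuse_frame(img_conf, iou, angle_deg, peak_db, dyn_range_db, in_blind_zone=False):
    """Per-frame fusion decision following the criteria listed above.
    Thresholds mirror the values given in the text; returns a label and the
    modality that triggered the confirmation."""
    vision_ok = img_conf is not None and img_conf >= 0.4
    acoustic_ok = peak_db is not None and peak_db >= 45.0 and dyn_range_db >= 3.0
    spatially_consistent = (iou is not None and iou >= 0.3) or \
                           (angle_deg is not None and angle_deg <= 5.0)

    # (1) + (2): both modalities respond and agree spatially -> confirmed target.
    if vision_ok and acoustic_ok and spatially_consistent:
        return "UAV confirmed", "fusion"
    # (3a) Vision-dominated compensation.
    if img_conf is not None and img_conf >= 0.6 and not acoustic_ok and not in_blind_zone:
        return "UAV confirmed", "vision"
    # (3b) Acoustic-dominated compensation.
    if acoustic_ok and (img_conf is None or img_conf < 0.2):
        return "candidate alert", "acoustic"
    return "no target", None
```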
2.4. Data Collection
2.4.1. Image Train Dataset
We constructed a dataset of 51,446 high-resolution images originally captured at 1920 × 1080 pixels, featuring UAVs across multiple aircraft models, diverse viewing angles, and complex environments, including urban, forested, and airport scenes. For model training, all images were resized to 640 × 480 pixels while preserving aspect ratio where possible. Each UAV instance was manually labeled in YOLO format (x, y, w, h), where (x, y) is the normalized center coordinate and (w, h) are the normalized width and height of the bounding box.
Figure 2 presents dataset samples showing diverse UAV types (multi-rotor and fixed-wing) captured under various challenging conditions, including different environments (urban, forest, rural), illumination levels (bright daylight to dusk), observation distances, and background complexities. The images demonstrate the dataset’s coverage of real-world scenarios for robust UAV detection model training.
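For completeness, the YOLO annotation convention used above maps a pixel-space bounding box to normalized center and size coordinates; a minimal conversion helper (illustrative, not part of the dataset tooling) is shown below.

```python
def to_yolo_label(box_xyxy, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box to the normalized
    YOLO (x_center, y_center, width, height) format used for annotation."""
    x_min, y_min, x_max, y_max = box_xyxy
    x = (x_min + x_max) / 2.0 / img_w
    y = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return x, y, w, h
```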
2.4.2. Measured Data and Experimental Equipment
In order to systematically evaluate the performance of the proposed acoustic source imaging and image recognition fusion method in real scenarios, this paper constructs a series of typical UAV flight test scenarios and conducts a large-scale empirical verification of the fusion recognition system. The test scenarios cover typical UAV activities such as static hovering, multi-target cooperative flight, occlusion interference, dynamic flight at different altitudes, dark environment, and complex background environment.
The drone models used for collecting test data samples are the DJI Matrice 300 RTK and the DJI Mavic 3 Enterprise (SZ DJI Technology Co., Ltd., Shenzhen, China).
4. Conclusions
This study presents a multimodal fusion technology framework for UAV target recognition, leveraging a joint recognition approach that combines the high-resolution Capon acoustic source imaging algorithm with the deep learning-based image detection model YOLOv11. This integrated system enables precise identification and spatial localization of UAV targets within the airspace. The framework incorporates a 144-channel, non-uniformly spaced spiral arm acoustic array design and employs a comprehensive spatial consistency verification method through fusion decision-making strategies that integrate spatial redundancy and overlapping criteria. The proposed framework demonstrates robust directional imaging performance in challenging acoustic environments while maintaining reliable visual recognition capabilities for distant and small-scale targets, exhibiting strong robustness and real-time response capabilities across diverse operational conditions.
To validate the effectiveness and practical applicability of the proposed method, comprehensive real-world testing experiments were conducted across multiple typical operational scenarios, including static hovering, occlusion and interference conditions, multi-target parallel flights, and low-light environments, covering common perceptual challenges encountered in UAV detection missions. Each scenario type was repeated multiple times, with no fewer than five independent flight test sessions per condition, resulting in approximately 6000 labeled image frames and over one hour of acoustic signal recordings for statistical analysis. The experimental results demonstrate that while the fusion system achieves exceptional performance under ideal conditions (>99% accuracy), its true value emerges in challenging scenarios where individual modalities face significant limitations. Specifically, the fusion approach achieves 78% accuracy in low-light conditions (compared with 25% for vision-only detection) and 78% accuracy under occlusion (where vision-only detection drops to 9.75% while acoustic-only detection maintains 99% accuracy), and it reaches 99% accuracy in multi-target scenarios (where vision-only detection also achieves 99% but acoustic-only performance degrades to 54% due to acoustic intensity variations). In each case the fusion system substantially outperforms the weaker single modality, validating the effectiveness of the complementary fusion strategy.
Overall, the proposed acousto-optic fusion recognition system significantly surpasses traditional single-modality recognition algorithms in terms of target detection accuracy, anti-interference capability, and environmental adaptability. Through the inter-modal information complementarity mechanism, the system effectively leverages the inherent strengths of both acoustic and visual sensing modalities while mitigating their respective limitations in challenging operational environments. This adaptive fusion approach enables stable target perception and recognition in dynamic and complex scenarios where single-modality systems would otherwise fail. The study provides an efficient and reliable technical solution for UAV intrusion detection, low-altitude monitoring, and other critical applications, demonstrating substantial engineering value and establishing a solid foundation for future research in adaptive multimodal UAV detection systems.