1. Introduction
Unmanned aerial vehicles (UAVs) have become integral instruments in modern urban management, driven by the rapid expansion of the low-altitude economy. These technologies facilitate diverse applications, including logistics and distribution, environmental monitoring, and urban planning [1]. In the context of urban product supply chains and transportation management, UAVs offer innovative solutions for addressing the challenges associated with the “last mile” delivery segment [2,3,4,5]. Nevertheless, the widespread adoption of UAVs has been accompanied by a significant escalation in security concerns. Unauthorized drone activities, for example, pose substantial risks to public safety by potentially disrupting civil aviation pathways or being exploited for illicit purposes such as unauthorized surveillance and the conveyance of hazardous materials [6]. Consequently, the management of low-altitude airspace and the assurance of urban safety are becoming increasingly reliant on the advancement of reliable and efficient technologies for drone detection and identification [7].
Currently, various UAV detection techniques have been developed, predominantly categorized into active methods such as radar detection, and passive methods including radio frequency (RF) detection [8,9,10] and image recognition [11,12]. By emitting electromagnetic waves and analyzing the reflected signals, radar detection offers significant advantages in terms of long-range detection capability and high sensitivity to moving targets. However, radar systems designed for aircraft detection face significant challenges in identifying drones, as these systems typically employ techniques to suppress unwanted clutter echoes originating from small, slow-moving, low-altitude flying objects—precisely the characteristics that distinguish drones. Additionally, the radar cross section (RCS) of a medium-sized consumer drone closely resembles that of a bird, which can lead to the misclassification of targets [13,14]. RF detection, which involves passively monitoring communication links between drones and their controllers to detect and locate targets, also presents inherent limitations. Notably, this approach is ineffective against drones operating autonomously or utilizing encrypted communications, and furthermore, RF signal decoding is prohibited in numerous countries.
Visual recognition demonstrates superior detection capabilities for low, slow, and small (LSS) UAV targets. Owing to its effectiveness, cost-efficiency, and precision, visual recognition has garnered significant interest from researchers in recent years for applications including the localization, tracking, and early warning of UAV events. To enhance the network’s focus on salient features while suppressing irrelevant information, C. Wang et al. [15] implemented a lightweight UAV cluster identification method incorporating an attention mechanism. Xiao et al. [16] proposed the ST-Trans framework as a robust end-to-end solution for detecting small infrared targets within complex background environments, and extensive ablation studies conducted on the SIRSTD dataset substantiated the effectiveness of the ST-Trans approach in improving detection performance. He et al. [17] addressed the challenge of UAV detection under adverse weather conditions, such as dense fog, by developing a specialized dataset tailored for foggy environments to enhance recognition accuracy and efficiency. Additionally, they proposed an advanced image recognition method that integrates a content-aware dynamic sparse attention module with a dual-routing mechanism, enabling the flexible allocation of computational resources. Although vision-based drone tracking and recognition methods have significantly enhanced detection accuracy, their effectiveness remains constrained under challenging conditions such as fog, snow, rain, or nighttime, due to the inherent limitations of camera systems.
The identification and tracking of drones through the utilization of acoustic sensors and source localization can produce favorable results. To address the challenge of separating mixed signals amid interfering sources, Zahoor Uddin et al. [18] proposed a multi-UAV detection method utilizing acoustic signals. Additionally, they developed a Time-Varying Drone Detection Technique (TVDDT) that leverages variations in the mixing coefficients to perform classification and tracking based on a minimum variation criterion within the channel. The proposed approach was assessed through simulation studies, demonstrating highly favorable performance outcomes. Lin Shi et al. [19] proposed a methodology for UAV identification predicated on propeller noise, employing Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and Hidden Markov Models (HMM) for classification. The features extracted from multiple training datasets were utilized to train the HMM-based classifier. Empirical findings suggest that this approach achieves a relatively high identification accuracy even in noisy environments. Furthermore, microphone arrays demonstrate significant efficacy for both detection and localization tasks due to their superior capabilities in spatial localization. Guo et al. [20] conducted a study on adaptive beamforming techniques applied in both ideal and non-ideal conditions. Initially, beamforming was utilized to localize the sound source, with the acoustic signals captured via microphone arrays. Subsequently, the recorded sounds were classified employing HMM, followed by tracking through adaptive beamforming methods.
Blanchard et al. [21] developed an acoustic detection array comprising a limited number of microphones tuned to specific UAV frequency bands, enabling comprehensive acoustic field reconstruction in all spatial directions for the purpose of localizing and characterizing sound sources. Nevertheless, in complex urban settings, conventional approaches face significant challenges in accurately identifying and localizing a dynamically varying noise source due to the prevalence of reflection, diffraction, and multipath propagation effects, particularly in low-altitude noise scenarios such as UAV flights. Furthermore, the increasing diversity of UAV flight activities necessitates monitoring systems with enhanced real-time processing and multi-target recognition capabilities, thereby further complicating the task of noise source identification.
Among the aforementioned technical approaches, relying solely on a single detection method—such as image recognition or acoustic signal analysis—often encounters significant limitations in practical applications. These limitations include constraints imposed by the operational environment, inadequate real-time responsiveness, and limited robustness against interference. In urban settings, static occlusions—such as buildings and trees—as well as dynamic occlusions caused by the targets themselves, substantially impair the accuracy of visual detection. Additionally, nighttime and low-light conditions result in considerable degradation of detection performance due to diminished image quality. These challenges are particularly pronounced in complex scenarios such as urban low-altitude flight, significantly undermining the robustness and practical applicability of UAV monitoring systems. Consequently, the integrated processing of image recognition and acoustic localization through multimodal fusion techniques, which leverage the complementary and redundant nature of these data sources, can markedly enhance target recognition accuracy, response time, and environmental adaptability. In this context, the acoustic modality offers valuable compensatory information when visual data are limited or compromised.
Numerous studies have investigated audio-visual fusion methodologies for the real-time detection of drones. Fredrik et al. proposed a multi-modal framework that primarily employs thermal infrared and visible-light cameras, leveraging the YOLOv2 algorithm for the detection of drones, aircraft, and birds. Their approach also incorporated an acoustic component, utilizing a microphone in conjunction with a Long Short-Term Memory (LSTM) classifier to recognize acoustic signatures. To reduce false positive detections associated with manned aircraft, the system integrated Automatic Dependent Surveillance-Broadcast (ADS-B) data. The study identified that the principal sources of misclassification originated from insects and cloud formations, indicating that a single-microphone audio classification module exhibits limited efficacy. Moreover, the incorporation of heterogeneous sensors alongside the simultaneous operation of multiple detection models substantially elevates the overall complexity of the system [22]. Liu et al. presented an audio-assisted camera array system for drone detection, consisting of 30 high-definition cameras and 3 microphones mounted on a hemispherical frame. The system fuses HOG visual features with MFCC audio features using SVM classifiers to achieve real-time drone detection with 360-degree sky coverage [12]. Experimental results demonstrate a significant improvement in detection accuracy compared to vision-only methods, with positive sample precision increasing from 79.5% to 95.74%. However, this large-scale system requires 8 workstations for data processing and control.
Detection techniques such as radar and RF systems experience diminished stability at night, in adverse weather, and under occlusion. In contrast, passive sensing modalities, such as acoustic and visual methods, generally demonstrate greater robustness under these challenging circumstances. Nonetheless, each individual modality possesses intrinsic limitations, underscoring the critical role of multimodal fusion approaches in achieving dependable UAV detection within complex environments.
Consequently, this study introduces a novel approach for UAV detection that mitigates the shortcomings inherent in single-modal detection systems within complex urban settings. Conventional visual-based detection techniques frequently encounter difficulties under adverse conditions, including visual occlusion, low-light environments, and intricate backgrounds characterized by multiple sources of interference. To address these challenges, the research develops a multimodal fusion framework that integrates acoustic sensing with computer vision technologies. The proposed methodology not only improves the reliability and accuracy of UAV detection but also enhances the operational range of monitoring systems in practical scenarios, where environmental factors often impair the effectiveness of single-sensor approaches.
2. Materials and Methods
To improve the precision and reliability of UAV detection and localization systems operating in complex environments, this study proposes a multimodal monitoring framework that integrates acoustic beamforming imaging with visual-based recognition. This method leverages the complementary strengths of acoustic and visual sensors to enable efficient detection, classification, and localization of low-altitude UAV targets. Specifically, an array of acoustic sensors acquires multichannel sound signals from the surrounding environment. By employing the Capon beamforming algorithm, the system estimates the spatial spectrum through iterative calculations across grid points, where spectral peaks correspond to the estimated direction of arrival of the sound source. For the continuous narrowband acoustic signals generated by UAVs during flight, this method enhances their spatial features even under strong interference, thereby enabling accurate localization. Simultaneously, a visible-light camera captures aerial imagery synchronously. Using the YOLOv11 deep learning-based object detection algorithm, UAV targets are swiftly detected and classified. YOLOv11 excels at detecting small objects and processing data in real time, accurately segmenting UAV contours, and generating precise bounding boxes even under complex visual backgrounds.
In the fusion stage, this study adopts a spatial-domain alignment mechanism to align the detection outputs from the two sensor modalities. Using the camera’s imaging perspective parameters, the estimated acoustic source directions are projected onto the image plane, allowing the consistency between acoustic localization and visual detection results to be verified. This alignment significantly enhances target recognition accuracy and reduces the false alarm rate. For targets that generate a distinct acoustic signal yet appear visually indistinct or blurred, acoustic localization offers essential preliminary directional information. Conversely, when acoustic localization accuracy declines as a result of environmental noise, visual detection provides complementary information. The integration of these multimodal inputs enhances spatial awareness for UAV detection and improves the system’s resilience in challenging acoustic and visual environments. Consequently, the approach presents a robust technical framework for accurate UAV monitoring and control within urban airspace.
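To make the spatial alignment step concrete, the following minimal Python sketch (not the authors’ implementation) shows how an estimated acoustic direction of arrival could be projected onto the image plane with a pinhole camera model and compared against the center of a YOLO bounding box; the intrinsic parameters, coordinate conventions, and function names are illustrative assumptions, and extrinsic calibration between the array and the camera is omitted.

```python
import numpy as np

def doa_to_pixel(azimuth_deg, elevation_deg, fx, fy, cx, cy):
    """Project a DOA (azimuth/elevation in the camera frame) onto the image plane
    using a pinhole model. Assumes the acoustic array and camera are co-located
    and axis-aligned (extrinsic calibration omitted for brevity)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Unit ray in camera coordinates: x to the right, y downward, z forward.
    ray = np.array([np.cos(el) * np.sin(az), -np.sin(el), np.cos(el) * np.cos(az)])
    if ray[2] <= 0:                      # source behind the camera: no valid projection
        return None
    return fx * ray[0] / ray[2] + cx, fy * ray[1] / ray[2] + cy

def angular_deviation_deg(uv_a, uv_b, fx, fy, cx, cy):
    """Angle between the viewing rays through two pixel locations, e.g. the
    projected acoustic direction and the center of a detected bounding box."""
    def ray(uv):
        r = np.array([(uv[0] - cx) / fx, (uv[1] - cy) / fy, 1.0])
        return r / np.linalg.norm(r)
    cos_ang = np.clip(np.dot(ray(uv_a), ray(uv_b)), -1.0, 1.0)
    return np.degrees(np.arccos(cos_ang))

# Consistency check used in the fusion stage (Section 2.3): e.g. accept the
# pairing if angular_deviation_deg(...) <= 5 degrees.
```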
2.1. YOLOv11-Based UAV Target Detection Algorithm
YOLOv11, released by Ultralytics in September 2024 [23], represents a significant advancement in real-time object detection technology. It achieves superior performance across a wide range of computer vision tasks, including object detection, instance segmentation, pose estimation, tracking, and classification [24]. By incorporating the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling-Fast), and C2PSA (Convolutional Block with Parallel Spatial Attention) modules, YOLOv11 substantially enhances the accuracy and efficiency of feature extraction. Furthermore, it adopts an improved loss function, center-based point prediction and refined decoding methods, together with advanced data augmentation strategies, leading to markedly better generalization ability and robustness, thereby extending its applicability to diverse and challenging scenarios.
2.1.1. C3k2 Module
The C3k2 module is a lightweight convolutional block first introduced in YOLOv11 to enhance feature extraction efficiency. It is derived from the CSP (Cross Stage Partial) architecture, where the feature map is split into two parts: one part undergoes a series of convolutional operations, while the other is directly passed through to the output stage. This partial computation strategy reduces redundancy, improves gradient flow, and lowers computational cost. Compared with the standard C3 block, C3k2 replaces the conventional 3 × 3 convolution kernels with 2 × 2 kernels, enabling finer feature granularity, faster inference speed and reduced parameter count while maintaining competitive representational capacity. This design makes C3k2 particularly suitable for real-time object detection tasks where both accuracy and efficiency are critical.
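As an illustration of the cross-stage-partial idea described above, the following PyTorch sketch implements a generic CSP-style block in which half of the channels pass through a small convolutional stack while the other half bypass it. This is a simplified stand-in, not the Ultralytics C3k2 implementation; the layer names, kernel size, and channel split are illustrative.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit used in YOLO-style backbones."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPPartialBlock(nn.Module):
    """Cross Stage Partial pattern: after a 1x1 projection, half of the channels
    are transformed by n stacked convolutions, the other half are passed through
    unchanged, and the two paths are concatenated and fused."""
    def __init__(self, c_in, c_out, n=2, k=3):
        super().__init__()
        self.proj = ConvBNAct(c_in, c_out, 1)
        self.blocks = nn.Sequential(*[ConvBNAct(c_out // 2, c_out // 2, k) for _ in range(n)])
        self.fuse = ConvBNAct(c_out, c_out, 1)

    def forward(self, x):
        a, b = self.proj(x).chunk(2, dim=1)   # partial computation: only `a` is transformed
        return self.fuse(torch.cat((self.blocks(a), b), dim=1))
```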
2.1.2. C2PSA Module
The C2PSA module is an innovative attention mechanism first introduced in YOLOv11 to improve spatial feature representation [25]. Unlike traditional convolution blocks that process all spatial information uniformly, C2PSA employs two parallel branches: one branch focuses on local detail extraction through convolution operations, while the other applies a Parallel Spatial Attention (PSA) mechanism to selectively emphasize informative spatial regions and suppress irrelevant background. This dual-branch design allows the model to capture both fine-grained local patterns and broader contextual dependencies. By integrating spatial attention directly into the convolutional block, C2PSA enhances object localization accuracy, especially for small or densely packed objects, while maintaining computational efficiency. This innovation expands the model’s capability to adaptively focus on key features across diverse computer vision tasks.
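The dual-branch idea can be illustrated with a compact PyTorch sketch: a convolutional branch preserves local detail while a parallel branch produces a spatial attention map that re-weights informative regions. This is a schematic of the parallel-spatial-attention concept under simplified assumptions, not the actual C2PSA module from Ultralytics.

```python
import torch
import torch.nn as nn

class ParallelSpatialAttention(nn.Module):
    """Two parallel branches: local convolutional detail plus a spatial attention
    map (one value per pixel, squashed to (0, 1)) that suppresses background."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.attention = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Sum of the detail branch and the attention-reweighted input.
        return self.local(x) + x * self.attention(x)
```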
2.1.3. Evaluation Metrics
- (1)
Precision
Precision measures the proportion of all predicted “UAV” detections that are true UAV targets. It is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ (True Positive) is the number of samples correctly predicted as positive and truly positive; $FP$ (False Positive) is the number of samples predicted as positive but actually negative.
- (2)
Recall
Recall measures the model’s ability to identify all actual “UAV” targets. It is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ (False Negative) is the number of samples predicted as negative but actually positive.
- (3)
F1 Score
$F_1$ comprehensively evaluates both precision and recall, reflecting the model’s overall balance between positive and negative samples in the UAV detection task. It is defined as the harmonic mean of precision and recall:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
- (4)
mAP (mean Average Precision)
mAP is a comprehensive evaluation metric in object detection tasks, measuring the model’s overall detection performance across different categories and confidence thresholds. $mAP_{0.5}$ denotes the average precision (AP) across all categories when the Intersection over Union (IoU) threshold between predicted and ground-truth bounding boxes is set to 0.5:
$$mAP_{0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where $AP$ (Average Precision) is obtained as the area under the precision–recall curve, with values closer to 1 indicating better performance. $mAP_{0.5:0.95}$ refers to the mean $AP$ averaged over multiple IoU thresholds, ranging from 0.5 to 0.95 in steps of 0.05, representing a more stringent evaluation metric:
$$mAP_{0.5:0.95} = \frac{1}{10}\sum_{t \in \{0.5,\,0.55,\,\ldots,\,0.95\}} mAP_t$$
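For reference, the following Python sketch computes these metrics from raw detection counts and from a precision-recall curve using the standard all-point interpolation; it is a minimal illustration of the definitions above rather than the evaluation code actually used (which would typically be the Ultralytics/COCO tooling).

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, as defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).
    `recalls` must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically decreasing, then integrate.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```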
2.2. Principles of Capon-Based Microphone-Array Imaging
In practical UAV application scenarios, a large amount of background noise interference is present, and these factors may compromise the accurate identification and localization of the target sound source. As a small and highly mobile sound source, a UAV emits signals with a relatively dispersed spectrum and weak intensity, so traditional energy spectrum estimation struggles to accurately locate the UAV sound source in a high-background-noise environment. Therefore, the Capon-based beamforming algorithm is used in this paper [26,27]. This algorithm, as a typical high-resolution spatial spectrum estimation method, can minimize the interference and noise power from other directions while maintaining a distortion-free response in the target direction, and thus has significant advantages in terms of noise suppression and spatial resolution. Compared with delay-and-sum or conventional beamforming methods, the Capon algorithm can effectively suppress sidelobe leakage and improve the ability to resolve closely spaced sources, which makes it especially suitable for accurate acoustic source localization in the complex multipath environments encountered in UAV monitoring. Based on this, this paper adopts the Capon algorithm to analyze the spatial spectrum of the multi-channel data collected by the acoustic sensor array and extracts the direction-of-arrival (DOA) information of the UAV accordingly, thereby realizing high-precision acoustic localization.
In signal processing for UAV detection, the distance $r$ from the sound source to the center of the array satisfies
$$r > \frac{2D^{2}}{\lambda}$$
where $D$ is the maximum aperture of the array and $\lambda$ is the acoustic wavelength.
Thus, the source is in the far-field region of the array and the acoustic field can be treated as a plane wave, so the signal received at each array element differs only by a phase shift related to the propagation direction, and its amplitude variation can be ignored. Therefore, the received signal of the $m$-th sensor of the array for a narrowband plane wave from direction $\theta$ can be expressed as
$$x_m(t) = a_m(\theta)\, s(t) + n_m(t)$$
Combining the signals from all array elements into vector form, the array received-signal model is obtained as
$$\mathbf{x}(t) = \mathbf{a}(\theta)\, s(t) + \mathbf{n}(t)$$
where $\mathbf{a}(\theta)$ is the normalized array manifold (steering) vector for direction $\theta$; $s(t)$ is the incident signal from direction $\theta$; and $\mathbf{n}(t)$ denotes background noise and other interfering signals.
The Capon algorithm suppresses interference and noise by constructing optimized weight vectors to minimize the output power in the non-target direction while ensuring a distortion-free response in the target direction, thus achieving a high-resolution spatial spectral estimation of the sound source.
First, the covariance matrix of the received data is constructed:
$$\mathbf{R}_x = E\!\left[\mathbf{x}(t)\,\mathbf{x}^{H}(t)\right] \approx \frac{1}{N}\sum_{t=1}^{N}\mathbf{x}(t)\,\mathbf{x}^{H}(t)$$
The optimal beamformer weight vector $\mathbf{w}(\theta)$ for direction $\theta$ is constructed by satisfying the unit-response constraint $\mathbf{w}^{H}\mathbf{a}(\theta) = 1$ while minimizing the output power:
$$\min_{\mathbf{w}}\ \mathbf{w}^{H}\mathbf{R}_x\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{a}(\theta) = 1$$
The analytical solution of this optimization problem is
$$\mathbf{w}(\theta) = \frac{\mathbf{R}_x^{-1}\mathbf{a}(\theta)}{\mathbf{a}^{H}(\theta)\,\mathbf{R}_x^{-1}\,\mathbf{a}(\theta)}$$
The corresponding spatial spectrum estimation expression is
$$P_{\mathrm{Capon}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\,\mathbf{R}_x^{-1}\,\mathbf{a}(\theta)}$$
where $P_{\mathrm{Capon}}(\theta)$ is the spatial spectral power in direction $\theta$, and $\mathbf{R}_x$ is the data covariance matrix.
By scanning over the candidate directions and computing $P_{\mathrm{Capon}}(\theta)$ for each angle, a spatial power spectrum is generated. The direction corresponding to the spectral peak is then identified as the estimated direction of the acoustic source.
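A compact NumPy sketch of the Capon estimator defined by the equations above is given below. The steering vectors depend on the actual array geometry and are left as an input here, and the diagonal-loading term is an illustrative regularization for numerical stability rather than part of the derivation.

```python
import numpy as np

def capon_spectrum(snapshots, steering_vectors, diagonal_loading=1e-3):
    """Capon (MVDR) spatial spectrum.

    snapshots:        (M, N) array of N time snapshots from M microphones.
    steering_vectors: (M, K) array, one normalized manifold vector per candidate direction.
    Returns P(theta_k) = 1 / (a^H R^-1 a) for each of the K candidate directions.
    """
    M, N = snapshots.shape
    R = snapshots @ snapshots.conj().T / N                      # sample covariance matrix
    R = R + diagonal_loading * np.trace(R).real / M * np.eye(M)  # regularize for invertibility
    R_inv = np.linalg.inv(R)
    denom = np.einsum('mk,mn,nk->k', steering_vectors.conj(), R_inv, steering_vectors)
    return 1.0 / denom.real

# The DOA estimate is the candidate direction with the highest Capon power:
# doa_index = np.argmax(capon_spectrum(x, A))
```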
2.3. Fusion Strategies
To achieve synergistic recognition by combining acoustic source imaging and image recognition, this section implements a spatial fusion recognition framework, as depicted in Figure 1. This framework incorporates multi-dimensional conditional criteria during the decision-making phase to enable the precise identification of UAV targets. The proposed mechanism is engineered to extract valid targets from multimodal redundant data by jointly assessing the spatial overlap and the confidence levels derived from both the acoustic and image modalities. This integrative evaluation substantially enhances recognition accuracy and robustness. The fusion recognition decision is performed on a per-frame or per-time-window basis, with the decision criteria encompassing the following elements:
- (1)
Independent Confidence Evaluation
Image Confidence Threshold: If the YOLOv11 model identifies a UAV target within an image and the confidence score of the bounding box, denoted as Pimg, is greater than or equal to 0.4, the image modality is considered to contain a valid candidate target;
Acoustic Intensity Threshold: A definitive localization response is considered to exist for the acoustic modality if the peak main-lobe energy Sacoustic in some direction of the Capon imaging map exceeds a threshold (e.g., 45 dB) and the peak differs from the surrounding sidelobes by a set dynamic range (e.g., ≥3 dB).
- (2)
Spatial Consistency Constraint (Region Overlap Criterion)
The bounding box obtained from image recognition is projected onto the spatial coordinate system corresponding to the acoustic source map. The spatial angular offset or projection overlap between the center of the detected image target and the peak direction of the main lobe in the acoustic map is then computed. If the spatial overlap between the two satisfies any of the following criteria, the target is classified as spatially consistent:
Region IoU ≥ 0.3;
Angular deviation between the image target center and acoustic direction ≤ 5°.
- (3)
Redundant Compensation Strategy:
In the following scenarios, target confirmation can still be initiated when only a single modality criterion is satisfied, provided that a higher confidence threshold is met:
Vision-Dominated Compensation: If image recognition yields a high-confidence target (Pimg ≥ 0.6) but there is no significant main lobe in the acoustic source map, and the target does not lie in a blind zone of the image or a strongly occluded area, then target confirmation is triggered through the visual channel.
Acoustic-Dominated Compensation: In instances where Capon beamforming imaging exhibits a pronounced main-lobe response and satisfies the criteria for dynamic range and directional resolution, yet the image recognition confidence score falls below 0.2 or the target (such as a distant or small object) remains undetected, a candidate target alert or assisted confirmation is initiated based on the directional information derived from the acoustic source. Overall, this approach integrates a trained YOLOv11 model with real-time acoustic source localization from the microphone array, enabling UAV targets to be identified and localized concurrently; the resulting per-frame decision logic is sketched below.
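The following schematic Python sketch summarizes how the independent confidence thresholds, the spatial consistency constraint, and the redundant compensation rules combine in a single per-frame decision. The argument names and return convention are illustrative assumptions, not the deployed implementation.

```python
def fuse_frame(img_conf, iou, angle_deg, peak_db, dyn_range_db, in_blind_zone=False):
    """Per-frame fusion decision following the criteria listed above.
    Thresholds mirror the values given in the text; returns a label and the
    modality that triggered the confirmation."""
    vision_ok = img_conf is not None and img_conf >= 0.4
    acoustic_ok = peak_db is not None and peak_db >= 45.0 and dyn_range_db >= 3.0
    spatially_consistent = (iou is not None and iou >= 0.3) or \
                           (angle_deg is not None and angle_deg <= 5.0)

    # (1) + (2): both modalities respond and agree spatially -> confirmed target.
    if vision_ok and acoustic_ok and spatially_consistent:
        return "UAV confirmed", "fusion"
    # (3a) Vision-dominated compensation.
    if img_conf is not None and img_conf >= 0.6 and not acoustic_ok and not in_blind_zone:
        return "UAV confirmed", "vision"
    # (3b) Acoustic-dominated compensation.
    if acoustic_ok and (img_conf is None or img_conf < 0.2):
        return "candidate alert", "acoustic"
    return "no target", None
```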
2.4. Data Collection
2.4.1. Image Train Dataset
We constructed a dataset of 51,446 high-resolution images originally captured at 1920 × 1080 pixels, featuring UAVs across multiple aircraft models, diverse viewing angles, and complex environments, including urban, forested, and airport scenes. For model training, all images were resized to 640 × 480 pixels while preserving aspect ratio where possible. Each UAV instance was manually labeled in YOLO format (x, y, w, h), where (x, y) is the normalized center coordinate and (w, h) are the normalized width and height of the bounding box.
Figure 2 presents dataset samples showing diverse UAV types (multi-rotor and fixed-wing) captured under various challenging conditions, including different environments (urban, forest, rural), illumination levels (bright daylight to dusk), observation distances, and background complexities. The images demonstrate the dataset’s coverage of real-world scenarios for robust UAV detection model training.
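For completeness, the YOLO annotation convention used above maps a pixel-space bounding box to normalized center and size coordinates; a minimal conversion helper (illustrative, not part of the dataset tooling) is shown below.

```python
def to_yolo_label(box_xyxy, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box to the normalized
    YOLO (x_center, y_center, width, height) format used for annotation."""
    x_min, y_min, x_max, y_max = box_xyxy
    x = (x_min + x_max) / 2.0 / img_w
    y = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return x, y, w, h
```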
2.4.2. Measured Data and Experimental Equipment
In order to systematically evaluate the performance of the proposed acoustic source imaging and image recognition fusion method in real scenarios, this paper constructs a series of typical UAV flight test scenarios and conducts a large-scale empirical verification of the fusion recognition system. The test scenarios cover typical UAV activities such as static hovering, multi-target cooperative flight, occlusion interference, dynamic flight at different altitudes, dark environment, and complex background environment.
The drone models used for collecting test data samples are the DJI Matrice 300 RTK and the DJI Mavic 3 Enterprise (SZ DJI Technology Co., Ltd., Shenzhen, China).
4. Conclusions
This study presents a multimodal fusion technology framework for UAV target recognition, leveraging a joint recognition approach that combines the high-resolution Capon acoustic source imaging algorithm with the deep learning-based image detection model YOLOv11. This integrated system enables precise identification and spatial localization of UAV targets within the airspace. The framework incorporates a 144-channel, non-uniformly spaced spiral arm acoustic array design and employs a comprehensive spatial consistency verification method through fusion decision-making strategies that integrate spatial redundancy and overlapping criteria. The proposed framework demonstrates robust directional imaging performance in challenging acoustic environments while maintaining reliable visual recognition capabilities for distant and small-scale targets, exhibiting strong robustness and real-time response capabilities across diverse operational conditions.
To validate the effectiveness and practical applicability of the proposed method, comprehensive real-world testing experiments were conducted across multiple typical operational scenarios, including static hovering, occlusion and interference conditions, multi-target parallel flights, and low-light environments, covering common perceptual challenges encountered in UAV detection missions. Each scenario type was repeated multiple times, with no fewer than five independent flight test sessions per condition, resulting in approximately 6000 labeled image frames and over one hour of acoustic signal recordings for statistical analysis. The experimental results demonstrate that while the fusion system achieves exceptional performance under ideal conditions (>99% accuracy), its true value emerges in challenging scenarios where individual modalities face significant limitations. Specifically, the fusion approach achieves 78% accuracy in low-light conditions (compared with 25% for vision-only detection) and 78% accuracy under occlusion (where vision-only detection drops to 9.75% while acoustic-only detection maintains 99% accuracy), and it reaches 99% accuracy in multi-target scenarios (where vision-only detection also achieves 99% but acoustic-only performance degrades to 54% due to acoustic intensity variations). In each case the fusion system substantially outperforms the weaker single modality, validating the effectiveness of the complementary fusion strategy.
Overall, the proposed acousto-optic fusion recognition system significantly surpasses traditional single-modality recognition algorithms in terms of target detection accuracy, anti-interference capability, and environmental adaptability. Through the inter-modal information complementarity mechanism, the system effectively leverages the inherent strengths of both acoustic and visual sensing modalities while mitigating their respective limitations in challenging operational environments. This adaptive fusion approach enables stable target perception and recognition in dynamic and complex scenarios where single-modality systems would otherwise fail. The study provides an efficient and reliable technical solution for UAV intrusion detection, low-altitude monitoring, and other critical applications, demonstrating substantial engineering value and establishing a solid foundation for future research in adaptive multimodal UAV detection systems.