Article

CPD-UAV: A Benchmark Dataset for Detecting Personnel Visually Blended with the Environment Under UAV Perspective

School of Information Engineering, Engineering University of PAP, Xi’an 710086, China
*
Author to whom correspondence should be addressed.
Drones 2026, 10(6), 447; https://doi.org/10.3390/drones10060447
Submission received: 21 April 2026 / Revised: 2 June 2026 / Accepted: 3 June 2026 / Published: 8 June 2026

Highlights

What are the main findings?
  • The establishment of CPD-UAV, a novel benchmark dataset comprising 1061 high-resolution images meticulously annotated to address the challenge of detecting visually blended individuals under complex UAV perspectives.
  • The development of the Residual Gated Alignment Module (RGAM), a lightweight, plug-and-play architectural component that significantly improves the structural integrity and precise boundary localization of minute targets.
What are the implications of the main findings?
  • The proposed CPD-UAV dataset provides a rigorous data platform that bridges the critical domain gap between conventional camouflaged object detection research and practical, real-world field search-and-rescue applications.
  • RGAM offers a highly cost-effective and efficient algorithmic solution to the “vanishing boundary” and extreme scale variation challenges, making it exceptionally well-suited for the resource constraints of intelligent aerial monitoring systems.

Abstract

Camouflaged object detection (COD) is important for intelligent UAV monitoring and search-and-rescue operations. However, existing benchmarks focus primarily on natural camouflage, creating a noticeable domain shift for specific applications such as the search and rescue of individuals visually similar to their surroundings due to their clothing. To investigate this shift, we introduce CPD-UAV, a benchmark comprising 1061 high-resolution images with detailed pixel-level annotations across diverse terrains and flight altitudes. Benchmarking of seven state-of-the-art models on this dataset reveals specific challenges. Specifically, the scale variations and “vanishing boundaries” inherent in aerial perspectives can lead to boundary localization inaccuracies. Furthermore, this evaluation observes the deceptive nature of traditional metrics, such as Mean Absolute Error (MAE), when targets occupy small image proportions. To address the degradation of weak target signals during feature integration, we propose a lightweight, plug-and-play component: the Residual Gated Alignment Module (RGAM). RGAM handles scale variations by establishing semantic anchors in deep network layers, mitigating signal dilution and highlighting micro-targets against complex backgrounds. By integrating RGAM into three representative baselines, we demonstrate that the enhanced architectures achieve a competitive performance level. Quantitative results show consistent improvements in structural integrity (structure-measure, $S_m$) and boundary localization. Ultimately, this work provides a practical data platform and an effective algorithmic solution for advancing aerial monitoring systems.

1. Introduction

Object detection, a cornerstone of computer vision, enables the real-time identification and localization of objects in complex environments, enhancing the automation and intelligence of various systems [1]. This technology has been deployed in numerous fields, including remote sensing and emergency search and rescue. In modern environmental and emergency operations, Unmanned Aerial Vehicles (UAVs) have emerged as valuable tools due to their high mobility and ability to monitor areas from high-altitude perspectives [2]. However, in practical missions, the detection of “camouflaged targets” remains a persistent challenge. Such targets are deliberately designed to blend into their surroundings, reducing their visual saliency to evade detection systems [3].
In recent years, Camouflaged Object Detection (COD) has gained traction in computer vision [4]. While the task of detecting visually blended individuals under UAV perspectives shares the fundamental segmentation formulation of traditional COD, it introduces a noticeable domain shift. Existing COD research primarily focuses on natural camouflage from ground-level perspectives [5]. In contrast, the UAV perspective introduces specific geometric distortions, top-down viewpoints, and large-scale variations. When algorithms developed for these natural scenes are applied to UAV perspectives and camouflaged individuals in real-world environments, they often encounter difficulties in handling artificial camouflage patterns and aerial geometric distortions.
Specifically, this aerial domain shift presents distinct challenges. First, UAVs operate from nadir or oblique perspectives, which alters traditional human silhouette features, limiting the effectiveness of many shape-based priors [6]. Second, due to frequent fluctuations in flight altitude, targets exhibit scale variations within the imagery, imposing higher requirements on the feature extraction capabilities of algorithms [7]. Furthermore, the similarity between engineered camouflage and complex backgrounds frequently leads to “vanishing boundaries,” where the transition between foreground and background is difficult to distinguish, contributing to the performance degradation of existing models [8]. At the technological frontier, optical metamaterials are specifically designed to manipulate light reflection and eliminate structural discrepancies, such as shadows or geometric anomalies [9]. By significantly exacerbating the signal sparsity of minute targets, such advancements represent the ultimate manifestation of the “vanishing boundary” problem. Recognizing these extreme technologies provides a valuable theoretical context, further justifying the necessity of specialized architectures, such as our proposed RGAM module, to preserve weak target semantics against increasingly complex camouflage.
To investigate these specific challenges, we propose CPD-UAV: a benchmark dataset tailored to detecting visually blended individuals under UAV perspectives [10]. Utilizing the DJI Mavic 3T platform, we conducted field data collection across diverse wilderness environments to reflect the scenarios of real-world remote areas [11]. The dataset comprises 1061 high-resolution images, annotated with pixel-level masks, covering a wide array of terrains and flight altitudes. This dataset attempts to simultaneously address the dual factors of artificial camouflage and aerial geometric perspectives, providing a practical foundation for algorithm evaluation [12].
Contemporary COD networks typically rely on multi-scale feature fusion paradigms to localize targets. They routinely employ a top-down decoding strategy, where deep-layer features are progressively fused with shallow-layer features to delineate object boundaries [13]. However, when applied to UAV imagery, these conventional architectures often encounter performance bottlenecks. Because aerial targets exhibit scale variations and often occupy small proportions of the image, their deep-layer semantic representations are relatively weak [14]. During the standard top-down integration process, these fragile signals are sometimes diluted or masked by the abundant background noise present in the high-resolution shallow layers. This phenomenon exacerbates the aforementioned “vanishing boundaries” and can lead to localization inaccuracies [15].
To address these architectural bottlenecks and provide a viable solution for aerial COD, we introduce a plug-and-play architectural component: the Residual Gated Alignment Module (RGAM). Traditional top-down decoders may lose the weak signals of small targets during downsampling or introduce background noise during feature fusion [16,17]. RGAM addresses this by utilizing a residual gating mechanism in the deep layers, helping to ensure that the semantic anchors of tiny targets are not inadvertently suppressed against the shallow-layer noise, thereby mitigating the semantic gap between varying scales [18]. We integrate RGAM into three COD networks. Benchmarking on the CPD-UAV dataset reveals that the RGAM-enhanced variants achieve competitive performance levels, effectively mitigating multi-scale and vanishing boundary issues.
The primary contributions of this paper are as follows: (1) the establishment of the CPD-UAV benchmark dataset, comprising 1061 images with pixel-level annotations, which enriches the available data resources for detecting visually blended personnel under UAV perspectives; (2) the introduction of RGAM, an effective module designed to preserve small-target semantics and align heterogeneous features in aerial imagery; (3) an experimental evaluation demonstrating that integrating RGAM into existing models (SINet, BGNet, and PRNet) boosts detection performance on the CPD-UAV benchmark.

2. Related Works

2.1. Camouflaged Object Detection

COD aims to identify and segment highly concealed targets embedded in complex backgrounds. Given the homogeneity in texture, color, and luminance between the object and its environment, feature extraction remains a difficult task [19].
Early research primarily relied on hand-crafted descriptors, utilizing low-level visual cues such as contrast, edges, and color distributions to “break” the camouflage [20]. For instance, techniques like Local Binary Patterns (LBP) were employed to capture subtle textural discrepancies, while geometric gradient analysis was used to identify structural anomalies like shadows or convexities. In video sequences, optical flow was frequently adopted to detect background variations caused by motion. However, these traditional methods were highly sensitive to illumination changes and background noise, exhibiting limited robustness when targets were deeply integrated into their surroundings [21].
The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has significantly advanced the performance of COD [22]. These models leverage powerful feature modeling capabilities to extract high-level semantic information. Currently, mainstream deep architectures often adopt a “coarse-to-fine” strategy, simulating predatory behavior by first locating candidate regions and subsequently refining local details [23]. Simultaneously, attention mechanisms have been extensively integrated to suppress background interference and enhance the discriminability of blurred boundaries. Furthermore, multi-task learning strategies, such as the simultaneous execution of edge detection and segmentation, have proven effective in utilizing complementary information to sharpen object contours [24].
Recently, researchers have begun exploring features beyond the spatial domain. Frequency-domain learning has emerged as a promising direction, analyzing statistical discrepancies in frequency components to capture subtle engineered textural patterns that are often invisible in the RGB space [25]. Additionally, Vision Transformer (ViT) architectures have demonstrated immense potential in capturing long-range contextual information through global modeling capabilities [26], facilitating a more comprehensive understanding of the relationship between complex backgrounds and concealed targets. The evolution of these technologies provides diversified technical pathways for addressing detection hurdles in extremely covert environments.

2.2. Object Detection Under UAV Perspectives

Object identification under Unmanned Aerial Vehicle (UAV) perspectives is a pivotal technology in modern intelligent monitoring. Compared to traditional horizontal views, UAVs offer distinct advantages such as high mobility, broad field-of-view, and the ability to capture information from a vertical dimension [27]. However, the imaging mechanism of UAV platforms introduces specific technical challenges. A primary challenge is large-scale variation. Frequent fluctuations in flight altitude cause the target size within imagery to vary drastically, imposing requirements on an algorithm’s capability to extract scale-robust features [28]. Furthermore, geometric distortion and varied perspectives constrain recognition accuracy. The nadir view alters the topological structure of human targets, rendering traditional shape priors less effective and resulting in sparse discriminative information [29]. Additionally, aerial targets often occupy only a few pixels, making feature identification susceptible to background noise coupling, non-uniform illumination, and motion blur [30].
Despite these complexities, current UAV-based research remains predominantly focused on object detection with bounding boxes. For instance, VisDrone emphasizes vehicle and pedestrian monitoring in urban environments, while UAVDT focuses on traffic flow statistics and tracking [31]. In these studies, researchers primarily explore the optimization of anchor matching for small objects and feature fusion efficiency. However, in scenarios such as specialized wilderness search-and-rescue, coarse bounding box localization is often insufficient [32]. The “vanishing boundary” phenomenon makes precise localization using only boxes difficult.
Compared to mature detection datasets, UAV-perspective resources dedicated to pixel-level mask annotations remain relatively scarce. While semantic segmentation has seen application in UAV remote sensing (e.g., land-cover classification and crop monitoring) [33], mainstream UAV segmentation datasets, such as UAVid [34] and the Semantic Drone Dataset [35], are primarily tailored for multi-class semantic segmentation of salient, easily distinguishable landscape elements (e.g., roads, buildings, standard vehicles, and uncamouflaged pedestrians). These datasets generally operate on the assumption of high visual contrast between objects and their surroundings, and often fail to encompass the complex textural coupling characteristics found in real-world rugged outdoor environments. In high-precision scenarios such as wilderness search-and-rescue, targets are highly coupled with complex backgrounds, rendering salient multi-class segmentation models less effective. Consequently, there is a clear need for a specialized binary segmentation dataset like CPD-UAV, which shifts the focus from broad landscape parsing to the precise pixel-level localization of highly concealed, visually blended targets [36].

2.3. Camouflaged Object Detection Datasets

High-quality datasets serve as the primary engine driving the evolution of deep learning algorithms. Due to the inherent difficulty and unique characteristics of camouflaged objects, public benchmarks in this field have only gained widespread attention in recent years. The current academic landscape is primarily defined by several core datasets. A comprehensive comparison of these mainstream benchmarks, alongside our proposed CPD-UAV dataset, is summarized in Table 1.
The CHAMELEON [37] dataset represents an early, small-scale public benchmark consisting of only 76 high-quality images. Primarily collected from the internet using the keyword “camouflaged animals,” it focuses on biological organisms possessing physiological concealment traits within natural environments. Each image provides object-level ground truth maps and boundary annotations. Despite its limited scale, as a pioneer in the field, it is frequently utilized by the academic community for the preliminary verification of a model’s effectiveness in capturing basic natural camouflage patterns.
The CAMO [38] dataset comprises 1250 images partitioned into two subsets: the CAMO camouflaged set (1000 images) and a non-camouflaged set derived from MS-COCO (250 images). A distinguishing feature of this dataset is its first-of-its-kind integration of natural biological camouflage (e.g., chameleons, insects) with artificial camouflage, such as specialized artificial patterns and body art. This hybrid design introduces engineered disruptive textures that are significantly more challenging than natural scenes, providing a rigorous foundation for testing a deep model’s ability to handle cross-domain camouflage strategies and suppress interference from salient non-camouflaged objects.
The COD10K [39] dataset stands as the largest and most comprehensive benchmark in the field of COD, featuring 10,000 meticulously selected images categorized into 5 super-classes and 78 fine-grained sub-classes. It offers high-precision annotations ranging from bounding boxes and object/instance-level masks to fine matting-level labels. The targets within this collection span large, medium, and small scales, significantly advancing research into multi-scale feature extraction and precise edge segmentation, making it the mainstream choice for training high-performance deep COD models.
The NC4K [40] dataset serves as the largest specialized testing benchmark to date, encompassing 4121 high-resolution camouflaged images downloaded from the internet. While predominantly featuring natural camouflage, it also includes a deliberate proportion of artificial samples. Unlike previous datasets primarily intended for training, NC4K was designed as a “testing ground” for validating model generalization. Through large-scale and visually deceptive samples, it effectively measures an algorithm’s robustness and stability when confronted with unknown complex backgrounds and low-discriminability boundaries.
The ACD1K [41] (Adaptive Camouflaged Dataset) serves as a dedicated benchmark tailored for advanced field search applications, comprising 1078 high-resolution images partitioned into 748 for training and 330 for testing. Unlike traditional datasets that predominantly feature natural animal camouflage, ACD1K focuses exclusively on human-engineered concealment, such as individuals equipped with modern tactical outdoor gear and Ghillie suits. It was specifically designed to bridge the critical domain gap between biological camouflage research and real-world wilderness search scenarios. By embedding artificial targets within highly complex and diverse terrains, it rigorously evaluates an algorithm’s capability to penetrate manufactured disruptive patterns and achieve precise segmentation amidst severe background interference.
Despite the significant progress driven by the aforementioned datasets, existing benchmarks predominantly focus on horizontal or close-up perspectives and are largely biased toward natural or basic artificial camouflaged targets. In real-world field monitoring and search applications, drones are frequently deployed, introducing critical challenges such as extreme scale variations, top-down viewing angles, and specialized scenarios (e.g., locating lost personnel wearing environment-matching outdoor gear). To bridge this critical gap, we propose CPD-UAV, a specialized dataset dedicated to detecting visually blended personnel from a UAV perspective. Detailed construction processes and statistics of CPD-UAV will be elaborated in the next section.

3. The Proposed Dataset

COD is a specialized task aimed at identifying and segmenting targets that are visually fused with their surroundings through engineered or natural patterns. While existing benchmarks like COD10K and NC4K have advanced the field, they primarily focus on natural animal species captured from horizontal, ground-level perspectives. There is currently no dedicated benchmark for detecting highly concealed, visually blended individuals from Unmanned Aerial Vehicle (UAV) perspectives. This data gap severely limits the effectiveness of intelligent reconnaissance algorithms on aerial platforms, as ground-based models often fail to generalize to the drastic scale variations and geometric distortions inherent in UAV imagery. To address this, we propose CPD-UAV, a high-resolution benchmark dataset designed to provide a rigorous platform for advancing specialized rescue-oriented COD research.

3.1. Data Collection and Annotation

The CPD-UAV dataset was constructed using the DJI Mavic 3T enterprise drone (DJI, Shenzhen, China) as the primary acquisition platform. The drone integrates a visual imaging system featuring a 1/2-inch CMOS wide-angle camera with 48 effective megapixels. Field operations were conducted in jungle and mountainous terrains to simulate wilderness search environments. These terrains offer complex backgrounds characterized by dense foliage, rocky outcrops, and light-and-shadow variations, providing practical conditions for evaluating camouflage performance.
During the field phase, 15 video sequences were recorded under favorable weather conditions, primarily during morning and noon periods to ensure sufficient natural lighting. To represent real-world Search and Rescue (SAR) situations, our data collection protocol adhered to standard aerial reconnaissance paradigms. Flight trajectories were designed to mimic systematic SAR sweep patterns. The flight altitude was dynamically adjusted between 20 and 80 m, accompanied by varied oblique viewing angles, to replicate both the macroscopic scanning phase and the microscopic target confirmation phase typical in actual missions. Furthermore, to recreate the visual challenges of locating concealed personnel, target individuals were dressed in low-contrast outdoor gear and instructed to adopt irregular postures, such as crouching or hiding under foliage. This protocol ensures that the scale variations and “vanishing boundaries” captured in the dataset reflect the complexities of wilderness search operations.
To construct the image-based dataset, we sampled frames from the raw video footage at an interval of 10 frames. Additionally, to compensate for the potential lack of scene diversity in localized field collections, we supplemented the database with 132 camouflaged target images sourced from onboard vision sequences on the internet. Finally, a manual quality screening was performed to select representative images, resulting in a collection of 1061 images (as summarized in Figure 1).
The CPD-UAV dataset was annotated using the ISAT with Segment Anything (https://github.com/yatengLG/ISAT_with_segment_anything, accessed on 6 March 2026) interactive semi-automatic annotation software. This platform integrates the Segment Anything Model (SAM), allowing for pixel-level masks to be generated through human-computer interaction. All annotations were carefully refined to ensure that the masks align with the silhouettes of the target individuals, even in “vanishing boundary” regions where targets are coupled with complex backgrounds (as illustrated in Figure 2).
To maintain the consistency of the ground truth, each image underwent a “multiple independent annotations–cross-validation” protocol. Professional annotators were instructed to enclose visible target components, including backpacks and equipment, to maintain profile integrity. A senior verifier subsequently reviewed all masks to resolve discrepancies. To quantitatively validate the quality and precision of this annotation process, we randomly sampled a subset of 30 images to conduct an Inter-Annotator Agreement evaluation. A second independent annotator provided blind annotations for this subset. We calculated the Intersection over Union (IoU) between the two independent sets of masks, achieving a Mean IoU of 90.89%. This high level of agreement provides rigorous quantitative evidence that our ground truth masks are highly consistent and reliable, effectively capturing the ambiguous contours of camouflaged targets. Ultimately, the final dataset was partitioned into 848 images for training, 106 for validation, and 107 for testing.
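The inter-annotator agreement check described above can be illustrated with a minimal sketch. It assumes each annotation is available as a binary NumPy mask of identical shape; the function names `mask_iou` and `mean_iou` and the toy masks are hypothetical and are not taken from the authors' tooling.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over Union between two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must share the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)

def mean_iou(mask_pairs):
    """Average IoU over (first_annotator, second_annotator) mask pairs."""
    return float(np.mean([mask_iou(a, b) for a, b in mask_pairs]))

# Toy example: two 3x3 squares offset by one row on a 6x6 canvas.
first = np.zeros((6, 6), dtype=bool)
first[1:4, 1:4] = True
second = np.zeros((6, 6), dtype=bool)
second[2:5, 1:4] = True

pair_iou = mask_iou(first, second)  # intersection 6 px, union 12 px
subset_miou = mean_iou([(first, first), (first, second)])
```

In this toy case the single-pair IoU is 0.5 and the mean over the two pairs is 0.75. Applied to the 30 sampled images, the same averaging would yield the kind of Mean IoU figure reported above.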

3.2. Dataset Properties and Statistics

The CPD-UAV dataset is characterized by its high resolution, multi-scene diversity, and professional-grade annotations. All images in the dataset maintain a high-definition resolution of 1920 × 1080 pixels, ensuring that fine-grained textural details of camouflaged targets are preserved. The final collection comprises 1061 images, combining 929 field-captured samples with 132 high-quality web-sourced images. This hybrid composition ensures that the dataset encompasses both authentic mission-specific textures and broader environmental variations. The images are partitioned into a standard split: 848 for training, 106 for validation, and 107 for testing.
To ensure the robust generalization capability of camouflage object detection models, the CPD-UAV dataset encompasses a wide variety of terrain backgrounds. As illustrated in Figure 3a, the dataset features four primary scenarios: Jungle (46%), Snow (6.5%), Desert (5.5%), and Grassland (42%). This specific distribution reflects our practical data acquisition strategy. The vast majority of the images (Jungle and Grassland) were strictly captured via our own UAV flights. Due to the geographical and natural environmental constraints of our field collection sites, vegetation-rich environments naturally dominate the dataset. To mitigate the domain limitation caused by regional factors and to ensure comprehensive scene diversity, we strategically supplemented the dataset with high-quality, web-sourced airborne images featuring camouflaged targets in extreme environments (Snow and Desert). Although this results in a quantitatively imbalanced proportion, this hybrid construction is highly effective. The abundant field-captured vegetation data provides a solid foundation for learning complex structural disruptions (e.g., leaves and shadows), while the supplementary extreme scenarios serve as crucial regularization samples to prevent models from merely overfitting to green or brown color priors.
In addition to scene variations, the spatial location and relative scale of the targets significantly impact detection difficulty. Existing datasets often suffer from a “center bias,” where photographers unconsciously place the target in the middle of the frame. Figure 3b provides a comprehensive visualization of both attributes for the CPD-UAV dataset. In this scatter plot, the horizontal and vertical axes represent the normalized x and y coordinates of the target centroids (ranging from 0 to 1), demonstrating that the camouflaged objects are widely and randomly distributed across the entire frame without obvious geometric clustering. This robust spatial diversity effectively mitigates center bias and forces models to develop comprehensive global search capabilities rather than exploiting spatial priors. Simultaneously, the color intensity of each scatter point represents the relative scale of the target, defined as the ratio of the target’s pixel area to the total image area. As indicated by the color bar, the vast majority of the points are concentrated in the dark blue spectrum, revealing that a significant portion of the targets occupy less than 2.5% (0.025 on the color scale) of the image area. This prevalence of minute instances, primarily caused by the dynamic high-altitude UAV perspective, poses a formidable challenge for current visual models in extracting fine-grained, multi-scale camouflage features.
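As a rough illustration of how the two attributes in Figure 3b can be derived from a ground-truth mask, the sketch below computes the normalized centroid and the relative scale (target pixel area divided by image area). The helper name `centroid_and_scale` and the half-pixel centering convention are assumptions made for this example, not details given in the paper.

```python
import numpy as np

def centroid_and_scale(mask):
    """Return (cx, cy, scale) for a binary mask.

    cx, cy: target centroid normalized to [0, 1] by image width and height.
    scale:  target pixel area divided by total image area.
    """
    m = np.asarray(mask, dtype=bool)
    height, width = m.shape
    ys, xs = np.nonzero(m)
    if ys.size == 0:
        raise ValueError("mask contains no target pixels")
    # Assumption: pixel centers sit at integer index + 0.5.
    cx = float((xs.mean() + 0.5) / width)
    cy = float((ys.mean() + 0.5) / height)
    scale = float(ys.size / (height * width))
    return cx, cy, scale

# Toy example: a 10 x 20 pixel target in a 100 x 200 image.
toy_mask = np.zeros((100, 200), dtype=bool)
toy_mask[40:50, 20:40] = True
cx, cy, scale = centroid_and_scale(toy_mask)
```

Here the target covers 200 of 20,000 pixels, so the relative scale is 0.01, below the 0.025 level highlighted on the color bar, and the centroid lies at (0.15, 0.45).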
To provide a more detailed quantification of this scale distribution, we further present the statistical distribution of relative target areas in Figure 4. As shown in the histogram, the dataset exhibits a long-tail distribution characteristic. The vast majority of targets are categorized as ‘Tiny’ (relative area < 0.1%) or ‘Small’ (< 0.5%), with only a minor fraction extending into the medium scale. This statistical breakdown confirms that micro-targets under dynamic UAV altitudes constitute a primary challenge of the CPD-UAV benchmark, providing a practical data foundation for evaluating multi-scale feature alignment.
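The scale categories above can be expressed as a simple binning rule. The following sketch uses the two thresholds stated in the text (0.1% and 0.5% of the image area); the label for everything at or above 0.5%, the function names, and the example values are illustrative assumptions only.

```python
from collections import Counter

TINY_MAX = 0.001   # relative area below 0.1% of the image
SMALL_MAX = 0.005  # relative area below 0.5% of the image

def scale_category(relative_area):
    """Map a relative target area in [0, 1] to a scale label."""
    if not 0.0 <= relative_area <= 1.0:
        raise ValueError("relative_area must lie in [0, 1]")
    if relative_area < TINY_MAX:
        return "Tiny"
    if relative_area < SMALL_MAX:
        return "Small"
    # Assumption: the text does not give an upper bound for the medium scale.
    return "Medium or larger"

def scale_histogram(relative_areas):
    """Count how many targets fall into each scale label."""
    return Counter(scale_category(a) for a in relative_areas)

# Hypothetical relative areas, not values from the dataset.
example_areas = [0.0004, 0.0009, 0.003, 0.02]
example_counts = scale_histogram(example_areas)
```

For the four hypothetical values, two fall in ‘Tiny’, one in ‘Small’, and one in the remaining bin.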

3.3. Benchmark Evaluation on CPD-UAV

To ensure a rigorous and fair comparison, all seven state-of-the-art (SOTA) models—SINet [39], C2FNet [42], ZoomNet [43], BGNet [44], PRNet [45], SDRNet [46], and SAM2-UNet [47]—were re-trained from scratch on the proposed CPD-UAV dataset using their respective publicly available source codes. To avoid evaluation bias introduced by manual tuning, all models were trained strictly utilizing the default training hyperparameters and standard backbone architectures recommended in their official implementations. The detailed structural parameter counts (Params) and computational complexities (GFLOPs) for these configured models are systematically documented in Table 2. Furthermore, all benchmarking experiments were conducted on a HUAWEI MateBook GT 14 laptop equipped with an Intel Core Ultra 7 CPU, 32 GB of RAM, and an external NVIDIA GeForce RTX 3090 GPU, running on the Windows 11 operating system with the PyTorch framework (versions configured as specified in each model’s official repository).
Dataset Scale and Generalization Analysis. We acknowledge that CPD-UAV, comprising 1061 high-resolution images, is a relatively small-scale benchmark compared to generalized datasets like COD10K. In the field of COD, small-sample datasets often face potential challenges such as restricted scene diversity and the risk of over-fitting. However, in the context of specialized field search operations, data acquisition is inherently constrained by operational environments and security protocols. CPD-UAV serves as a domain-specific benchmark that captures the “vanishing boundary” and “scale variation” characteristics of the UAV perspective—attributes that are frequently diluted in generalized datasets. The performance of SOTA models on this benchmark suggests that a primary bottleneck in aerial COD is not merely the volume of data, but rather the structural limitations of existing architectures in handling the specific geometric distortions and signal sparsity of the UAV viewpoint.
Performance and Metric Paradox Analysis. As demonstrated in Table 2, although advanced architectures such as SDRNet and SAM2-UNet achieve high structural integrity ($S_m > 0.93$), the overall evaluation highlights a “background dominance” phenomenon inherent in high-altitude imagery. Specifically, since visually blended individuals occupy a small fraction of the total image area (often less than 2.5%), the Mean Absolute Error (MAE) becomes sensitive to the background-to-foreground ratio. In this context, even a model that fails to localize the target and outputs a near-blank mask could still incur a low MAE (e.g., < 0.04) due to high agreement with background pixels. This “metric paradox” implies that while a low MAE is a necessary condition for success, it is not a sufficient indicator of segmentation quality. Instead, the structural measure ($S_m$) and F-measure ($F_m$) provide a more reliable assessment. Furthermore, the performance gap between region-based structural metrics and boundary-sensitive metrics (e.g., $F_m$) across baseline models provides quantitative evidence of the “vanishing boundary” challenge.
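The “metric paradox” can be reproduced with a small worked example. The sketch below builds a synthetic ground truth in which the target covers 2.5% of the image and scores an all-background prediction; the synthetic arrays and the simple `foreground_recall` helper are assumptions for illustration and are not the evaluation code used for Table 2.

```python
import numpy as np

def mean_absolute_error(pred, gt):
    """Pixel-wise MAE between a prediction map and ground truth in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.abs(pred - gt).mean())

def foreground_recall(pred, gt, threshold=0.5):
    """Fraction of ground-truth target pixels recovered by the prediction."""
    gt_fg = np.asarray(gt) > 0.5
    hits = (np.asarray(pred) >= threshold) & gt_fg
    return float(hits.sum() / gt_fg.sum())

# Synthetic 100 x 100 ground truth: a 5 x 50 target, i.e. 2.5% of the image.
gt = np.zeros((100, 100))
gt[10:15, 10:60] = 1.0

# A model that misses the target entirely and outputs a blank mask.
blank = np.zeros_like(gt)

blank_mae = mean_absolute_error(blank, gt)   # 250 wrong pixels / 10,000
blank_recall = foreground_recall(blank, gt)  # no target pixel recovered
```

The blank prediction scores an MAE of 0.025 while recovering none of the target, which is why MAE alone cannot separate a complete miss from a good segmentation at this target scale.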
To provide an intuitive visual explanation of the quantitative results, Figure 5 presents a qualitative comparison of the evaluated models across representative aerial scenarios. As illustrated, algorithms like SINet and PRNet frequently generate diffuse, unstructured activation maps, struggling to delineate the camouflaged targets, which directly corroborates their lower structure ($S_m$) and boundary ($F_m$) scores. More importantly, the visual results illustrate the previously discussed “metric paradox.” In challenging scenarios with dense clutter or scale variations (e.g., the rocky terrains in the 4th and 5th rows), several models either suffer from false positives or fail to detect the target, outputting nearly blank masks. While these blank masks mathematically yield low pixel-level errors (MAE) due to the predominantly dark Ground Truth, the visual comparison exposes their functional failure in pinpointing the targets. Qualitatively, the manifestation of model failure is often not entirely missing the target, but rather generating diffuse activation maps without crisp contours against homogeneous backgrounds. This analytical combination of metric disparities and visual diffusion confirms that overcoming “vanishing boundaries” remains a primary bottleneck in aerial COD, providing a technical motivation for our proposed RGAM module.

4. Proposed Method

4.1. Overall Framework Formulation

Contemporary COD models predominantly adopt an encoder–decoder architecture. Formally, given an input aerial image $I \in \mathbb{R}^{H \times W \times 3}$, a hierarchical encoder (e.g., ResNet) is employed to extract a set of multi-scale features, representing low-level spatial details to high-level semantic abstractions. Subsequently, a top-down decoder progressively fuses these features to predict the final camouflage mask.
While this paradigm achieves satisfactory results in generic scenarios, it exhibits noticeable limitations in UAV-based detection. Due to scale variations and vanishing boundaries in aerial imagery, the weak activation signals of minute targets are frequently diluted or completely lost during the top-down multi-scale fusion process. Therefore, rather than designing a cumbersome, specialized new network from scratch, we propose a lightweight, plug-and-play architectural component: the Residual Gated Alignment Module (RGAM). RGAM is specifically formulated to replace redundant fusion layers within the decoding stage of existing architectures. It acts as a robust semantic anchor, precisely capturing and preserving high-frequency target features from deep layers while explicitly filtering out background noise from shallow layers.

4.2. Residual Gated Alignment Module

To tackle the dilemma of small target signal degradation during downsampling, the proposed Residual Gated Alignment Module (RGAM) is designed as a holistic architectural component. It aligns heterogeneous features across adjacent stages through a sequence of channel projection, semantic gating, and efficient fusion. As illustrated in Figure 6, RGAM takes two inputs: a deep semantic feature $X_{deep} \in \mathbb{R}^{512 \times H \times W}$ and a shallower spatial feature $X_{shallow} \in \mathbb{R}^{256 \times 2H \times 2W}$.
Channel Projection and Spatial Alignment. Initially, both features possess disparate channel capacities and spatial resolutions. To map them into a unified representational space, we first project them to a predefined channel dimension $C = 128$ using $1 \times 1$ convolutions followed by Batch Normalization ($\mathrm{BN}$) and GELU activation ($\delta$):
$$X'_{deep} = \delta(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X_{deep})))$$
$$X_{s\_red} = \delta(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X_{shallow})))$$
Unlike the traditional ReLU, which applies a hard truncation ($\max(0, x)$) and risks causing “dead neurons” that permanently discard weak boundary signals, we employ the exact version of the GELU function. Mathematically, it is defined based on the standard normal cumulative distribution function $\Phi$, which can be computed using the error function (erf):
$$\delta(x) = x\,\Phi(x) = 0.5\,x\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$
This gating, which weights each input by the probability $\Phi(x)$ instead of applying a hard threshold, provides a smooth, non-monotonic nonlinearity. When processing the high-resolution imagery of the CPD-UAV dataset, this smoothness is crucial. Instead of harshly zeroing out negative responses, GELU gently preserves the subtle, fine-grained semantic gradients of minute targets. By retaining these fragile activation signals against overwhelming background noise, GELU effectively mitigates the “vanishing boundary” problem, ensuring that the faint transitions between camouflaged targets and their surroundings are not lost during feature extraction.
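The difference between the two activations on weak negative responses can be checked numerically. This is a small illustrative sketch using only the Python standard library; the function names and the sample inputs are assumptions, and the derivative $\Phi(x) + x\,\varphi(x)$ follows directly from differentiating $x\,\Phi(x)$.

```python
import math


def relu(x):
    """Hard truncation: max(0, x)."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi computed from the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """Derivative of x * Phi(x): Phi(x) + x * phi(x), phi being the normal pdf."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


# Weak negative pre-activations, such as a faint boundary response might produce.
for x in (-1.0, -0.5, -0.1, 0.0, 0.1):
    relu_grad = 1.0 if x > 0 else 0.0
    print(f"x={x:+.1f}  relu={relu(x):+.4f} (grad {relu_grad:.1f})  "
          f"gelu={gelu(x):+.4f} (grad {gelu_grad(x):+.4f})")
```

For slightly negative inputs ReLU returns zero with zero gradient, whereas GELU returns a small negative value with a non-zero gradient, which is the property the paragraph above relies on.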
Since the projected deep feature $X'_{deep}$ encapsulates abstract semantics but lacks the spatial scale of $X_{s\_red}$, we subsequently apply bilinear interpolation $\mathcal{U}(\cdot)$ to upsample it by a factor of 2, achieving spatial alignment:
$$X_{d\_up} = \mathcal{U}(X'_{deep}) \in \mathbb{R}^{128 \times 2H \times 2W}$$
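The ×2 bilinear upsampling step can be sketched on a single channel as follows. This is an illustration only: the text does not state which sampling convention is used, so the half-pixel-centre mapping with edge clamping below is an assumption, and `upsample_1d` and `upsample_2d` are hypothetical helper names.

```python
def upsample_1d(row, scale=2):
    """Linear interpolation of one row using half-pixel-centre coordinates."""
    n = len(row)
    out = []
    for i in range(n * scale):
        src = (i + 0.5) / scale - 0.5        # output centre mapped to input space
        src = min(max(src, 0.0), n - 1.0)    # clamp at the borders
        lo = int(src)                        # floor, since src >= 0
        hi = min(lo + 1, n - 1)
        w = src - lo
        out.append((1.0 - w) * row[lo] + w * row[hi])
    return out


def upsample_2d(fmap, scale=2):
    """Bilinear upsampling as two separable 1-D passes (width, then height)."""
    widened = [upsample_1d(r, scale) for r in fmap]
    columns = [upsample_1d(list(c), scale) for c in zip(*widened)]
    return [list(r) for r in zip(*columns)]  # transpose back to row-major


# One channel of a hypothetical H x W = 2 x 3 deep feature map.
deep = [[0.0, 1.0, 2.0],
        [2.0, 3.0, 4.0]]
up = upsample_2d(deep)
print(len(up), "x", len(up[0]))  # spatial size becomes 2H x 2W
```

Applied per channel, this doubles both spatial dimensions while leaving the channel count unchanged, matching the shape stated for $X_{d\_up}$.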
Residual Gating Mechanism. The core innovation of RGAM lies in how it handles the aligned features. Straightforward concatenation or addition often introduces severe aliasing and overwhelms the minute targets with shallow background noise. To circumvent this, we utilize the upsampled deep feature $X_{d\_up}$ to dynamically generate a spatial attention map $G$ via a Sigmoid function $\sigma$. This gate map serves as a semantic anchor, which is then used to explicitly modulate the dimension-reduced shallow feature $X_{s\_red}$. To preserve essential local boundaries while amplifying the weak signals of camouflaged objects, a residual connection is established:
$$G = \sigma(X_{d\_up})$$
$$X_{s\_aligned} = X_{s\_red} + (X_{s\_red} \otimes G)$$
where ⊗ denotes element-wise multiplication. By performing this residual gating, RGAM filters out irrelevant clutter while enforcing the semantic integrity of the targets. This mechanism is particularly crucial under localized low-visibility conditions, such as deep shadows or uneven terrain illumination, where the target’s Signal-to-Noise Ratio (SNR) is drastically reduced. In such low-SNR scenarios, traditional fusion strategies easily drown faint boundary signals in amplified background clutter. By utilizing the deep semantic anchor to explicitly suppress the overwhelming noise while reinforcing the weak target signals via the residual pathway, RGAM effectively maintains algorithmic robustness against illumination-induced ‘vanishing boundaries’.
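A small numerical sketch of the residual gate on one channel is given below. The 2 × 2 feature values are hypothetical and the helper names are assumptions; only the two gating equations are taken from the text. Because the Sigmoid output lies strictly between 0 and 1, the formula is equivalent to scaling the shallow feature by a factor $1 + G$ between 1 and 2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def residual_gate(x_s_red, x_d_up):
    """X_s_aligned = X_s_red + X_s_red * sigmoid(X_d_up), element-wise."""
    return [[s + s * sigmoid(d) for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(x_s_red, x_d_up)]


# Hypothetical values: a uniformly weak shallow response, and a deep feature
# that is confident about a target only at the top-left position.
x_s_red = [[0.2, 0.2],
           [0.2, 0.2]]
x_d_up = [[4.0, -4.0],
          [-4.0, -4.0]]

aligned = residual_gate(x_s_red, x_d_up)
gains = [[a / s for a, s in zip(a_row, s_row)]
         for a_row, s_row in zip(aligned, x_s_red)]
print(gains)  # per-position gain 1 + G, always between 1 and 2
```

Under this reading, the gate raises target positions relative to the background (close to a factor of 2 versus close to 1) while the residual path guarantees that no shallow response is zeroed out.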
Efficient Feature Fusion and Prediction. The upsampled deep feature $X_{d\_up}$ and the aligned shallow feature $X_{s\_aligned}$ are concatenated along the channel dimension ($dim = 1$) to form $X_{cat} \in \mathbb{R}^{256 \times 2H \times 2W}$. To ensure computational efficiency without sacrificing modeling capacity, $X_{cat}$ is smoothed through an efficient fusion block ($\mathcal{F}_{fuse\_conv}$). Specifically, this block comprises two stages: a $3 \times 3$ Depthwise Convolution (DwConv, with $groups = 256$) to capture spatial correlations independently per channel, followed by a $1 \times 1$ Pointwise Convolution (PwConv) to facilitate cross-channel information exchange and reduce the channel dimension back to 128:
$$X_{out} = \mathcal{F}_{fuse\_conv}([X_{d\_up}, X_{s\_aligned}])$$
where $[\cdot, \cdot]$ denotes the concatenation operation. Finally, the output feature $X_{out}$ branches into two distinct paths. One path (‘Out’) propagates the feature to the next network stage. The other path passes through a Prediction Head (a $1 \times 1$ convolution that compresses the channel dimension to $C = 1$, followed by bilinear interpolation to the original training image size), yielding the intermediate Prediction Map for that stage.
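The saving from the depthwise-plus-pointwise design can be estimated by counting convolution weights. The sketch below is a back-of-the-envelope comparison, not a figure reported in the paper: it assumes bias-free convolutions, ignores normalization parameters, and compares against a hypothetical standard 3 × 3 convolution mapping 256 channels to 128.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Weight count of a 2-D convolution without bias (an assumption here)."""
    return (c_in // groups) * c_out * kernel * kernel


# Fusion block as described: 3x3 depthwise (groups = 256), then 1x1 pointwise.
depthwise = conv_weights(256, 256, 3, groups=256)
pointwise = conv_weights(256, 128, 1)
separable = depthwise + pointwise

# Hypothetical alternative: one standard 3x3 convolution from 256 to 128 channels.
standard = conv_weights(256, 128, 3)

print(f"depthwise={depthwise}, pointwise={pointwise}, separable total={separable}")
print(f"standard 3x3={standard}, ratio={standard / separable:.1f}x")
```

Without bias terms, the same counts also give the multiply-accumulate operations per output position, so under these assumptions the separable block needs roughly one eighth of the weights and per-pixel compute of the standard convolution.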

4.3. Effectiveness and Generalization of RGAM

To validate the plug-and-play capability and the effectiveness of the proposed Residual Gated Alignment Module (RGAM), we conduct comprehensive ablation studies across three representative baseline architectures: SINet, BGNet, and PRNet. As reported in Table 3, integrating RGAM into the decoding stages of these models yields consistent performance improvements, achieving a competitive balance for UAV-perspective detection.
To achieve these gains without imposing a heavy computational burden, we carefully tailored the integration strategy of RGAM. Specifically, for SINet, we replaced its computationally expensive Receptive Field (RF) blocks and complex Partial Decoder Components (PDC) with RGAM. For BGNet, we substituted the original Context-aware Attention Module (CAM) and basic fusion layers with RGAM. For PRNet, RGAM replaced the standard concatenation-based fusion pathways prior to the MLP-based refinement stages.
To isolate the performance gains brought by RGAM from the benefits of removing these heavier components, we introduce the “-Light” variants as controlled baselines. The “-Light” versions represent the pruned architectures where the aforementioned cumbersome components have been removed (and replaced with basic connections to maintain dimensional alignment), but the RGAM module has not been added. As shown in Table 3, compared to the “-Light” baselines, the integration of RGAM yields accuracy improvements under a controlled computational budget (e.g., identical 48.07 GFLOPs for BGNet and a 0.08M parameter addition). Furthermore, it is a well-known deployment phenomenon that simply pruning a network (e.g., PRNet-Light and BGNet-Light) can harm practical hardware efficiency (FPS) due to unoptimized, fragmented memory access operations. Integrating RGAM resolves this memory bottleneck through an optimized feature flow, recovering and improving the practical inference speed (e.g., PRNet-RGAM to 26.20 FPS, and SINet-RGAM to 72.55 FPS). This confirms RGAM as a cost-effective, hardware-friendly module suited for the resource constraints of UAV platforms. Indeed, the successful deployment of sophisticated detection modules in practical field missions requires a strict synergy between high-performance algorithms and optimized electronic components. Such hardware-software co-design is crucial to ensure the operational efficiency and sustainability of modern drone prototypes [48]. By fundamentally resolving memory access bottlenecks and accelerating inference, RGAM seamlessly aligns with the wider context of sustainable electronic optimization in UAV systems.
Furthermore, as introduced in Section 3.3, relying solely on MAE can be deceptive due to the “metric paradox”. The ablation results demonstrate that RGAM plays a role in mitigating this evaluation bias. When equipped with RGAM, the models not only experience a further reduction in MAE, but more importantly, exhibit improvements in core structural metrics (i.e., $S_m$ and $E_m$). The simultaneous improvement across these indicators proves that the model’s capability to differentiate visually blended targets from complex terrain clutter has been enhanced, alleviating the evaluation bias caused by “background dominance.” Nevertheless, we objectively observe that in certain complex scenarios with weak target signals, the phenomenon of missing small targets still persists to some extent, which remains an open challenge for future research.
To visually corroborate the quantitative evaluations, Figure 7, Figure 8 and Figure 9 provide qualitative comparisons among the baseline models, their “-Light” variants, and the RGAM-integrated counterparts. Given the difficulty of visually evaluating minute camouflaged objects from a global aerial perspective, local zoomed-in patches (highlighted with red boxes) focused on the target regions are embedded within the corresponding prediction masks. Note that these patches are omitted only under specific conditions: (1) when a model suffers from a complete missed detection resulting in a blank mask (specifically, the fourth row of BGNet-Light in Figure 8), or (2) when the target occupies a relatively large proportion of the image and the macroscopic visualization is sufficient (specifically, the third row in Figure 9). Where the local patches are provided, they show that the RGAM module helps the networks achieve sharper boundaries and better background noise suppression compared to the original baselines and the “-Light” variants.
The original SINet baseline and SINet-Light (Figure 7) exhibit noticeable limitations when processing minute targets against complex backgrounds. As seen in the magnified patches (e.g., the first and second rows), the baseline often loses the target signal or generates diffuse, structureless noise. Following the integration of the RGAM module, the network successfully recaptures these weak signals, improving the sharpness of the predicted boundaries while reducing background interference.
Figure 8 validates the generalization efficacy of the RGAM module on the BGNet architecture. Except for the fourth row of BGNet-Light, which presents a completely blank mask due to a missed detection and thus lacks a zoomed-in patch, all other rows provide local close-ups for comparison. Visual inspection of these patches reveals that although the baseline BGNet and BGNet-Light can localize the targets, their predicted boundaries remain blurred and lack the edge sharpness achieved by BGNet+RGAM. In contrast, BGNet+RGAM delivers noticeably clearer boundaries, demonstrating the module’s capability in refining target structures and suppressing background noise.
Figure 9 presents the detection results based on the PRNet architecture. In this comparison, the third row contains a relatively large target where the macroscopic masks are clear enough, so zoomed-in patches are naturally omitted. For the other rows containing minute targets, both PRNet and PRNet-Light frequently introduce false-positive activations or irregular target shapes within their zoomed patches. The PRNet+RGAM configuration demonstrates a better capability to inhibit these erroneous background activations, reducing the false detection rate and improving the boundary clarity of the targets.
Mechanistically, this robust generalization stems from the residual gating operation within RGAM. In the original top-down decoders of the baseline models, the weak activation signals of small targets are easily overwhelmed by the vast background noise present in shallow spatial features. By embedding RGAM, the deep semantic features are leveraged to generate a spatial attention map ($G$). This map functions as a precise semantic anchor, explicitly modulating the shallow features ($X_{s\_red}$) to filter out irrelevant high-resolution distractors while amplifying the crucial localized signals of the targets. Although a negligible drop in $F_m$ is observed for SINet (from 0.674 to 0.671), the overwhelming gains in background suppression, structural integrity, and computational speed demonstrate that RGAM effectively bridges the semantic gap across heterogeneous features, mitigating the “vanishing boundary” problem inherent in aerial imagery.

4.4. Cross-Dataset Generalization Analysis

To explicitly address whether the proposed RGAM generalizes beyond the specific UAV perspective and extreme scale variations of the CPD-UAV benchmark, we conducted rigorous cross-dataset validation experiments. Following standard camouflaged object detection protocols, the baseline models and their RGAM-integrated counterparts were retrained on a widely used public COD training set (comprising 4040 diverse natural camouflaged images) using the exact same hyperparameters and training epoch configurations as our primary experiments. The models were subsequently evaluated on the widely recognized COD10K testing set.
As reported in Table 4, the integration of RGAM yields consistent improvements across the primary structural metrics on this generalized dataset. Notably, the enhanced-alignment measure ($E_m$) and structure-measure ($S_m$) improved across all three evaluated architectures. It is worth noting that the overall performance scores for SINet are relatively low. This is primarily because we strictly maintained the same limited training epochs used for the smaller CPD-UAV dataset to ensure controlled experimental conditions. For a significantly larger training set of 4040 images, this restricted epoch count inevitably leads to under-training. Nevertheless, even under these suboptimal training conditions, the SINet+RGAM variant still achieves significant relative improvements over its baseline (e.g., $E_m$ rising from 0.594 to 0.643). This confirms that RGAM effectively accelerates feature alignment and enhances structural representation even when the base model is under-fitted.
Furthermore, while we observe minor metric fluctuations typical of architectural trade-offs, such as a slight decrease in $F_m$ for PRNet (0.813 to 0.811) and an increase in MAE for SINet, these are acceptable consequences of RGAM’s residual gating mechanism. This mechanism acts as a spatial filter that prioritizes global structural integrity over localized pixel-level sharpness. Overall, these cross-dataset evaluations confirm that RGAM is not over-fitted to aerial imagery, but rather serves as a robust, plug-and-play architectural component capable of generalizing to conventional ground-level natural camouflage scenarios.

5. Conclusions

5.1. Contributions and Impact

This paper explores the domain shift between conventional COD and practical field search operations by introducing CPD-UAV, a specialized benchmark dataset tailored for detecting visually blended personnel from UAV perspectives. Comprising 1061 annotated images across diverse scenarios, CPD-UAV helps evaluate the performance limitations of existing state-of-the-art models. Our evaluation observes the ’metric paradox’ in aerial imagery, demonstrating that traditional pixel-level error metrics can mask functional localization failures.
To tackle the degradation of minute target signals during top-down multi-scale fusion, we propose a lightweight architectural component: the Residual Gated Alignment Module (RGAM). By utilizing deep-layer representations as semantic anchors to filter shallow-layer background noise, RGAM helps preserve the structural details of micro-targets. Evaluations demonstrate that integrating RGAM into representative baselines yields consistent improvements in structural integrity ($S_m$) and, in most cases, boundary localization ($F_m$), while reducing computational overhead (GFLOPs) and boosting inference speed (FPS) relative to the original baselines. Concurrently, the efficient RGAM provides a practical solution for handling the multi-scale segmentation challenges specific to this domain.
Beyond the immediate algorithmic improvements, this study demonstrates that penetrating advanced camouflage requires pushing beyond the pixel limit. Detecting visually blended targets necessitates an analysis of global contextual relationships, empowering the model to ’reason’ and infer the presence of a target even when clear boundaries and colors are non-existent. Furthermore, the efficiency of specialized components like GELU underscores the critical importance of managing weak signals and complex gradients to prevent the loss of vital semantic cues. Ultimately, these contextual reasoning capabilities hold profound implications for the broader field of public safety and emergency response. By enabling highly reliable micro-target detection, such technologies can significantly enhance the effectiveness of autonomous UAVs in critical humanitarian missions, such as disaster relief and wilderness search and rescue, ensuring intelligent systems can effectively safeguard human lives.

5.2. Limitations and Future Work

Despite the utility of the CPD-UAV benchmark and the proposed RGAM, we acknowledge limitations in this study. Firstly, the dataset’s scale (1061 images) is relatively small for deep learning paradigms, and due to geographical constraints, the scene distribution is currently dominated by vegetation-rich environments (e.g., jungle and grassland). Secondly, to isolate and evaluate the algorithms’ capability to penetrate optical camouflage patterns without the confounding variables of image degradation, our data collection was primarily conducted under favorable daylight visibility. Consequently, the dataset currently lacks environmental variations such as fog, rain, extreme low-light conditions, or seasonal variations, which are highly relevant to real-world SAR missions.
In future work, we plan to address these data limitations by incorporating advanced data augmentation and synthetic data generation techniques (e.g., diffusion models) to synthetically scale up the training samples. Furthermore, additional field acquisitions will be conducted to encompass a wider variety of environments—such as urban ruins and aquatic scenes—alongside diverse adverse weather conditions and varying illumination levels. Additionally, future algorithmic research will focus on exploring temporal dynamics in UAV video streams and multi-modal feature fusion strategies, thereby contributing to the continuous development of robust aerial monitoring systems.

Author Contributions

Conceptualization, X.Z. and Y.P.; methodology, X.Z., H.H. and Y.P.; software, X.Z., H.H. and X.Y.; validation, W.K., W.T. and Q.L.; formal analysis, W.K. and X.Z.; investigation, Q.L. and L.H.; resources, Y.P. and L.H.; data curation, W.T. and X.Y.; writing—original draft preparation, X.Z., W.K. and H.H.; writing—review and editing, Y.P., X.Z., W.K., W.T., Q.L., H.H., L.H. and X.Y.; visualization, X.Z., Q.L. and W.K.; supervision, Y.P.; project administration, Y.P.; funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research and Innovation Team Project on Theory and Practice of Intelligent Command Information Systems in Engineering University of PAP; Comprehensive Equipment Research Project [grant no. WJ2025C0401013]; and Basic Frontier Innovation Project [grant no. WJY202509].

Data Availability Statement

The project repository (https://github.com/18510004596/CPD-UAV, accessed on 18 April 2026) currently provides a representative subset of source images and pixel-level annotations for preliminary evaluation and demonstration. To ensure ethical compliance and mitigate potential dual-use risks, the full CPD-UAV dataset, consisting of 1061 high-resolution images and complete masks, will not be made publicly available. Researchers requiring access to the complete dataset must submit a formal request to the corresponding author. Access will be granted strictly on a case-by-case basis, subject to a rigorous evaluation by our research team to ensure the intended usage is confined to verified, non-commercial academic research in humanitarian fields, such as search and rescue.

DURC Statement

Current research is limited to the field of wilderness search and rescue (SAR) technology, which is beneficial for improving public safety by locating missing persons wearing low-contrast gear, and does not pose a threat to public health or national security. Authors acknowledge the dual-use potential of the research involving the detection of visually camouflaged individuals and confirm that all necessary precautions have been taken to prevent potential misuse, including systematically replacing combat-oriented terminology with neutral terms and restricting complete dataset access to verified researchers via formal request. As an ethical responsibility, authors strictly adhere to relevant national and international laws about DURC. Authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Acknowledgments

The authors would like to thank all coordinators and supervisors involved and the anonymous reviewers for their detailed comments that helped to improve the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
  2. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A survey of object detection for UAVs based on deep learning. Remote Sens. 2024, 16, 149.
  3. Luo, Z.; Liu, N.; Zhao, W.; Yang, X.; Zhang, D.; Fan, D.P.; Khan, F.; Han, J. VSCode: General visual salient and camouflaged object detection with 2D prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Denver, CO, USA, 3–7 June 2024; pp. 28726–28736.
  4. Liu, K.; Li, A.; Yang, S.; Wang, C.; Zhang, Y. Multi-scale attention and boundary-aware network for military camouflaged object detection using unmanned aerial vehicles. Signal Image Video Process. 2025, 19, 184.
  5. He, C.; Li, K.; Zhang, Y.; Zhang, Y.; You, C.; Guo, Z.; Li, X.; Danelljan, M.; Yu, F. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024; pp. 36657–36675.
  6. Xu, Z.; Zhao, H.; Liu, P.; Wang, L.; Zhang, G.; Chai, Y. SRTSOD-YOLO: Stronger real-time small object detection algorithm based on improved YOLO11 for UAV imageries. Remote Sens. 2025, 17, 3414.
  7. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale feature fusion small object detection network for UAV aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214.
  8. Wang, P.; Zhao, Y.; Hu, Z. Boundary-aware camouflaged object detection via spatial-frequency domain supervision. Electronics 2025, 14, 2541.
  9. Ni, X.; Wong, Z.J.; Mrejen, M.; Wang, Y.; Zhang, X. An ultrathin invisibility skin cloak for visible light. Science 2015, 349, 1310–1314.
  10. Yuan, C.; Liu, L.; Li, Y.; Li, J. SAM2-DFBCNet: A camouflaged object detection network based on the Heira architecture of SAM2. Sensors 2025, 25, 4509.
  11. Khan, A.; Khan, M.; Gueaieb, W.; El Saddik, A.; De Masi, G.; Karray, F. CamoFocus: Enhancing camouflage object detection with split-feature focal modulation and context refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1434–1443.
  12. Hwang, K.-S.; Ma, J. Military camouflaged object detection with deep learning using dataset development and combination. J. Def. Model. Simul. 2026, 23, 67–78.
  13. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042.
  14. Wan, Z.; Lan, Y.; Xu, Z.; Shang, K.; Zhang, F. DAU-YOLO: A lightweight and effective method for small object detection in UAV images. Remote Sens. 2025, 17, 1768.
  15. Sun, Y.; Wang, S.; Chen, C.; Xiang, T.Z. Boundary-guided camouflaged object detection. arXiv 2022, arXiv:2207.00794.
  16. Zhang, X.; Gao, M.; Gao, G.; Wang, X.; Wang, Q. Edge-guided multilevel feature fusion network for lightweight camouflaged object detection. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7.
  17. Shang, Y.; Wang, L.; Dong, J.; Dong, X. Boundary-aware distracted attention network for camouflaged object detection. IEEE Trans. Artif. Intell. 2026, in press.
  18. Alghamdi, L.; Usman, M.; Anwar, H.; Bais, A.; Anwar, S. MSRNet: A multi-scale recursive network for camouflaged object detection. arXiv 2025, arXiv:2511.12810.
  19. Xiao, F.; Hu, S.; Shen, Y.; Fang, C.; Huang, J.; He, C.; Tang, L.; Yang, Z.; Li, X. A survey of camouflaged object detection and beyond. arXiv 2024, arXiv:2408.14562.
  20. Khan, A.; Ullah, H.; Munir, A. LiteCOD: Lightweight camouflaged object detection via holistic understanding of local-global features and multi-scale fusion. AI 2025, 6, 197.
  21. Sun, Y.; Xuan, H.; Yang, J.; Luo, L. GLCONet: Learning multisource perception representation for camouflaged object detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 13262–13275.
  22. Liang, Y.; Qin, G.; Sun, M.; Wang, X.; Yan, J.; Zhang, Z. A systematic review of image-level camouflaged object detection with deep learning. Neurocomputing 2024, 566, 127050.
  23. Yan, J.; Le, T.N.; Nguyen, K.D.; Tran, M.T.; Do, T.T.; Nguyen, T.V. MirrorNet: Bio-inspired camouflaged object segmentation. IEEE Access 2021, 9, 43290–43300.
  24. Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wang, M.; Qin, J. I can find you! Boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, 22 February–1 March 2022; pp. 3608–3616.
  25. Le, M.Q.; Tran, M.T.; Le, T.N.; Nguyen, T.V.; Do, T.T. CamoFA: A learnable Fourier-based augmentation for camouflage segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 3427–3436.
  26. Yin, B.; Zhang, X.; Fan, D.P.; Jiao, S.; Cheng, M.M.; Van Gool, L.; Hou, Q. CamoFormer: Masked separable attention for camouflaged object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10623294.
  27. Habash, N.; Abu Alqumsan, A.; Zhou, T. Recent real-time aerial object detection approaches, performance, optimization, and efficient design trends for onboard performance: A survey. Sensors 2025, 25, 7563.
  28. Huang, M.; Jiang, W. DMS-YOLO: Small target detection algorithm based on YOLOv11. PLoS ONE 2026, 21, e0341991.
  29. Berndt, J.; Meissner, H.; Kraft, T. On the accuracy of YOLOv8-CNN regarding detection of humans in nadir aerial images for search and rescue applications. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 139–146.
  30. Wang, X.; Fang, H.; Li, Q.; Wang, L.; Chang, Y.; Yan, L. Blur-robust detection via feature restoration: An end-to-end framework for prior-guided infrared UAV target detection. Proc. AAAI Conf. Artif. Intell. 2026, 40, 10181–10189.
  31. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 200561.
  32. Liu, J.; Kong, L.; Chen, G. Improving SAM for camouflaged object detection via dual stream adapters. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 21906–21916.
  33. Zheng, Z.; Yuan, J.; Yao, W.; Yao, H.; Liu, Q.; Guo, L. Crop classification from drone imagery based on lightweight semantic segmentation methods. Remote Sens. 2024, 16, 4099.
  34. Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119.
  35. Cai, W.; Jin, K.; Hou, J.; Guo, C.; Wu, L.; Yang, W. VDD: Varied Drone Dataset for semantic segmentation. arXiv 2023, arXiv:2305.13608.
  36. Huang, S.; Hu, M.; Zou, L.; Chi, H.; Li, Z.; Gao, F.; Yang, F.; Wu, Q.; Chen, K. UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–25 June 2026; pp. 40468–40478.
  37. Skurowski, P.; Abdulameer, H.; Błaszczyk, J.; Depta, T.; Kornacki, A.; Kozieł, P. Animal camouflage analysis: Chameleon database. Unpublished manuscript 2018, 2, 7.
  38. Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56.
  39. Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2774–2784.
  40. Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 11591–11601.
  41. Haider, A. Adaptive Camouflaged Dataset (ACD1K). Kaggle. 2023. Available online: https://www.kaggle.com/datasets/aalihhiader/military-camouflage-soldiers-dataset-mcs1k (accessed on 21 April 2026).
  42. Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993.
  43. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2160–2170.
  44. Chen, T.; Xiao, J.; Hu, X.; Zhang, G.; Wang, S. Boundary-guided network for camouflaged object detection. Knowl.-Based Syst. 2022, 248, 108901.
  45. Hu, X.; Zhang, X.; Wang, F.; Sun, J.; Sun, F. Efficient camouflaged object detection network based on global localization perception and local guidance refinement. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5452–5465.
  46. Guan, J.; Fang, X.; Zhu, T.; Qian, W. SDRNet: Camouflaged object detection with independent reconstruction of structure and detail. Knowl.-Based Syst. 2024, 299, 112051.
  47. Chen, T.; Lu, A.; Zhu, L.; Ding, C.; Yu, C.; Ji, D.; Li, Z.; Sun, L.; Mao, P.; Zang, Y. Sam2-adapter: Evaluating & adapting segment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more. arXiv 2024, arXiv:2408.04579.
  48. Bibbo, L.; Genovese, E.; Maesano, C.; Calluso, S.; Barrile, G.; Meduri, G.M.; Bilotta, A.; Caroti, G.; Piemonte, A.; Barrile, V. Electronic components and key algorithms for a prototype drone: Economic and sustainability advantages. AIMS Electron. Electr. Eng. 2026, 10, 92–128.
Figure 1. The data construction pipeline of CPD-UAV. The raw UAV videos undergo 10-frame sampling and screening before merging with web-sourced images. The integrated data is then processed through SAM-assisted annotation and expert cross-validation to ensure pixel-level ground truth.
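The sampling and screening step named in this caption can be illustrated with a minimal sketch. It assumes that "10-frame sampling" means keeping one frame out of every ten; the function names `sample_frame_indices` and `screen_frames` and the `keep` predicate are hypothetical and are not taken from the paper.

```python
def sample_frame_indices(total_frames, step=10, offset=0):
    """Return the indices kept when taking one frame every `step` frames.

    Assumption: "10-frame sampling" means a fixed stride of 10 frames.
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(offset, total_frames, step))


def screen_frames(indices, keep):
    """Keep only the sampled frames accepted by a screening predicate.

    `keep` is a hypothetical stand-in for the screening criteria, which
    the caption does not specify (for example, blur or duplicate checks).
    """
    return [i for i in indices if keep(i)]


# A 35-frame clip yields four candidate frames before screening.
candidates = sample_frame_indices(35)
```

A fixed stride reduces near-duplicate frames from continuous video, while the separate screening pass leaves room for whatever quality checks the pipeline actually applies.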
Figure 2. The pixel-level annotation pipeline of the CPD-UAV dataset. From left to right: (1) original image captured by UAV; (2) SAM prompt showing the initial interactive segmentation mask; (3) refined mask generated after manual correction; (4) zoomed-in details highlighting the boundary alignment; and (5) the final binary GT (Ground Truth).
Figure 3. Statistical properties of the proposed CPD-UAV dataset. (a) The proportion of diverse terrain backgrounds, highlighting the scene richness. (b) The joint distribution of target centroids and relative scales, illustrating the highly random spatial distribution and the prevalence of extremely small targets due to the dynamic UAV altitude.
Figure 4. The quantitative histogram of relative target areas in the CPD-UAV dataset. The distribution exhibits a long-tail characteristic, demonstrating the prevalence of tiny (<0.1%) and small (<0.5%) targets resulting from dynamic UAV flight altitudes.
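The relative-area thresholds quoted in this caption can be applied to a binary ground-truth mask as in the sketch below. It assumes one target per mask and treats "tiny" (<0.1%) and "small" (<0.5%) as disjoint bins; the caption does not state whether the small bin includes the tiny one, and the function names are illustrative only.

```python
def relative_area(mask):
    """Fraction of foreground pixels in a binary mask (rows of 0/1 values)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    foreground = sum(1 for row in mask for value in row if value)
    return foreground / total


def size_category(rel_area, tiny=0.001, small=0.005):
    """Bin a relative area using the caption's 0.1% and 0.5% thresholds.

    Assumption: the bins are disjoint, so "small" means 0.1% <= area < 0.5%.
    """
    if rel_area < tiny:
        return "tiny"
    if rel_area < small:
        return "small"
    return "other"


def make_mask(height, width, foreground_pixels):
    """Build a toy mask with the first `foreground_pixels` pixels set to 1."""
    flat = [1] * foreground_pixels + [0] * (height * width - foreground_pixels)
    return [flat[r * width:(r + 1) * width] for r in range(height)]
```

For a 100 × 100 mask, 5 foreground pixels give a relative area of 0.05% (tiny) and 30 pixels give 0.3% (small).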
Figure 5. Qualitative visual comparison of seven state-of-the-art models on the proposed CPD-UAV dataset. The results demonstrate that under extreme scale variation and complex background clutter, most models exhibit unsatisfactory performance in small target detection, with inaccurate boundary localization and a certain degree of false and missed detections.
Figure 6. Architecture of the proposed Residual Gated Alignment Module (RGAM). RGAM performs channel alignment, spatial upsampling, residual gated feature enhancement, and multi-scale feature fusion for small object detection.
Figure 7. Qualitative comparison of detection results among the baseline SINet, SINet-Light, and SINet+RGAM. Local zoomed-in patches (red boxes) are embedded for minute targets, demonstrating RGAM’s capability in recovering weak target structures and reducing missed detections.
Figure 8. Qualitative comparison of detection results among the baseline BGNet, BGNet-Light, and BGNet+RGAM. Local zoomed-in patches (red boxes) are embedded for minute targets to illustrate RGAM’s capability in refining boundaries and suppressing background noise. Note that the patch is omitted in the fourth row of BGNet-Light due to a complete missed detection (blank mask).
Figure 9. Qualitative comparison of detection results among the baseline PRNet, PRNet-Light, and PRNet+RGAM. Local zoomed-in patches (red boxes) are applied to minute targets to highlight boundary improvements. These patches are omitted for the relatively large target in the third row, where macroscopic visualization is sufficient.
Table 1. Comparison of Mainstream COD (camouflaged object detection) Datasets and CPD-UAV.
| Dataset | Year | Scale (Images) | Target Type | Perspective | Annotation | Key Characteristics |
|---|---|---|---|---|---|---|
| CHAMELEON | 2018 | 76 | Natural Animals | Horizontal | Pixel-level Mask | Early small-scale validation |
| CAMO | 2019 | 1250 | Animals & Artificial Objects | Horizontal | Pixel-level Mask | Natural and artificial hybrid |
| COD10K | 2021 | 10,000 | Various Natural Objects | Horizontal/Close-up | Pixel & Matting-level | Largest scale, multi-category |
| NC4K | 2021 | 4121 | Various Natural Objects | Diverse Horizontal | Pixel-level Mask | Largest specialized test set |
| ACD1K | 2024 | 1078 | Artificial Camouflage | Diverse Horizontal | Pixel-level Mask & BBox | Advanced artificial camouflage |
| CPD-UAV (Ours) | 2026 | 1061 | Visually Blended Individuals | UAV (Aerial) | Pixel-level Mask | UAV perspective, extreme scale variation |
Table 2. Quantitative baseline evaluation of 7 state-of-the-art models on the CPD-UAV dataset. All models were re-trained and tested on CPD-UAV under identical experimental settings to ensure a fair comparison. ↑ denotes higher values are better, and ↓ denotes lower values are better.
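| Model | Year | Sm | Em | Fm | MAE ↓ | GFLOPs ↓ | Params (M) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|---|
| SINet | 2020 | 0.827 | 0.817 | 0.674 | 0.020 | 27.31 | 48.95 | 31.56 |
| C2FNet | 2021 | 0.937 | 0.973 | 0.874 | 0.002 | 18.19 | 25.21 | 25.52 |
| ZoomNet | 2022 | 0.919 | 0.945 | 0.856 | 0.003 | 101.80 | 32.38 | 20.51 |
| BGNet | 2022 | 0.739 | 0.922 | 0.795 | 0.030 | 58.50 | 77.80 | 41.04 |
| PRNet | 2024 | 0.689 | 0.891 | 0.745 | 0.044 | 21.02 | 14.12 | 26.00 |
| SDRNet | 2024 | 0.932 | 0.972 | 0.865 | 0.001 | 106.26 | 126.04 | 14.17 |
| SAM2-UNet | 2024 | 0.934 | 0.980 | 0.867 | 0.002 | 159.52 | 216.40 | 22.27 |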
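For readers unfamiliar with the MAE column, the sketch below computes the standard mean absolute error between a predicted map with values in [0, 1] and a binary ground-truth mask. This is the usual definition in the camouflaged object detection literature; the evaluation code actually used for Table 2 is not shown in this section, so the function should be read as an illustration rather than the authors' implementation.

```python
def mean_absolute_error(pred, gt):
    """Average per-pixel |prediction - ground truth| over the whole image.

    `pred` holds values in [0, 1]; `gt` holds 0/1 labels. Both are lists
    of rows with identical shapes.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("prediction and ground truth must have the same shape")
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one pixel")
    return total / count


# Toy 2 x 2 example: only one pixel is off, by 0.5.
example_pred = [[0.0, 1.0], [0.5, 0.0]]
example_gt = [[0, 1], [1, 0]]
example_mae = mean_absolute_error(example_pred, example_gt)  # 0.125
```

Because MAE averages over every pixel, it is dominated by the background when targets are tiny: an all-zero prediction on a 100 × 100 mask with 5 foreground pixels still scores 0.0005. This is one reason to read the MAE column alongside Sm, Em, and Fm rather than in isolation.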
Table 3. Quantitative ablation study of the proposed RGAM module on three representative state-of-the-art models. The best results in each comparison group are highlighted in bold. ↑ denotes higher values are better, and ↓ denotes lower values are better.
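| Model | Sm | Em | Fm | MAE ↓ | GFLOPs ↓ | Params (M) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| SINet | 0.827 | 0.817 | 0.674 | 0.020 | 27.31 | 48.95 | 31.56 |
| SINet-Light | 0.804 | 0.869 | 0.640 | 0.008 | 33.93 | 46.36 | 46.67 |
| SINet-RGAM | 0.828 | 0.880 | 0.671 | 0.004 | 22.45 | 46.43 | 72.55 |
| BGNet | 0.739 | 0.922 | 0.795 | 0.030 | 58.50 | 77.80 | 41.04 |
| BGNet-Light | 0.909 | 0.934 | 0.827 | 0.002 | 48.07 | 76.18 | 36.41 |
| BGNet-RGAM | 0.930 | 0.959 | 0.870 | 0.001 | 48.07 | 76.26 | 48.09 |
| PRNet | 0.689 | 0.891 | 0.745 | 0.044 | 21.02 | 14.12 | 26.00 |
| PRNet-Light | 0.715 | 0.894 | 0.732 | 0.040 | 15.87 | 12.05 | 16.50 |
| PRNet-RGAM | 0.750 | 0.912 | 0.768 | 0.036 | 16.33 | 12.24 | 26.20 |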
Table 4. Quantitative cross-dataset validation on the COD10K testing set. Models were trained on 4040 public COD images using the same hyperparameter settings as the main experiments. ↑ denotes higher values are better, and ↓ denotes lower values are better. The best results in each comparison group are highlighted in bold.
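| Model | Sm | Em | Fm | MAE ↓ |
|---|---|---|---|---|
| SINet | 0.607 | 0.594 | 0.358 | 0.095 |
| SINet+RGAM | 0.628 | 0.643 | 0.388 | 0.121 |
| BGNet | 0.781 | 0.819 | 0.663 | 0.043 |
| BGNet+RGAM | 0.783 | 0.839 | 0.672 | 0.042 |
| PRNet | 0.868 | 0.931 | 0.813 | 0.024 |
| PRNet+RGAM | 0.870 | 0.934 | 0.811 | 0.023 |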
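Since the columns of Table 4 mix metrics where higher is better with one where lower is better, a direction-aware comparison helps when reading the rows. The sketch below uses the values copied from Table 4 and assumes that Sm, Em, and Fm are higher-is-better, as is conventional for these measures; the helper names are illustrative.

```python
# Direction of each metric: True means a higher value is better.
HIGHER_IS_BETTER = {"Sm": True, "Em": True, "Fm": True, "MAE": False}

# (baseline, +RGAM) rows copied from Table 4.
TABLE4 = {
    "SINet": ({"Sm": 0.607, "Em": 0.594, "Fm": 0.358, "MAE": 0.095},
              {"Sm": 0.628, "Em": 0.643, "Fm": 0.388, "MAE": 0.121}),
    "BGNet": ({"Sm": 0.781, "Em": 0.819, "Fm": 0.663, "MAE": 0.043},
              {"Sm": 0.783, "Em": 0.839, "Fm": 0.672, "MAE": 0.042}),
    "PRNet": ({"Sm": 0.868, "Em": 0.931, "Fm": 0.813, "MAE": 0.024},
              {"Sm": 0.870, "Em": 0.934, "Fm": 0.811, "MAE": 0.023}),
}


def is_improved(metric, baseline, candidate):
    """True if `candidate` is strictly better than `baseline` for `metric`."""
    if HIGHER_IS_BETTER[metric]:
        return candidate > baseline
    return candidate < baseline


def regressions(table):
    """Map each model to the metrics on which the +RGAM variant is not better."""
    return {
        model: [m for m in HIGHER_IS_BETTER
                if not is_improved(m, base[m], rgam[m])]
        for model, (base, rgam) in table.items()
    }


summary = regressions(TABLE4)
```

On these numbers, the RGAM variants are better on every metric except MAE for SINet (0.121 versus 0.095) and Fm for PRNet (0.811 versus 0.813).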
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, X.; Kang, W.; Peng, Y.; Tang, W.; Li, Q.; Hao, H.; Hou, L.; Ying, X. CPD-UAV: A Benchmark Dataset for Detecting Personnel Visually Blended with the Environment Under UAV Perspective. Drones 2026, 10, 447. https://doi.org/10.3390/drones10060447

Chicago/Turabian Style

Zhang, Xuekai, Wenchao Kang, Yueping Peng, Wei Tang, Qilong Li, Hexiang Hao, Liming Hou, and Xin Ying. 2026. "CPD-UAV: A Benchmark Dataset for Detecting Personnel Visually Blended with the Environment Under UAV Perspective" Drones 10, no. 6: 447. https://doi.org/10.3390/drones10060447
