A Novel Object Detection-Based Air-to-Ground Target Search and Localization Strategy

Li, Haoran; Zhang, Qinling; Zhen, Mi

doi:10.3390/drones10050375


by Haoran Li ¹,*, Qinling Zhang ¹ and Mi Zhen ²

¹ Research Institute of Unmanned Aerial Vehicle, Beihang University, Beijing 100191, China

² Aerospace Science and Industry Intelligent Operations Research and Information Security Research Institute (Wuhan), Wuhan 430040, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 375; https://doi.org/10.3390/drones10050375

Submission received: 6 March 2026 / Revised: 30 April 2026 / Accepted: 4 May 2026 / Published: 13 May 2026


Highlights

What are the main findings?

The RepViT-enhanced detection model (RepViT-M1.5-YOLOv4-CBAM) achieves an mAP_0.5 of 98.58% at 18.70 FPS on a custom emergency rescue dataset, improving real-time detection speed by 2.0 FPS over the standard YOLOv4 baseline, while the CBAM attention module enhances focus on discriminative target regions in complex backgrounds and occluded scenarios.
A tiered search localization strategy is proposed that assigns rescue personnel, supply boxes, and vehicles to priority tiers and applies the nearest-neighbor principle to plan the UAV search route; four diverse simulation scenarios confirm its correctness and robustness.

What are the implications of the main findings?

The lightweight and accurate RepViT-enhanced model enables UAVs to perform reliable real-time ground target detection on resource-constrained onboard microcomputers (22.76 M parameters, 22.46 G FLOPs), making it well-suited for deployment in time-critical applications such as emergency rescue and material delivery.
The tiered search strategy provides a systematic and priority-aware framework for autonomous UAV operations in complex environments, offering practical guidance for future multi-target search and localization missions while identifying dynamic re-routing as a key direction for future work.

Abstract

The ability of uncrewed aerial vehicles (UAVs) to hover, recognize, and localize ground targets is crucial for efficient and accurate intelligent low-altitude operations, such as material delivery, emergency rescue, and firefighting. This paper presents a technical solution for low-altitude UAV target recognition and search localization. The core algorithm is a RepViT-enhanced detection model, which integrates the Re-Parameterization Vision Transformer (RepViT) lightweight neural network with an efficient object detection framework, further augmented by the Convolutional Block Attention Module (CBAM) to improve detection accuracy. The search localization strategy implements a tiered approach for exploring nearby areas from the current position, assigning targets to priority tiers and visiting them in order of priority. Experimental results demonstrate that the RepViT-enhanced model achieves a mean average precision (mAP) of 98.58% on a custom emergency rescue dataset, improving real-time detection speed by two frames per second (18.70 FPS vs. 16.70 FPS for the standard YOLOv4 baseline). Thus, the proposed method effectively enhances both detection accuracy and speed, enabling better target search and localization in complex environments. The search strategy was validated through simulations, confirming its feasibility.

Keywords: RepViT-enhanced; UAV target detection; edge-deployable; target search; emergency rescue

1. Introduction

Unmanned aerial vehicles (UAVs) have been widely used in industries, such as aerial photography [1], urban management, agriculture [2], geology [3], meteorology, and electricity production. In recent years, the demand for high autonomy, augmented performance, and networked systems utilization has accelerated the development of UAV technology. In certain application fields, such as logistics, emergency rescue, public safety, and military, UAVs are frequently employed for hovering flights, close-to-ground delivery, deployment, and recovery operations. However, several challenges are faced by UAVs during image data capturing, including small target pixels, target occlusion, and substantial variations in target sizes [4]. These technological difficulties hinder the precise, fast, and reliable target recognition and localization on the ground.

In recent years, deep learning has been widely applied to object detection, achieving good results. The object detection algorithms in this field are divided into two main categories: single- and two-stage algorithms. Typical two-stage algorithms include R-CNN [5], SPP-Net [6], and Faster R-CNN [7]. Despite their high detection accuracy, these algorithms tend to be slower and often fail to meet real-time detection requirements. Single-stage object detection algorithms are more commonly applied in real-time detection scenarios, among which the YOLO series [8] laid the foundation for the advancement of object detection algorithms. Among the YOLO series, YOLOv4 strikes a balance between speed and accuracy, with an architecture comprising three cooperating modules: the CSPDarknet53 backbone for multi-scale feature extraction, the SPP module for receptive field aggregation, and the PANet neck for multi-scale feature fusion. The architectural details and our proposed modifications to this baseline are described in Section 2.1.

Further research has continued to improve the detection speed, accuracy, and ease of use of YOLOv4-based algorithms. Starting with Scaled YOLOv4 [9], all official YOLO models have fine-tuned the trade-off between speed and accuracy, providing different model scales to suit specific applications and hardware requirements. These versions commonly provide lightweight models optimized for edge devices, effectively reducing computational complexity and speeding up processing time. For example, in [10], the lightweight MobileNetv3 network was introduced into the backbone network of YOLOv4, while the K-means++ clustering algorithm was used to regenerate the initial anchor boxes, reducing model complexity and improving detection accuracy. Additionally, in [11], multi-scale convolutions were introduced into the feature extraction network MSDarkNet-53, and the CBAM attention mechanism was added between convolution modules, effectively improving small object detection accuracy. However, the problem of missing heavily occluded targets remains unsolved. In [12], the CBAM attention mechanism and small object detection layers were added to the original YOLOv5 framework, enhancing the ability to capture objects. Ultralytics has successively released the PyTorch-based YOLOv5 and its lightweight improved version, YOLOv8, with the latter optimizing computational efficiency through the C2f module. In 2024, the newly launched YOLOv11 significantly enhanced resource utilization through architectural simplification. Additionally, the original creator of YOLOv4 introduced the YOLOv7 and YOLOv9 models. YOLOv7 incorporates an Extended Efficient Layer Aggregation Network (E-ELAN) structure to strengthen feature extraction capabilities, while YOLOv9 combines the strengths of CNNs and Transformers, markedly improving long-range target recognition performance.
Nevertheless, there remains a limited selection of highly efficient object detection algorithms currently deployable on edge devices. Although YOLOv8 and YOLOv11 offer improved accuracy on standard benchmarks, they introduce higher computational costs that exceed the constraints of the lightweight microcomputers typically embedded in UAV platforms (e.g., Jetson Nano-class devices with limited GPU memory). YOLOv4, by contrast, provides a well-established and highly configurable framework that allows backbone replacement without architectural redesign, making it particularly suited for the targeted lightweight modifications explored in this study. Specifically, by replacing the standard CSPDarknet53 backbone of YOLOv4 with the RepViT architecture, we exploit the superior inference efficiency of re-parameterization on mobile hardware while retaining the well-validated SPP and PANet feature fusion pipeline.

This paper presents a novel approach for the automated operation of UAVs, focusing on the application of image processing techniques for the recognition and localization of various targets. In this context, “small targets” refer to objects whose bounding boxes occupy fewer than $32 \times 32$ pixels in the input image, consistent with the definition adopted in MS COCO benchmark evaluations [4]. By classifying targets into distinct categories, the UAV can execute ground operations in a systematic manner, employing a deep search algorithm to optimize its approach. The proposed method finds its primary applications in multimodal logistics [13] and emergency rescue scenarios [14] utilizing life detection radar. Despite advancements in enhancing the accuracy of UAV image detection, several challenges persist. The necessity for UAVs to capture images of ground targets from elevated altitudes while maintaining all-weather and long-range capabilities introduces complications influenced by environmental variables such as lighting conditions, scale, occlusion, shadows, stains, angles, background interference, and various forms of image noise, including sampling, filtering, and compression artifacts. These factors can significantly impair detection accuracy, complicating the achievement of optimal outcomes. Furthermore, enhancing detection speed on onboard microcomputers continues to pose a significant challenge.

To overcome these challenges, this paper introduces a UAV ground target detection and localization method based on a RepViT-enhanced model integrated with the CBAM attention mechanism, evaluated on both a custom emergency rescue dataset and the public PASCAL VOC benchmark. The main contributions of this paper are as follows:

A RepViT-enhanced detection framework is proposed that replaces the standard CSPDarknet53 backbone of YOLOv4 with the lightweight Re-Parameterization Vision Transformer (RepViT-M1.5), while retaining the SPP and PANet feature fusion modules. This design delivers competitive detection accuracy (mAP of 98.58% on the custom dataset) while maintaining edge-deployable inference speed (18.70 FPS on an onboard UAV microcomputer), outperforming MobileNet-series backbones on the same hardware.
The Convolutional Block Attention Module (CBAM) is inserted at the junction between the backbone and neck to enhance spatial and channel-wise feature weighting, improving the model’s ability to focus on small and partially occluded ground targets. A custom data augmentation pipeline—combining mosaic augmentation, Gaussian noise injection, and rotation—is also developed to improve robustness under varying illumination and occlusion conditions.
A tiered target search strategy is devised that prioritizes rescue personnel (Tier 1) over supply boxes (Tier 2) and vehicles (Tier 3), using a nearest-neighbor traversal within each tier. Simulation results confirm the feasibility of this approach for systematic UAV-based search and localization in emergency rescue scenarios.

2. Materials and Methods

This section presents the three core components of the proposed framework. Section 2.1 describes the RepViT-enhanced detection backbone and its integration with SPP and PANet modules. Section 2.2 details the incorporation of the CBAM attention module. Section 2.3 introduces the tiered target search strategy and its nearest-neighbor traversal algorithm.

2.1. RepViT-Enhanced Algorithm

Proposed in 2023, RepViT [15] is a lightweight network structure that represents a further improvement on the existing lightweight network model, MobileNetv3 [16]. By combining CNN and ViT [17], RepViT forms a lightweight model that possesses both the local perceptual ability of CNNs and the global abstraction capability of ViTs. The architecture of RepViT is specifically crafted to ensure optimal performance while accommodating the limitations of computational power and memory inherent in mobile devices. This design not only showcases enhanced performance but also achieves reduced latency when operating on devices with restricted resources. Building on this advantage, the RepViT-enhanced model has been introduced to meet the requirements of lightweight computing platforms utilized in UAV technology.

The baseline YOLOv4 network used in this study follows a three-stage architecture. As shown in Figure 1, it consists of the CSPDarknet53 backbone for feature extraction, the SPP module for multi-scale receptive field aggregation, and the PANet neck for feature pyramid fusion, followed by the detection head. The CSPDarknet53 backbone is built from Resblock_body units, illustrated in Figure 2, which employ residual connections to improve gradient flow while reducing feature redundancy across stages.

The proposed RepViT-enhanced model, illustrated in Figure 3, operates by processing input images and utilizing the RepViT feature extraction network to derive three valid feature layers.

These layers are subsequently fed into the Spatial Pyramid Pooling (SPP) layer and the Path Aggregation Network (PANet) layer to enhance the features further. The Neck part uses PANet [18] for feature aggregation and introduces an SPP [6] structure to enhance the receptive field, achieving a significant increase in accuracy with only a small increase in the computational cost. Ultimately, the Head is employed for the final tasks of classification and localization. The Head component incorporates an anchor-based detection step alongside three tiers of detection resolution. It produces potential bounding boxes within the original image by utilizing the feature maps and subsequently forecasts the classes of the identified objects.

This model demonstrates enhanced performance in the detection of specific objects. Unlike other lightweight models, the RepViT network incorporates the RepViTBlock, which combines a token_mixer [19] and a channel_mixer. The RepViTBlock structure is depicted in Figure 4, showing the two variants for stride = 2 and stride = 1. The study also employed a commonly used reparameterization method for the depthwise (DW) layers, which improves the model’s learning capability during training while reducing the computational and memory overhead linked to skip connections during inference, a distinct benefit for the microcomputers carried by the UAVs used in this research. Specifically, the reparameterization technique converts the multi-branch training-time structure (parallel DW conv, 1 × 1 conv, and identity shortcut) into a single DW convolution at inference time, eliminating the extra memory reads associated with residual connections and reducing inference latency by approximately 10–15% on ARM Cortex-A class processors compared to equivalent non-reparameterized counterparts [15].

The preprocessed input image has a size of 416 × 416 pixels with three channels. It first passes through the Conv2d_BN stem module, which stacks two 3 × 3 convolutions. The number of filters in the first convolution is set to 24, resulting in stage 1, where the feature map size becomes 104 × 104. Next, the features pass through multiple RepViTBlock modules. The RepViTBlock combines token_mixer and channel_mixer and includes depthwise separable convolutions (3 × 3 DW), 1 × 1 convolutions, optional squeeze-and-excitation (SE) modules, and feed-forward networks (FFN). This process spans three stages: stage 2, stage 3, and stage 4, with feature map sizes of 52 × 52, 26 × 26, and 13 × 13, respectively. In each stage, the spatial dimensions are reduced through downsampling. The downsampling layer consists of a DW convolution with stride = 2 and a 1 × 1 pointwise convolution after each RepViTBlock, performing spatial downsampling and channel modulation. Finally, a feed-forward network (FFN) module is added.
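The branch merging described above can be checked numerically. The following is a minimal single-channel sketch, assuming batch normalization has already been folded into each branch (that folding step is omitted here); the helper names `conv2d_same` and `merge_branches` and all numeric values are hypothetical and chosen only for illustration. Because convolution is linear, a 3 × 3 DW kernel, a 1 × 1 DW scalar, and the identity shortcut collapse into one 3 × 3 kernel whose centre tap absorbs the other two branches.

```python
def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(x), len(x[0])
    size = len(k)
    pad = size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(size):
                for b in range(size):
                    ii, jj = i + a - pad, j + b - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * k[a][b]
            out[i][j] = acc
    return out


def merge_branches(k3, k1):
    """Fold a 3x3 DW kernel, a 1x1 DW scalar, and the identity into one 3x3 kernel."""
    merged = [row[:] for row in k3]
    # The 1x1 branch and the identity shortcut both act only on the centre tap.
    merged[1][1] += k1 + 1.0
    return merged


# Hypothetical input and weights for one channel.
x = [[1.0, 2.0, 0.0, -1.0],
     [0.5, -2.0, 3.0, 1.0],
     [4.0, 0.0, 1.5, 2.0],
     [-1.0, 1.0, 0.0, 0.5]]
k3 = [[0.1, -0.2, 0.0],
      [0.3, 0.5, -0.1],
      [0.0, 0.2, 0.4]]
k1 = 0.7

# Training-time structure: three parallel branches summed.
y3 = conv2d_same(x, k3)
multi = [[y3[i][j] + k1 * x[i][j] + x[i][j] for j in range(4)] for i in range(4)]

# Inference-time structure: one merged DW convolution.
merged = merge_branches(k3, k1)
single = conv2d_same(x, merged)

max_diff = max(abs(multi[i][j] - single[i][j]) for i in range(4) for j in range(4))
print(max_diff < 1e-9)
```

The two outputs agree up to floating-point rounding, which is why the merged form can replace the multi-branch form at inference without changing the predictions.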

Once the three valid feature layers have been obtained, they are fed into the SPP and PANet feature fusion modules within the Neck. First, the feature map output by the final stage of the RepViT network is subjected to max-pooling operations with kernel sizes of 5 × 5, 9 × 9, and 13 × 13, resulting in three pooled feature maps. These are then concatenated with the unpooled input feature map and directed into the PANet architecture. The structure incorporates two processes of upsampling and downsampling, facilitating the integration of features. Moreover, the feature maps undergo processing through FPN layers, along with supplementary bottom-up feature fusion layers introduced by PANet, adaptive feature pooling layers, and ultimately the prediction head. Finally, the features are transmitted to the head, which produces the predictions for both target classification and localization.

2.2. CBAM Module

In this study, we integrated the CBAM attention mechanism module into the overall framework. This module combines channel and spatial attention, thereby enhancing the model’s feature extraction capabilities, which in turn leads to a notable improvement in overall performance.

The architecture of the CBAM module [20], illustrated in Figure 5, comprises two key components: the channel attention module (CAM) and the spatial attention module (SAM). For an input feature map of dimensions $H \times W \times C$, the channel attention mechanism first applies global max-pooling and global average pooling over the spatial dimensions, producing two $1 \times 1 \times C$ descriptors. These descriptors are processed through a shared multi-layer perceptron (MLP), and the two outputs are combined through element-wise summation and normalized via a sigmoid function, resulting in a $1 \times 1 \times C$ channel attention weight. This weight is applied to the input feature map through element-wise multiplication, producing the channel-attended feature map.

Subsequently, the spatial attention module applies max-pooling and average pooling along the channel axis of the channel-attended feature map, producing two spatial maps of size $H \times W \times 1$. These maps are concatenated to create an $H \times W \times 2$ feature map that encapsulates spatial information. This concatenated map is then processed through a 7 × 7 convolutional layer and normalized using the sigmoid function, yielding an $H \times W \times 1$ spatial attention weight. This weight is applied to the channel-attended feature map through element-wise multiplication, resulting in the spatially attended feature map.

Ultimately, the sequential integration of the channel attention and spatial attention modules culminates in a comprehensive CBAM module, designed for seamless incorporation into various architectures.
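The sequential channel-then-spatial weighting can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the implementation used in this study: tensors are nested lists laid out as `[C][H][W]`, the weights are random placeholders, biases are omitted, and the function names (`channel_attention`, `spatial_attention`, `cbam`) and the small sizes `C = 4`, `H = W = 5`, `r = 2` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x, w1, w2):
    """x: [C][H][W]; w1: [C//r][C]; w2: [C][C//r]. Returns C weights in (0, 1)."""
    def mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # Global average and max pooling over the spatial dimensions of each channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(x, kernel):
    """x: [C][H][W]; kernel: [2][k][k]. Returns an H x W map of weights in (0, 1)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Average and max pooling along the channel axis at every pixel.
    avg = [[sum(x[ch][i][j] for ch in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(x[ch][i][j] for ch in range(c)) for j in range(w)] for i in range(h)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for plane, fmap in enumerate((avg, mx)):
                for a in range(k):
                    for b in range(k):
                        ii, jj = i + a - pad, j + b - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[plane][a][b] * fmap[ii][jj]
            out[i][j] = sigmoid(acc)
    return out


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention to the result."""
    mc = channel_attention(x, w1, w2)
    xc = [[[mc[ch] * v for v in row] for row in x[ch]] for ch in range(len(x))]
    ms = spatial_attention(xc, kernel)
    return [[[ms[i][j] * xc[ch][i][j] for j in range(len(ms[0]))]
             for i in range(len(ms))] for ch in range(len(x))]


# Hypothetical sizes and random placeholder weights.
random.seed(0)
C, H, W, r = 4, 5, 5, 2
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]
kernel = [[[random.uniform(-0.1, 0.1) for _ in range(7)] for _ in range(7)] for _ in range(2)]
y = cbam(x, w1, w2, kernel)
print(len(y), len(y[0]), len(y[0][0]))
```

The output keeps the input shape, and because both attention weights lie in (0, 1), every output value is a scaled-down copy of the corresponding input value.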

The placement of the CBAM module is displayed in Figure 6. After inputting the valid feature layers derived from the backbone into the neck, three CBAM modules are incorporated at the three detection scales to broaden the receptive field and augment the model’s capacity to focus on relevant target features while suppressing irrelevant background clutter [21]. This channel- and spatial-wise reweighting is particularly beneficial in scenarios with complex backgrounds, severe occlusion, or large intra-class appearance variation, where it helps the model allocate representational capacity toward discriminative target regions. In terms of computational overhead, each CBAM module introduces a modest additional cost: for a feature map of size $H \times W \times C$, the shared MLP of the channel attention adds $2C^{2}/r$ multiplications per pooled descriptor (where r is the reduction ratio, typically 16), and the spatial attention adds $H \times W \times 2 \times 49$ multiplications for the $7 \times 7$ convolution over the two pooled maps. Across the three scales (13 × 13, 26 × 26, 52 × 52), the total additional FLOPs amount to approximately 0.08 G, representing less than 0.4% of the model’s total 22.46 G FLOPs. The corresponding parameter increase is 0.08 M (from 22.76 M to 22.84 M), confirming that CBAM integration does not meaningfully compromise the model’s lightweight characteristics.

2.3. Target Detection-Based Search Localization Strategy

In this paper, the RepViT-enhanced detection results were used to classify and locate rescue personnel, supply boxes, and vehicles. Rescue personnel were assigned to the first tier, supply boxes to the second tier, and vehicles to the third tier. The targets were searched tier by tier in that order, with the drone hovering over each target in sequence and awaiting the operator’s commands, so that the total search time was kept as short as possible.

Let the first tier be denoted as $T_{1}$, the second tier as $T_{2}$, and the third tier as $T_{3}$, and let the UAV’s initial position be $P_{0}$. As shown in Figure 7, the UAV followed a search route where it searched each tier in order. After completing the search for the first-tier targets, the last target of the first tier was taken as the new starting point, and the UAV proceeded to search the next tier’s targets. The nearest-neighbor principle was implemented in the present study, based on which the UAV prioritized searching for the nearest target within each tier.

Starting from the initial position, the UAV first expanded to the closest node in the first tier. It selected the unvisited node with the shortest distance in that tier and updated the distances to the neighboring nodes until all nodes in the first tier were visited. The last visited node was then taken as a new starting point to expand to the second tier’s nodes, and similarly for the third tier. Let the path be the set of visited points, the set Q be the set of unvisited points, and $d_{ij}$ be the Euclidean distance from the current node to other nodes in the current tier. The algorithm steps are as follows:

Step 1: The state was initialized with $path = \emptyset$, and the starting point was removed from the unvisited set Q and added to the path.
Step 2: The distances from the starting point to all nodes in the first tier were calculated, and the node with the shortest distance was determined. Then, this node was removed from the unvisited set Q and added to the path, updating the distances to the neighboring nodes in the first tier.
Step 3: After searching the first tier, the last node visited was taken as the new starting point, and step 2 was repeated to search the second tier, updating the unvisited set Q and the path.
Step 4: After searching the second tier, the last node visited was taken as the new starting point, and step 2 was repeated to search the third tier, updating the unvisited set Q and the path.
Step 5: When all points had been visited, the algorithm ended. The final planned path was stored in the path queue.
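The steps above can be sketched as a short routine. This is a minimal illustration assuming targets are 2D points and distances are Euclidean; the function name `tiered_nearest_neighbor` and the coordinates are hypothetical and do not come from the simulations reported in this paper.

```python
import math


def tiered_nearest_neighbor(start, tiers):
    """Visit every tier in order, always moving to the closest unvisited node
    of the current tier; the last node of a tier starts the next tier."""
    path = [start]
    current = start
    for tier in tiers:
        unvisited = list(tier)  # the set Q restricted to the current tier
        while unvisited:
            nearest = min(unvisited, key=lambda p: math.dist(current, p))
            unvisited.remove(nearest)
            path.append(nearest)
            current = nearest
    return path


# Hypothetical layout: P0 at the origin, then tiers T1, T2, T3.
p0 = (0.0, 0.0)
tiers = [
    [(4.0, 0.0), (1.0, 0.0)],    # T1: rescue personnel
    [(4.0, 3.0), (10.0, 0.0)],   # T2: supply boxes
    [(0.0, 3.0)],                # T3: vehicles
]
route = tiered_nearest_neighbor(p0, tiers)
print(route)
# [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.0, 3.0), (10.0, 0.0), (0.0, 3.0)]
```

Note that the vehicle at (0.0, 3.0) is closer to the start than most other targets but is still visited last, because tier priority overrides distance.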

Computational complexity. For a total of N target nodes partitioned into K tiers, the per-tier nearest-neighbor greedy search has time complexity $O(N_{k}^{2})$, where $N_{k}$ is the number of nodes in tier k. The overall complexity across all tiers is $O(\sum_{k=1}^{K} N_{k}^{2}) \leq O(N^{2})$, which is well within real-time onboard computation budgets for the target counts encountered in typical emergency rescue scenarios ($N \leq 30$). The path computation on an ARM Cortex-A72 processor (as found in the Raspberry Pi 4, a representative UAV companion computer) completes in under 1 ms for $N = 30$, confirming suitability for onboard deployment.

Limitations of the nearest-neighbor strategy. The greedy nearest-neighbor heuristic does not guarantee a globally optimal (shortest) path within each tier, as it may produce suboptimal routes in clustered target configurations. For small N (as in the current application), the deviation from the optimal tour is typically less than 20% [15], which is acceptable given the real-time constraints. However, for larger-scale deployments, replacing the greedy heuristic with a 2-opt local search post-processing step could further reduce the total path length.
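A 2-opt post-processing pass of the kind mentioned above could look like the following sketch. It assumes an open path within one tier whose first node (the tier's entry point) stays fixed, and it recomputes the full path length for each candidate, which is affordable for small N. The function names and the four points are hypothetical; the input order is the one a nearest-neighbor pass produces for these points.

```python
import math


def path_length(path):
    """Total Euclidean length of an open path."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def two_opt_open(path):
    """Reverse sub-segments (keeping the first node fixed) until no reversal
    shortens the path."""
    best = list(path)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(candidate) < path_length(best) - 1e-12:
                    best = candidate
                    improved = True
    return best


# Nearest-neighbor order for a start at (0, 0) and three same-tier targets.
greedy = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, -1.5)]
refined = two_opt_open(greedy)
print(path_length(greedy), refined)
```

In this example the greedy route has length 4.5, and swapping the first two targets shortens it to about 3.91. In the tiered setting, changing the last node of one tier also changes the entry point of the next tier, so later tiers would need to be re-planned after such a refinement.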

Static assumption and dynamic re-routing. The current implementation assumes that all target positions are known prior to mission execution—that is, the RepViT detection model processes all visible targets at mission start, and the path is pre-computed before flight. Consequently, the system does not support dynamic re-routing: if a new high-priority target (e.g., a previously occluded Tier 1 rescue survivor) becomes visible mid-mission, the algorithm has no mechanism to interrupt a lower-tier traversal and redirect the UAV. Addressing this limitation through an online re-planning module—triggered whenever a new target is detected during flight—is identified as an important direction for future work (see Section 4).

3. Results and Discussion

To validate the effectiveness of the proposed RepViT-enhanced model in this study, we established control groups using typical lightweight networks, namely the MobileNetVx-enhanced models. Specifically, we maintained the same Neck and Head structures while replacing only the backbone network with MobileNetVx variants for comparative experiments. Furthermore, this study employs RepViT-M1.5 as the base model, an efficient variant of the re-parameterized vision transformer architecture that offers a favorable balance between computational efficiency and detection accuracy.

3.1. Dataset for Specific Scenarios and Specific Targets

In a self-built simulated emergency rescue scenario in the laboratory, aerial images captured by UAVs contained three target classes: rescue personnel (fireman), red-marked supply boxes (redbox), and vehicles (car). These images underwent a series of preprocessing steps, including the addition of Gaussian noise, mosaic augmentation, and rotation, as shown in Figure 8, Figure 9 and Figure 10. A comprehensive dataset comprising 3500 images was collected [22], simulating various operational conditions from the UAV’s perspective, including variations in shooting altitude, angle, distance, blur intensity, and different times of day. The images were then annotated using a labeling tool, ultimately generating a VOC-format dataset that met the training requirements for the subsequent model. To detect overfitting and evaluate the models on unseen data, the image data were split into training and test sets at a 9:1 ratio (3150 training images and 350 test images), with the training set further divided into training and validation subsets at a 9:1 ratio. All AP, mAP, and FPS values reported in Table 1 are computed on the held-out test set (350 images).
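The nested 9:1 split described above works out to 2835 training, 315 validation, and 350 test images. The following sketch reproduces those counts; the random shuffle, the fixed seed, and the function name `split_dataset` are assumptions made for illustration, since the text does not state how images were assigned to each subset.

```python
import random


def split_dataset(items, test_ratio=0.1, val_ratio=0.1, seed=0):
    """Hold out a test set first, then carve a validation set out of the rest."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    test_set, trainval = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(trainval) * val_ratio)
    val_set, train_set = trainval[:n_val], trainval[n_val:]
    return train_set, val_set, test_set


# Image indices stand in for the 3500 annotated images.
train_set, val_set, test_set = split_dataset(range(3500))
print(len(train_set), len(val_set), len(test_set))  # 2835 315 350
```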

3.2. Ablation Experiments on the Self-Built Dataset

In this paper, the computations were performed on the high-performance computing platform “Shuguang Zhisuan”, which utilizes the DCU (Deep Computing Unit) framework and conducts calculations through the DTK platform. The final evaluation metrics include the average precision (AP) [23] for each class, the mean average precision (mAP) across all classes, and the average detection speed (FPS) when the algorithm is ported to an airborne computer for real-time detection. Here, AP is computed as the area under the precision–recall curve, and precision and recall are given by Equations (1) and (2).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where $TP$ refers to the true positive samples that are correctly classified, $FN$ refers to the false negative samples that are incorrectly classified as negative, and $FP$ refers to the false positive samples that are incorrectly classified as positive. Precision is represented by the proportion of correctly classified positive samples out of all the samples that the classifier considers as positive, while recall represents the proportion of correctly classified positive samples out of all the actual positive samples. The mAP formula is given by Equation (3), where AP is the average precision for each class, and N is the total number of target categories.

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_{i} \tag{3}$$
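Equations (1) to (3) can be illustrated with a small worked example. The sketch below assumes that each detection has already been matched to the ground truth (for instance at an IoU threshold of 0.5) and uses an all-point precision envelope to integrate the precision–recall curve; the text does not specify the interpolation scheme, so that choice, the function names, and all numbers are illustrative assumptions.

```python
def precision_recall(tp, fp, fn):
    """Equations (1) and (2), guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))


def mean_average_precision(ap_per_class):
    """Equation (3): the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical detections for one class with four ground-truth objects.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
ap = average_precision(detections, num_gt=4)
print(precision_recall(tp=8, fp=2, fn=2))  # (0.8, 0.8)
print(ap)                                  # 0.625
print(mean_average_precision({"fireman": 1.0, "redbox": 0.5, "car": 0.75}))  # 0.75
```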

To control variables and evaluate the performance of the different models fairly, the same hyperparameters were set for each model. The initial learning rate was set to 0.01, label_smoothing was set to 0.05, and the weight decay of the SGD optimizer was set to 0.0005. Given the constraints imposed by the dataset’s limited size, the number of iterations was set to 300, at which point the model’s state achieved stability. Additionally, the batch size for each training iteration was set to 16. Under these conditions, training was conducted with the following network architectures: RepViT-M1.5-YOLOv4-CBAM, RepViT-M1.5-YOLOv4, YOLOv4, MobilenetV3-YOLOv4, MobilenetV2-YOLOv4, MobilenetV1-YOLOv4, and YOLOv5s. The resulting detection accuracy (AP), mean average precision (mAP), and real-time detection FPS in emergency rescue scenarios containing various target types are compared in Table 1.

The Precision–Recall curves for RepViT-M1.5-YOLOv4-CBAM and RepViT-M1.5-YOLOv4 are presented in Figure 11. Due to the limitations of the self-built dataset, which has a limited number of scenarios and categories, the object detection task became relatively simple. As a result, the detection accuracy exceeded 95% across all trained models.

It can be observed that RepViT-M1.5-YOLOv4-CBAM delivers the best overall mAP_0.5 (98.58%) among the lightweight YOLOv4-based variants tested, with an inference speed of 18.70 FPS on the airborne computer. It is worth noting that the full YOLOv4 (98.85% mAP, 16.70 FPS) and YOLOv5s (99.58% mAP, 17.65 FPS) achieve slightly higher mAP on this custom dataset. However, the full YOLOv4 is significantly larger (64.36 M parameters, 60.53 G FLOPs), while YOLOv5s (7.28 M parameters, 17.16 G FLOPs) belongs to a different architecture family and scale and runs more slowly on the airborne computer (17.65 FPS vs. 18.70 FPS). The saturation of detection accuracy above 95% across all models indicates that this three-class custom dataset is relatively simple, likely because the target classes (fireman in uniform, red supply boxes, vehicles) are visually distinctive. In this regime, the accuracy difference between models is small, and the key differentiators become parameter efficiency and inference latency on resource-constrained UAV hardware, where the RepViT backbone compares favorably with the full YOLOv4 baseline. In comparison, MobilenetV3-YOLOv4 achieves the highest FPS (21.29) but at a lower mAP (96.85%), making it more suitable for latency-critical applications where some accuracy trade-off is acceptable.

To address the need for detecting different targets under various weather conditions and levels of clarity, we conducted qualitative tests under different time periods and varying levels of occlusion, including night-time, dawn/dusk, and full-occlusion conditions. The test results are displayed in Figure 12, demonstrating that the model successfully detects all three target classes with high confidence (above 0.7) under all tested conditions, confirming robustness to lighting variation and partial occlusion.

3.3. Ablation Experiments of Target Detection Algorithms on Public Datasets

The results of the above experiments reveal the model’s performance on a dataset with a small number of target categories. The uniformly high mAP values (>95%) suggest that the custom dataset may be close to performance saturation for all compared models, making it difficult to discriminate architectural differences. To evaluate backbone representational capacity and compare model complexity in a more challenging, domain-neutral setting, we used a combination of the PASCAL VOC 2007 and 2012 public datasets [24], totaling 21,504 images and 20 categories.

It is important to clarify the purpose and scope of this experiment. The goal is not to evaluate the model’s performance in emergency rescue scenarios with VOC categories, but rather to use PASCAL VOC as a standard architectural stress test for comparing backbone networks under identical conditions. This methodology is well-established in the object detection literature [9,15]: because the custom three-class dataset is near saturation, it cannot reveal meaningful differences between backbone architectures. PASCAL VOC, with its 20 semantically diverse categories encompassing varied object shapes, textures, scales, and aspect ratios, provides the detection complexity needed to expose these architectural differences. In this context, the experiment directly answers the question: “which backbone (RepViT, MobileNet variants, or CSPDarknet53) provides the best trade-off between parameter count, computational cost, and feature representational capacity?” The results should be interpreted as an architecture comparison, not as a claim that the proposed model is suited for detecting VOC categories in UAV applications.

During the training process, the batch size was set to 16, and the confidence threshold was set to 0.5. Under these conditions, RepViT-M1.5-YOLOv4, RepViT-M1.5-YOLOv4-CBAM, YOLOv4, MobilenetV3-YOLOv4, MobilenetV2-YOLOv4, MobilenetV1-YOLOv4, and YOLOv5s models were compared (Table 2). The evaluation metrics included mAP, the number of parameters (Params), and model computational complexity (FLOPs). The formulas for calculating Params and FLOPs [25] are presented in Equations (4) and (5), where H and W are the height and width of the input feature map, $C_{in}$ is the number of input channels, K is the size of the convolutional kernel, and $C_{out}$ is the number of output channels.

Params = K^{2} \times C_{i n} \times C_{o u t} + C_{o u t}

(4)

FLOPs = 2 \times H \times W \times K^{2} \times C_{i n} \times C_{o u t},

(5)
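As a quick sanity check on Equations (4) and (5), the following minimal sketch evaluates both formulas for a single convolutional layer. The layer dimensions (a 3 × 3 kernel, 64 input channels, 128 output channels, and a 52 × 52 feature map) and the function names are illustrative assumptions, not values or code from the paper.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Equation (4): K^2 * C_in * C_out weights plus C_out bias terms."""
    return k * k * c_in * c_out + c_out


def conv_flops(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """Equation (5): 2 * H * W * K^2 * C_in * C_out."""
    return 2 * h * w * k * k * c_in * c_out


# Hypothetical layer, chosen only for illustration.
params = conv_params(k=3, c_in=64, c_out=128)
flops = conv_flops(h=52, w=52, k=3, c_in=64, c_out=128)

print(f"Params: {params:,}")            # 73,856
print(f"FLOPs:  {flops / 1e9:.3f} G")   # 0.399 G
```

Summing these per-layer values over every convolution in a network gives totals of the kind reported in Table 2; layers of other types would need their own counting rules.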

The experimental results demonstrate that the RepViT-based models achieve competitive accuracy on the PASCAL VOC benchmark. The RepViT-M1.5-YOLOv4 model attains an mAP_0.5 of 89.99% with only 22.76 M parameters and 22.46 G FLOPs, outperforming all MobileNet-based variants (79.01–80.12%) by a substantial margin while using considerably fewer parameters than the full YOLOv4 (64.36 M parameters, 60.53 G FLOPs, 92.25% mAP).

It is noteworthy that RepViT-M1.5-YOLOv4-CBAM (89.28%) slightly underperforms its non-CBAM counterpart RepViT-M1.5-YOLOv4 (89.99%) on the VOC dataset. This appears to contradict the benefit of CBAM shown on the custom dataset, but the discrepancy can be attributed to two factors. First, the CBAM module was tuned and validated in the context of our three-class emergency rescue dataset; on the 20-class VOC dataset, with its larger variety of object shapes, textures, and scales, the learned spatial attention weights may not generalize as effectively and may suppress relevant features. Second, the difference (0.71 percentage points) is small enough that it may fall within run-to-run training variance and may not be statistically significant. Future work should investigate CBAM hyperparameter tuning (e.g., reduction ratio r, kernel size) on multi-class general datasets to improve cross-domain transferability.

The reported performance drop from 98.58% (custom dataset) to 89.28–89.99% (VOC) reflects the expected gap between a purpose-built, visually distinctive three-class dataset and a diverse 20-class benchmark. This gap indicates that the model’s feature representations are not yet fully domain-agnostic, which is a common limitation of models trained on small, domain-specific datasets. This motivates the future collection of larger and more diverse UAV emergency rescue datasets to improve generalization.

3.4. Deep Search Based on Recognition Results

Targets with different labels are assigned to different search tiers. In this experiment, rescue personnel were assigned to the first search tier (Tier 1: P1, P2, P3), red-marked supply boxes to the second search tier (Tier 2: S1, S2, S3), and vehicles to the third search tier (Tier 3: C1, C2, C3). The UAV sequentially searched and located targets in the first, second, and third tiers according to the nearest-neighbor algorithm described in Section 2.3. The UAV’s onboard camera field of view and target positions were simulated on a canvas of 1280 × 720 pixels, matching a typical UAV camera resolution.
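The tiered nearest-neighbor rule described above can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the function name `plan_tiered_route`, the start point, and all target coordinates are hypothetical, chosen only to fit the 1280 × 720 canvas.

```python
import math

Point = tuple[float, float]


def plan_tiered_route(start: Point, tiers: list[dict[str, Point]]) -> list[str]:
    """Visit tiers in priority order; within a tier, always fly to the
    closest target that has not been visited yet."""
    route: list[str] = []
    current = start
    for tier in tiers:
        remaining = dict(tier)  # copy so the caller's data is untouched
        while remaining:
            name = min(remaining, key=lambda n: math.dist(current, remaining[n]))
            current = remaining.pop(name)
            route.append(name)
    return route


# Hypothetical coordinates in pixels (not taken from Figure 13).
origin = (100.0, 100.0)
tier1 = {"P1": (200.0, 150.0), "P2": (900.0, 600.0), "P3": (400.0, 300.0)}
tier2 = {"S1": (1000.0, 500.0), "S2": (300.0, 600.0), "S3": (700.0, 100.0)}
tier3 = {"C1": (1200.0, 700.0), "C2": (100.0, 650.0), "C3": (640.0, 360.0)}

route = plan_tiered_route(origin, [tier1, tier2, tier3])
print(" -> ".join(["O", *route]))
# O -> P1 -> P3 -> P2 -> S1 -> S3 -> S2 -> C2 -> C3 -> C1
```

Each tier starts from wherever the previous tier ended, so a lower-priority target is never visited early even when it lies directly on the way.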

To systematically evaluate the search strategy across different operational scenarios, four representative scenario types were designed with varying target distributions and tier compositions, as summarized in Table 3. Each scenario was run with a randomly generated starting point O, and the path-planning algorithm was applied.

Scenario 1—Balanced distribution (Figure 13). A random starting point O and 9 target nodes distributed evenly across three tiers were generated. The simulation results are presented in Figure 13, showing the planned route from origin O to the final target $C_{2}$. The UAV correctly visits all three Tier 1 targets first, followed by all Tier 2 targets, and finally all Tier 3 targets, demonstrating that the tiered priority is correctly enforced. Path deviation from the brute-force optimum is less than 10%.

Scenario 2—Priority-skewed distribution. This scenario simulates a mass-casualty event with 5 rescue personnel (Tier 1), 2 supply boxes (Tier 2), and 1 vehicle (Tier 3). Despite the imbalanced tier sizes, the algorithm correctly prioritizes all Tier 1 targets before addressing Tier 2 and Tier 3 and completes the full tour with a path length within 12% of the brute-force optimum. This scenario confirms that the nearest-neighbor heuristic remains effective even when one tier contains a disproportionate share of targets.

Scenario 3—Spatially clustered targets. In this scenario, each tier’s targets are spatially clustered in different regions of the canvas, simulating a rescue area divided into distinct zones (e.g., a collapsed building, a supply depot, and a parking area). Because within-tier targets are geographically close, the nearest-neighbor heuristic performs near-optimally, with path deviation under 8%. Critically, the algorithm does not shortcut into a geographically nearby Tier 2 cluster to avoid backtracking; tier priority is strictly respected even at the cost of a longer inter-tier transition.

Scenario 4—Sparse wide-area search. Targets are sparse and distributed near the edges of a large search canvas, representing an open-field rescue scenario. This is the most challenging configuration for the nearest-neighbor heuristic, as long inter-node distances amplify suboptimality. Path deviation reaches up to 15%, which identifies a boundary condition for the current algorithm. For such wide-area deployments, integrating a 2-opt post-processing step or a lawnmower pre-survey pattern to reduce initial uncertainty would be beneficial, as identified in Section 4.

To further validate consistency, ten additional random simulations were conducted for each scenario type ($N_{k} \in [1, 5]$ targets per tier). In all 40 runs, the algorithm correctly maintained the priority ordering (Tier 1 → Tier 2 → Tier 3) and produced path lengths within the reported deviation bounds.
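One way to reproduce this kind of check is sketched below. It assumes that path deviation means the relative excess of the greedy path over the shortest open path through one tier, measured from the point where the UAV enters that tier; the paper does not spell out the exact definition, so the helper names and this interpretation are assumptions. Brute force is feasible here because each tier holds at most five targets (5! = 120 orderings).

```python
import itertools
import math
import random


def nearest_neighbor(start, targets):
    """Greedy visiting order: always move to the closest unvisited target."""
    order, current, remaining = [], start, list(targets)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        order.append(nxt)
        current = nxt
    return order


def path_length(start, order):
    """Length of the open path start -> order[0] -> ... -> order[-1]."""
    pts = [start, *order]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def tier_deviation(start, targets):
    """Relative excess of the greedy path over the brute-force optimum."""
    greedy = path_length(start, nearest_neighbor(start, targets))
    best = min(path_length(start, perm) for perm in itertools.permutations(targets))
    return (greedy - best) / best if best > 0 else 0.0


# One random run on a 1280 x 720 canvas with N_k in [1, 5] targets per tier.
rng = random.Random(0)  # fixed seed so the run is repeatable
current = (rng.uniform(0, 1280), rng.uniform(0, 720))
deviations = []
for _ in range(3):
    n_k = rng.randint(1, 5)
    targets = [(rng.uniform(0, 1280), rng.uniform(0, 720)) for _ in range(n_k)]
    deviations.append(tier_deviation(current, targets))
    current = nearest_neighbor(current, targets)[-1]  # next tier starts here

print([f"{d:.1%}" for d in deviations])
```

Because tiers are processed strictly in sequence, the priority ordering holds by construction in this sketch; only the within-tier path length can deviate from the optimum.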

Limitations and failure cases. The current simulation assumes that all target positions are known with perfect accuracy from the detection step. In practice, two failure scenarios can degrade system performance: (1) False detections: If the RepViT model misclassifies a target class (e.g., assigns a vehicle to Tier 1), the UAV would waste time visiting it before genuine Tier 1 targets. On our test set, the fireman class AP was 98.74%, suggesting that misclassification is infrequent but possible for heavily occluded targets. (2) Missed detections: If a target is not detected (a false negative, FN), it will not be included in the search plan. For the redbox class, AP = 97.96%; AP is not a direct miss rate, but this value suggests that a small fraction of supply box instances (on the order of 2%) may be missed. These failure cases highlight the importance of improving detection reliability, particularly for small and occluded targets, before operational deployment.

3.5. Discussion

Comparison with related work. The proposed RepViT-M1.5-YOLOv4-CBAM model achieves 98.58% mAP on the custom emergency rescue dataset at 18.70 FPS on an onboard UAV microcomputer, providing a favorable accuracy–efficiency trade-off compared to prior lightweight UAV detection frameworks. For instance, MobileNetv3-based detectors [10] achieve lower accuracy (96.85% mAP in our comparisons) despite faster inference (21.29 FPS), while multi-scale attention-augmented frameworks like those in [11,12] achieve higher accuracy on their respective datasets but at the cost of greater model complexity. On the PASCAL VOC generalization benchmark, our RepViT-M1.5-YOLOv4 achieves 89.99% mAP with only 22.76 M parameters—substantially more parameter-efficient than the full YOLOv4 backbone (64.36 M parameters, 92.25% mAP). These results confirm that the RepViT backbone offers a better accuracy-per-parameter ratio than standard CSP-based designs for edge-deployable UAV applications.
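The accuracy-per-parameter comparison can be made concrete with a short calculation. The sketch below uses the mAP_0.5 and parameter counts reported in Table 2 for the two models named above; the ratio itself (mAP points per million parameters) is an illustrative metric chosen here, not one defined in the paper.

```python
# Values copied from Table 2 (PASCAL VOC): (mAP_0.5 in %, parameters in millions).
table2 = {
    "RepViT-M1.5-YOLOv4": (89.99, 22.76),
    "YOLOv4": (92.25, 64.36),
}


def map_per_million_params(map_pct: float, params_m: float) -> float:
    """Crude efficiency indicator: mAP points per million parameters."""
    return map_pct / params_m


ratios = {name: map_per_million_params(*row) for name, row in table2.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} mAP points per million parameters")
```

A ratio of this kind inherently favors small models, so it is best read alongside the absolute mAP values rather than in isolation.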

Challenges and limitations. Several limitations of the current approach warrant acknowledgment. First, the custom dataset consists of only 3500 images with three visually distinctive target classes captured in a laboratory setting, which may not reflect the full diversity of real-world rescue environments (e.g., outdoor rubble, smoke, variable lighting). The uniformly high mAP (>95%) across all models on this dataset suggests performance saturation, and evaluation on more challenging public UAV datasets (e.g., VisDrone, HERIDAL) would provide a more rigorous assessment. Second, the tiered search strategy operates under a static-environment assumption: all target positions must be known in advance, and the system does not support dynamic re-routing when new targets become visible during flight. Third, the CBAM modules show limited benefit on the general-domain VOC benchmark, suggesting that their configuration (reduction ratio, kernel size) may need domain-specific tuning for multi-class generalization.

Guidelines for future research. Based on these findings, three primary directions are recommended: (1) collection of a larger, outdoor UAV emergency rescue dataset with greater class diversity and challenging conditions (low altitude, motion blur, cluttered backgrounds); (2) development of an online re-planning module that integrates the detection stream with the path planner to support dynamic target discovery and mission re-routing; and (3) evaluation of more sophisticated path optimization algorithms (e.g., 2-opt TSP, reinforcement learning-based planners) to reduce tour length compared to the greedy nearest-neighbor baseline.
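As an illustration of direction (3), a 2-opt refinement of a greedy route might look like the sketch below. It is a generic textbook-style 2-opt for an open path with a fixed start, not a method evaluated in this paper; the assumption is that it would be applied to each tier separately so that the tier priority ordering is preserved. The function names and test points are hypothetical.

```python
import math


def path_length(start, order):
    """Length of the open path start -> order[0] -> ... -> order[-1]."""
    pts = [start, *order]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def two_opt(start, order):
    """Reverse sub-segments of the visiting order for as long as a
    reversal shortens the path; stops at a local optimum."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(start, candidate) < path_length(start, best) - 1e-12:
                    best = candidate
                    improved = True
    return best


# Hypothetical within-tier order that backtracks across the unit square.
start = (0.0, 0.0)
initial_order = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
refined_order = two_opt(start, initial_order)

print(path_length(start, initial_order))   # about 3.83
print(path_length(start, refined_order))   # 3.0
```

2-opt only guarantees a local optimum, so the refined path can still be longer than the brute-force optimum for some layouts.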

4. Conclusions

This paper proposes a RepViT-enhanced YOLOv4 detection framework integrated with the CBAM attention mechanism for UAV-based ground target detection and tiered search localization. The key findings are as follows. On a custom three-class emergency rescue dataset (3500 images), the RepViT-M1.5-YOLOv4-CBAM model achieved an mAP_0.5 of 98.58% at 18.70 FPS on an onboard UAV microcomputer, outperforming all MobileNet-based lightweight variants (mAP: 96.25–96.85%) while improving inference speed by 2.0 FPS over the standard YOLOv4 baseline (16.70 FPS). On the PASCAL VOC 2007+2012 generalization benchmark (21,504 images, 20 classes), RepViT-M1.5-YOLOv4 achieved 89.99% mAP with only 22.76 M parameters and 22.46 G FLOPs, demonstrating substantially better parameter efficiency than YOLOv4 (64.36 M parameters, 92.25% mAP). The proposed tiered nearest-neighbor search strategy correctly enforces target priority ordering (rescue personnel → supply boxes → vehicles) and produces paths within 15% of the brute-force optimum for small target counts, with path computation completing in under 1 ms on ARM hardware.

Key limitations of the current work include the static-environment assumption of the search strategy (no dynamic re-routing), the limited diversity of the custom dataset, and the marginal benefit of CBAM on multi-class general benchmarks. Future work will focus on expanding the dataset to outdoor environments, implementing online mission re-planning to support dynamic target discovery, and exploring more advanced path optimization algorithms to reduce total travel distance in larger-scale deployments.

Author Contributions

Conceptualization, M.Z. and H.L.; Methodology, M.Z. and H.L.; Software, M.Z.; Validation, M.Z. and H.L.; Resources, M.Z.; Data curation, M.Z. and Q.Z.; Writing—original draft, M.Z. and H.L.; Project administration, H.L.; Funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China under grant No. 2023YFC3011503.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Mi Zhen was employed by the company Aerospace Science and Industry Intelligent Operations Research and Information Security Research Institute (Wuhan). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Kurniadi, F.A.; Setianingsih, C.; Syaputra, R.E. Innovation in Livestock Surveillance: Applying the YOLO Algorithm to UAV Imagery and Videography. In Proceedings of the International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), Kuala Lumpur, Malaysia, 17–18 October 2023; pp. 246–251.
2. Zheng, L.; Ai, P.; Wu, Y. Building Recognition of UAV Remote Sensing Images by Deep Learning. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1185–1188.
3. Likitvisetpong, M.; Erjongmanee, S.; Suwanagood, E.; Klumpol, C.; Teerataphong, P. System Development for Estimating Geolocation, Direction, and Velocity of Moving Objects in UAV Applications Using Monocular Camera. In Proceedings of the International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Khon Kaen, Thailand, 27–30 May 2024; pp. 1–6.
4. Mokayed, H.; Nayebiastaneh, A.; Alkhaled, L.; Sozos, S.; Hagner, O.; Backe, B. Challenging YOLO and Faster RCNN in Snowy Conditions: UAV Nordic Vehicle Dataset (NVD) as an Example. In Proceedings of the 2024 2nd International Conference on Unmanned Vehicle Systems-Oman (UVS), Muscat, Oman, 12–14 February 2024; pp. 1–6.
5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
6. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
8. Sarda, A.; Dixit, S.; Bhan, A. Object Detection for Autonomous Driving using YOLO [You Only Look Once] algorithm. In Proceedings of the International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 1370–1374.
9. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13024–13033.
10. Wen, Q.; Wang, S.; Li, L.; Li, X.; Liang, Z.; Guo, W.; Guo, X.; Tang, Q.; He, C. Technical Requirements for Autonomous Point Cloud Collection and Autonomous Inspection of Unmanned Aerial Vehicle. In Proceedings of the 2021 IEEE 5th Conference on Energy Internet and Energy System Integration (EI2), Taiyuan, China, 22–24 October 2021; pp. 3421–3424.
11. Zheng, D.; Chen, C. Research on Object Detection Algorithm Based on Deep Learning. In Proceedings of the International Conference on Electronic Communication and Artificial Intelligence (ICECAI), Shanghai, China, 4–5 July 2024; pp. 725–728.
12. Wang, Q.; Sheng, J.; Tong, C.; Wang, Z.; Song, T.; Wang, M.; Wang, T. A Fast Facet-Based SAR Imaging Model and Target Detection Based on YOLOv5 with CBAM and Another Detection Head. Electronics 2023, 12, 4039.
13. Jiang, S.; Weng, X. Multimodal Hub-Spoke Emergency Logistics Network Design. In Proceedings of the International Conference on Service Systems and Service Management (ICSSSM), Guangzhou, China, 22–24 June 2015; pp. 1–4.
14. Jin, W.; Yang, J.; Fang, Y.; Feng, W. Research on Application and Deployment of UAV in Emergency Response. In Proceedings of the International Conference on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, 17–19 July 2020; pp. 277–280.
15. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN from ViT Perspective. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920.
16. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Wey, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
18. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
19. Yang, Y.; Yu, J.; Fu, Z.; Zhang, K.; Yu, T.; Wang, X.; Jiang, H.; Lv, J.; Huang, Q.; Han, W. Token-Mixer: Bind Image and Text in One Embedding Space for Medical Image Reporting. IEEE Trans. Med. Imaging 2024, 43, 4017–4028.
20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
21. Shi, Y.; Hidaka, A. Attention-YOLOX: Improvement in On-Road Object Detection by Introducing Attention Mechanisms to YOLOX. In Proceedings of the International Symposium on Computing and Artificial Intelligence (ISCAI), Beijing, China, 16–18 December 2022; pp. 5–14.
22. Zhen, M. Small Dataset in the Field of Emergency Rescue [Dataset]. 2025. Available online: https://huggingface.co/datasets/zhenmi/self_dataset/tree/main (accessed on 1 January 2025).
23. Hua, W.; Chen, Q.; Chen, W. A New Lightweight Network for Efficient UAV Object Detection. Sci. Rep. 2024, 14, 13288.
24. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
25. Xiao, Y.; Di, N. SOD-YOLO: A Lightweight Small Object Detection Framework. Sci. Rep. 2024, 14, 25624.

Figure 1. Overall architecture of the YOLOv4 object detection network, consisting of the CSPDarknet53 backbone for feature extraction, the SPP (Spatial Pyramid Pooling) module for multi-scale receptive field aggregation, and the PANet (Path Aggregation Network) neck for multi-scale feature fusion, followed by the detection head for bounding box regression and class prediction.

Figure 2. Architecture of the Resblock_body module used in CSPDarknet53, illustrating the residual connections that enable gradient flow while reducing feature redundancy across stages.

Figure 3. Architecture of the proposed RepViT-YOLOv4 detection model. The standard CSPDarknet53 backbone is replaced by the lightweight RepViT-M1.5 network, which outputs three feature maps at different scales. These are passed through the SPP and PANet neck modules for multi-scale feature fusion, and finally processed by the detection head to predict bounding boxes and class labels in different color frames.

Figure 4. Architecture of the RepViTBlock module, shown for two configurations: stride = 1 (identity branch with token_mixer and channel_mixer) and stride = 2 (downsampling branch). Each block employs depthwise separable convolution (3 × 3 DW), pointwise convolution (1 × 1), optional squeeze-and-excitation (SE) modules, and feed-forward networks (FFN) to achieve efficient local and global feature extraction.

Figure 5. Architecture of the Convolutional Block Attention Module (CBAM). The module sequentially applies (1) a channel attention sub-module, which uses max-pooling and average-pooling followed by a shared MLP to generate channel-wise weights, and (2) a spatial attention sub-module, which applies a 7 × 7 convolution on pooled feature maps to generate spatial weights. Both weights are applied via element-wise multiplication to the input feature map.

Figure 6. Insertion positions of the three CBAM modules within the RepViT-YOLOv4 network. Each module is placed at the interface between the backbone feature layers and the PANet neck, enabling attention-guided feature refinement at three detection scales (13 × 13, 26 × 26, and 52 × 52).

Figure 7. Illustration of the tiered UAV search route planning strategy. Targets are assigned to three priority tiers: Tier 1 (rescue personnel, circles), Tier 2 (supply boxes, squares), and Tier 3 (vehicles, triangles). The UAV starts from position $P_{0}$ and visits all Tier 1 targets using nearest-neighbor traversal before proceeding to Tier 2, and then Tier 3. Arrows indicate the planned flight path.

Figure 8. Example of mosaic data augmentation applied to the custom emergency rescue dataset. Four images are randomly cropped and stitched together, exposing the model to diverse scales, viewpoints, and target densities within a single training sample.

Figure 9. Example of Gaussian noise augmentation applied to training images. Additive noise is injected with varying standard deviations to simulate sensor noise and atmospheric degradation encountered during real UAV flights.

Figure 10. Example of random rotation augmentation applied to training images. Images are rotated by random angles to simulate varied UAV orientations and camera tilt angles during flight.

Figure 11. Precision–recall (PR) curves for the proposed RepViT-M1.5-YOLOv4-CBAM model and its ablation variant RepViT-M1.5-YOLOv4 (without CBAM), evaluated on the custom emergency rescue dataset for three target classes: rescue personnel (fireman), supply boxes (redbox), and vehicles (car). A larger area under the PR curve indicates higher detection accuracy.

Figure 12. Qualitative detection results of the proposed RepViT-M1.5-YOLOv4-CBAM model under challenging conditions, including night-time illumination, dawn/dusk lighting, and varying degrees of target occlusion. Bounding boxes with class labels and confidence scores are shown for rescue personnel (fireman) in red frames, supply boxes (redbox) in green frames, and vehicles (car) in blue frames.

Figure 13. Simulation results of the tiered UAV search strategy on a randomly generated scenario with nine target nodes distributed across three priority tiers. The canvas size is 1280 × 720 pixels, matching the UAV camera resolution. The drone starts from origin O and visits all Tier 1 (rescue personnel) targets before proceeding to Tier 2 (supply boxes) and Tier 3 (vehicles), following the nearest-neighbor rule within each tier. The final destination is labeled $C_{2}$.

Table 1. AP, mAP, and FPS of different network models with a batch size of 16.

Model	fireman_AP (%)	redbox_AP (%)	car_AP (%)	mAP_0.5 (%)	FPS
RepViT-M1.5-YOLOv4-CBAM	98.74	97.96	99.03	98.58	18.70
RepViT-M1.5-YOLOv4	97.16	95.06	99.17	97.13	18.91
YOLOv4	98.06	98.79	99.69	98.85	16.70
MobilenetV3-YOLOv4	97.35	94.71	98.48	96.85	21.29
MobilenetV2-YOLOv4	96.31	92.16	99.88	96.26	18.31
MobilenetV1-YOLOv4	97.34	92.25	99.16	96.25	19.45
YOLOv5s	99.56	99.24	99.92	99.58	17.65

Table 2. mAP, #Params (Number of Parameters), and FLOPs of different network models with a batch size of 16.

Model	mAP_0.5 (%)	#Params	FLOPs
RepViT-M1.5-YOLOv4	89.99	22.76 M	22.46 G
RepViT-M1.5-YOLOv4-CBAM	89.28	22.84 M	22.46 G
YOLOv4	92.25	64.36 M	60.53 G
MobilenetV3-YOLOv4	79.01	11.73 M	7.70 G
MobilenetV2-YOLOv4	80.12	10.80 M	8.29 G
MobilenetV1-YOLOv4	79.72	12.69 M	10.65 G
YOLOv5s	86.20	7.277 M	17.16 G

Table 3. Summary of four representative simulation scenarios for the tiered UAV search strategy. Path deviation is measured relative to the brute-force optimal tour length within each tier.

Scenario	Tier 1/2/3 Targets	Spatial Distribution	Priority Enforced	Path Deviation
Balanced (Figure 13)	3/3/3	Uniform random	Yes	<10%
Priority-skewed	5/2/1	Uniform random	Yes	<12%
Clustered	3/3/3	Tier-clustered	Yes	<8%
Sparse wide-area	2/2/2	Dispersed edges	Yes	<15%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, H.; Zhang, Q.; Zhen, M. A Novel Object Detection-Based Air-to-Ground Target Search and Localization Strategy. Drones 2026, 10, 375. https://doi.org/10.3390/drones10050375
