Article

Research on Visual Target Detection Method for Smart City Unmanned Aerial Vehicles Based on Transformer

School of Astronautics, Harbin Institute of Technology, Harbin 150001, China
* Authors to whom correspondence should be addressed.
Aerospace 2025, 12(11), 949; https://doi.org/10.3390/aerospace12110949
Submission received: 9 September 2025 / Revised: 10 October 2025 / Accepted: 22 October 2025 / Published: 24 October 2025
(This article belongs to the Section Aeronautics)

Abstract

Unmanned aerial vehicles (UAVs) play a significant role in the automated inspection of future smart cities, helping to safeguard the lives and property of urban residents and the normal operation of the city. However, small targets in UAV images are often difficult to detect, and detection becomes unreliable when targets resemble their surroundings. To address these problems, this paper proposes a real-time target detection method for UAV images based on the Transformer. To compensate for the lack of visual features of small targets, a feature fusion module is designed that enables the interaction and fusion of features at different levels and improves the feature representation of small targets. To handle the discontinuous features that arise when a target resembles its environment, a Transformer-based multi-head attention algorithm is designed; by extracting the contextual information of the target, it improves the recognition of targets that are similar to their surroundings. On a target image dataset collected by UAVs in smart cities, the detection accuracy of the proposed method reaches 85.9%.

1. Introduction

As a critical component of smart city development, the intelligent identification of urban management events has garnered increasing global attention [1,2]. Smart cities, aiming to enhance operational efficiency and quality of life through cutting-edge technologies, have emerged as a transformative trend worldwide over recent decades [3,4]. In this context, two key drivers underscore the urgency of advancing intelligent event detection: First, manual identification and organization of urban management events—such as public order violations, infrastructure defects, or environmental hazards—are increasingly recognized as time-consuming and resource-intensive processes, consuming substantial human labor and financial resources [5]. Second, rapid progress in artificial intelligence (AI) and computer vision, coupled with the widespread deployment of monitoring tools like unmanned aerial vehicles (UAVs) and high-definition (HD) cameras [6], has opened new avenues for AI-powered, automated detection of urban management events [7]. These technologies have already found extensive applications in logistics transportation, agricultural plant protection, and scenic photography [8]. Notably, UAVs integrated into smart city ecosystems hold immense potential to improve daily life [9], particularly in scenarios requiring frequent and dynamic patrol inspections—for example, enabling law enforcement to rapidly respond to emergencies, ensuring public safety, or assisting power companies in detecting and resolving grid line faults to maintain stable energy supply [10,11,12].
However, real-world patrol operations face significant challenges. Complex urban architectures often obstruct human inspectors’ line of sight, degrading inspection efficacy [13]. Traditional manual patrols are not only labor-intensive but also prone to human errors and delays [14]. In contrast, equipping UAVs with fixed patrol routes and replacing human visual observation with UAV-mounted cameras could drastically reduce workload, enhance operational efficiency, and strengthen public safety guarantees [15]. This shift aligns with the core objectives of sustainable urban development by optimizing resource allocation and improving service responsiveness [16,17].
In recent years, neural network-based target detection techniques have undergone extensive research, making automated ground target detection by UAVs feasible [18]. During inspection missions, UAVs can leverage onboard processors for real-time detection and automatically alert command centers upon identifying anomalies [19]. Nevertheless, most existing target detection methods are optimized for natural scene images and exhibit notable limitations in UAV-based applications: (1) small targets in UAV imagery often lack discriminative visual features, and the downsampling operations inherent in deep neural networks may further degrade these sparse features, leading to poor detection performance [20]; (2) when targets share similar visual characteristics with their surroundings, current networks frequently fail to accurately classify such objects due to discontinuous or ambiguous feature representations [21].
To address these challenges, this paper introduces a novel UAV image target detection network termed the Deep Neural Network Object Detection Method based on Transformer (DODT). The proposed framework incorporates three key innovations:
(1) A hierarchical feature fusion module is designed to enable cross-level feature interaction and integration, thereby enhancing the feature representation capability of small targets.
(2) A Transformer-based multi-head attention mechanism is developed to capture long-range contextual dependencies, improving the recognition accuracy of targets with background-like appearances.
(3) Experimental evaluations on the COCO dataset demonstrate that DODT outperforms state-of-the-art detectors by achieving a 4.6% improvement in mean Average Precision (mAP) for small targets while maintaining competitive inference speed. These results validate the effectiveness of DODT in addressing the unique challenges of UAV-based urban inspection tasks, paving the way for more reliable and efficient smart city management systems.
The remainder of this article is organized as follows. Section 2 describes the research status of existing methods for UAV target detection. Section 3 presents the structure of the proposed network, and Section 4 presents the associated experimental findings and outcomes. Lastly, Section 5 summarizes the key conclusions.

2. Related Works

UAVs possess the capability to perform prolonged surveillance in densely populated urban areas, unhindered by terrain constraints, thereby safeguarding urban operational continuity and the safety of citizens’ lives and property. Nevertheless, manual operation of UAVs demands substantial human and material resources, while their excessive dependence on network signals severely restricts their applicability across diverse operational scenarios. To address these limitations, there is a critical need for UAVs to reliably detect ground objects using their onboard processors in varied real-world environments.
To enable automatic detection of ground targets by UAVs, numerous researchers have put forward task-specific detection and analysis techniques for UAV imagery. Jiang et al. designed a multi-scale feature extraction module that extracts feature information through convolution operations of different scales on multiple branches [22]. Yu et al. studied the accurate segmentation of maize tassels from UAV images in complex situations and achieved good segmentation accuracy [23]. Lin et al. developed a segmented detection model for rice blast detection and resistance assessment [24]. The above-mentioned research indicates that image analysis research for UAVs has advanced task-specific computational solutions, spanning multi-scale feature processing, precise target segmentation in complex scenarios, and disease detection modeling, collectively enhancing crop phenotyping and health evaluation capabilities. With the remarkable progress achieved in deep learning-based target detection, an increasing number of researchers have put forward neural network-based detection methodologies. Zuo et al. proposed a small-target detection model for drones, which improved the detection accuracy of small-target drones in air-to-air scenarios [25]. Tan et al. resampled feature maps with dilated convolution to improve the feature extraction and target detection performance on UAV images [26]. Masadeh et al. developed an autonomous UAV system to detect dynamic and uncertain intrusions in an area [27]. These studies demonstrate advances in UAV vision technologies, with innovations in small-target detection optimization, feature extraction enhancement via dilated convolution resampling, and autonomous intrusion monitoring for dynamic scenarios, collectively strengthening the adaptability and reliability of UAV systems in diverse operational environments.

3. Proposed Network Framework

Owing to the high operational altitudes of unmanned aerial vehicles (UAVs), targets within their visual field tend to be diminutive in size, with limited visual details—circumstances that demand the detection model exhibit strong performance in identifying small objects. In practical working environments, UAV altitudes may shift dramatically due to unforeseen contingencies. Consequently, the scale of observed objects in the UAV’s field of view also varies dynamically, necessitating that the detection model possess robust capabilities for multi-scale object detection.
At present, researchers have conducted a number of studies on deep learning-based object detection models and algorithms. To balance accuracy and efficiency in real-time video object detection, commonly adopted models include Fast R-CNN, YOLO, and SSD. This paper proposes a deep neural network method based on the Transformer. In the initial part of the network, a convolutional neural network (CNN) is used to extract image features; a path aggregation network (PAN) then performs multi-scale feature fusion, with the Transformer embedded in it to enhance context information. The Transformer module captures long-distance dependencies in images through the combination of multi-head self-attention and a feedforward network, thereby enhancing the detection performance of the algorithm in complex scenarios. The overall network architecture is shown in Figure 1.
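To make the data flow concrete, the following minimal PyTorch sketch wires a small CNN backbone, a PAN-style top-down fusion, and a Transformer encoder layer on the deepest feature map in the manner described above. It is an illustrative toy, not the authors' implementation; the module names (ConvBNAct, TinyDODT), channel widths, and detection head are assumptions.

```python
# Illustrative sketch (not the authors' code): CNN backbone -> Transformer
# encoder on the deepest map -> PAN-style top-down fusion -> dense head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNAct(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class TinyDODT(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.stem = ConvBNAct(3, 32, s=2)    # /2
        self.c3 = ConvBNAct(32, 64, s=2)     # /4  -> P3
        self.c4 = ConvBNAct(64, 128, s=2)    # /8  -> P4
        self.c5 = ConvBNAct(128, 256, s=2)   # /16 -> P5
        # Transformer encoder layer injects global context on the deepest map
        self.tr = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
        # PAN-style top-down fusion: lateral 1x1, upsample, concatenate, fuse
        self.lat5 = ConvBNAct(256, 128, k=1)
        self.fuse4 = ConvBNAct(256, 128, k=3)
        self.lat4 = ConvBNAct(128, 64, k=1)
        self.fuse3 = ConvBNAct(128, 64, k=3)
        self.head = nn.Conv2d(64, num_classes + 4, 1)  # class scores + box offsets

    def forward(self, x):
        p3 = self.c3(self.stem(x))
        p4 = self.c4(p3)
        p5 = self.c5(p4)
        b, c, h, w = p5.shape
        seq = p5.flatten(2).transpose(1, 2)            # (B, HW, C) token sequence
        p5 = self.tr(seq).transpose(1, 2).reshape(b, c, h, w)
        t4 = self.fuse4(torch.cat([F.interpolate(self.lat5(p5), scale_factor=2), p4], 1))
        t3 = self.fuse3(torch.cat([F.interpolate(self.lat4(t4), scale_factor=2), p3], 1))
        return self.head(t3)                           # dense predictions at stride 4

x = torch.randn(1, 3, 640, 512)
print(TinyDODT()(x).shape)   # torch.Size([1, 10, 160, 128])
```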
Within these components, Conv refers to the convolutional layer, primarily functioning for feature extraction tasks. In the present study, a 1 × 1 convolution kernel is initially employed to consolidate the channel-specific information from incoming features, thereby projecting these inputs into a lower-dimensional feature space—here, the channel count of the input features defines the dimensionality. To extract features from these reduced-dimensional feature maps, a multi-branch architecture utilizing 3 × 3 convolution kernels is adopted, with the channel dimension of each branch set to 1/32 of that in the original input feature maps. Following this, outputs from all branches are fused, and their features are combined through another 1 × 1 convolution kernel. To address gradient attenuation issues, this paper’s proposed methodology incorporates batch normalization (BN), which can be mathematically formulated as follows:
$$b = \mathrm{BN}\left( C_2\left( C_1(a) + \sum_{i=1}^{T} I_i(a) \right) \right)$$
where $a$ denotes the input feature map, $b$ the output feature map, and $C_1$ the 1 × 1 standard convolutional layer, whose channel dimension is set to half that of the input feature map. $I_i(a)$ denotes the process of projecting the input features into a low-dimensional embedding space and transforming them. Each $I_i$ shares an identical architecture, comprising a 1 × 1 convolution and a 3 × 3 convolution whose channel dimension is 1/32 of that of the input feature map, a design that aids in isolating key factors and allows scaling to accommodate multiple transformations. $T$ represents the cardinality of the transformation set to be aggregated and regulates the number of homogeneous multi-branch configurations. $C_2$ corresponds to the 1 × 1 standard convolutional layer applied to the fused features, and BN denotes the batch normalization layer.
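A minimal sketch of the aggregated multi-branch block expressed by the formula above is given below, assuming that each branch projects back to the same width as $C_1(a)$ so the summation is well defined (a detail the text leaves open); the class name AggregatedBlock and the choice T = 16 are illustrative.

```python
# Sketch of b = BN(C2(C1(a) + sum_i I_i(a))); the exact channel routing of
# each branch is an assumption where the text leaves it open.
import torch
import torch.nn as nn

class AggregatedBlock(nn.Module):
    def __init__(self, channels, T=16):
        super().__init__()
        mid = channels // 2     # C1 projects to half the input channels
        slim = channels // 32   # each branch works at 1/32 of the input width
        self.c1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, slim, 1, bias=False),        # low-dim embedding
                nn.Conv2d(slim, mid, 3, padding=1, bias=False),  # transform back to mid (assumed)
            )
            for _ in range(T)
        ])
        self.c2 = nn.Conv2d(mid, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, a):
        y = self.c1(a) + sum(branch(a) for branch in self.branches)
        return self.bn(self.c2(y))

a = torch.randn(1, 64, 80, 80)
print(AggregatedBlock(64)(a).shape)   # torch.Size([1, 64, 80, 80])
```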
The C2f module executes feature transformation on incoming data via two convolutional layers. By concatenating features derived from separate branches along the channel dimension, it facilitates the integration of these features. The resulting concatenated features, which incorporate details from diverse branches, serve to enhance the descriptive capability of the feature representation. The structure of C2f is shown in Figure 2.
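The following sketch follows the common C2f design (split after the first convolution, chain bottlenecks on one branch, concatenate all intermediate outputs along the channel dimension, fuse with a second convolution), which the description above appears to mirror; the bottleneck internals and branch count are assumptions.

```python
# Minimal C2f-style sketch: two convolutions frame a split-and-concatenate
# structure so features from separate branches are integrated channel-wise.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)          # residual connection keeps gradients flowing

class C2f(nn.Module):
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.hidden, 1, bias=False)
        self.blocks = nn.ModuleList(Bottleneck(self.hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, c_out, 1, bias=False)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two branches
        for m in self.blocks:
            y.append(m(y[-1]))                  # chain bottlenecks on the last branch
        return self.cv2(torch.cat(y, dim=1))    # channel-wise concatenation, then fuse

x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```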
The canonical Transformer module primarily comprises a multi-head self-attention (MHSA) mechanism and linear layers. The MHSA computes global relationships among sequence tokens via query–key–value interactions and further learns these global features through linear networks. However, the standard Transformer module may exhibit the following limitations: (1) the pure Transformer architecture directly slices and encodes input images, which hinders the extraction of fine-grained visual details; (2) the computational burden and memory consumption of MHSA scale quadratically with the spatial dimensions, leading to substantial training and inference overheads. To enhance Transformer performance in visual tasks, this paper proposes a Convolutional Transformer (CT) network that integrates CNN and Transformer components; the detailed architecture of CT is presented in Figure 3. CT shares the same topological structure as the lightweight feature extraction modules but replaces their homogeneous multi-branch structures with self-attention mechanisms. This simplified design reduces the number of free hyperparameters to be selected, thereby improving model stability across diverse scenarios.
To address the aforementioned issues with conventional MHSA, CT substitutes the position-wise linear projections in the original MHSA with convolutional projections. Specifically, a convolutional projection layer is employed to project the input 2D token map, which is then flattened into a 1D sequence for subsequent processing. This can be formulated as follows:
$$t_i^{Q/K/V} = \mathrm{Flatten}\left( \mathrm{Conv}(t_i),\ s \right)$$
where $t_i^{Q/K/V}$ is the input token for the query/key/value (Q/K/V) matrices at layer $i$; $t_i$ represents the unaltered token map preceding the convolutional projection step; $\mathrm{Conv}$ refers to a standard convolution operation; and $s$ denotes the size of the convolution kernel.
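A possible realization of this convolutional projection is sketched below: the 2D token map is convolved (here with depthwise convolutions, an assumption), flattened into a 1D sequence, and passed to standard multi-head self-attention. The class name ConvProjectionAttention and the kernel size s = 3 are illustrative.

```python
# Convolutional Q/K/V projections replacing the usual linear projections.
import torch
import torch.nn as nn

class ConvProjectionAttention(nn.Module):
    def __init__(self, dim, heads=8, s=3):
        super().__init__()
        # depthwise convolutions over the 2D token map (assumed choice)
        self.q = nn.Conv2d(dim, dim, s, padding=s // 2, groups=dim, bias=False)
        self.k = nn.Conv2d(dim, dim, s, padding=s // 2, groups=dim, bias=False)
        self.v = nn.Conv2d(dim, dim, s, padding=s // 2, groups=dim, bias=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, t):                      # t: (B, C, H, W) token map
        def flatten(x):                        # (B, C, H, W) -> (B, HW, C)
            return x.flatten(2).transpose(1, 2)
        q, k, v = flatten(self.q(t)), flatten(self.k(t)), flatten(self.v(t))
        out, _ = self.attn(q, k, v)            # global token-to-token attention
        b, c, h, w = t.shape
        return out.transpose(1, 2).reshape(b, c, h, w)

t = torch.randn(1, 128, 40, 32)
print(ConvProjectionAttention(128)(t).shape)   # torch.Size([1, 128, 40, 32])
```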
SPPF (Spatial Pyramid Pooling Fast) is designed to facilitate feature fusion and pooling tasks. By executing multi-scale pooling through pooling kernels of varying dimensions and combining the resulting outputs, it extracts features spanning multiple scales, ultimately enhancing the model’s ability to detect objects of different sizes. SCDown (Selective Channel Downsampling), a lightweight downsampling component, initially modifies the channel dimensionality via pointwise convolution, followed by spatial downsampling using depthwise convolution. This design effectively curbs computational expenses and parameter counts while preserving model accuracy. Functioning as a single-stage detection framework, the methodology presented here necessitates only one forward network pass to directly forecast object positions and class labels. It further leverages non-maximum suppression (NMS) to eradicate redundant, low-quality detection bounding boxes, yielding the ultimate detection outcomes.
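The following sketches illustrate these two components as described, assuming the usual SPPF layout with three stacked 5 × 5 max-pooling operations and an SCDown built from a pointwise convolution followed by a stride-2 depthwise convolution; exact kernel sizes and channel widths are assumptions.

```python
# SPPF: repeated pooling + concatenation for multi-scale features.
# SCDown: pointwise channel adjustment, then stride-2 depthwise downsampling.
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, hidden, 1, bias=False)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(4 * hidden, c_out, 1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)      # stacking the same pooling three times emulates
        y2 = self.pool(y1)     # progressively larger receptive fields
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

class SCDown(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)            # channel change
        self.dw = nn.Conv2d(c_out, c_out, 3, stride=2, padding=1,
                            groups=c_out, bias=False)              # spatial /2

    def forward(self, x):
        return self.dw(self.pw(x))

x = torch.randn(1, 256, 40, 32)
print(SPPF(256, 256)(x).shape)     # torch.Size([1, 256, 40, 32])
print(SCDown(256, 512)(x).shape)   # torch.Size([1, 512, 20, 16])
```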
In neural network-based salient object detection methods, the adoption of multiple merging operations leads to a reduction in feature map dimensions, which tends to obscure the boundaries of salient objects. Additionally, image information undergoes severe degradation as it traverses the network. Conventional fusion approaches relying on direct layer stacking face inherent limitations, particularly when processing images through deep networks where substantial low-level feature information is lost. Notably, as network depth increases, the severity of information loss escalates. In fact, the missing low-level visual information proves highly valuable for achieving accurate recognition. To address these challenges, this paper introduces a hierarchical feature fusion strategy that capitalizes on the complementarity of different feature levels, as shown in Figure 4.
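As a minimal illustration of cross-level fusion, the sketch below upsamples a deep, semantically strong feature map and combines it with a shallow, detail-rich map so that low-level information lost during downsampling is reintroduced; the module name CrossLevelFusion and the concatenation-based fusion are assumptions, not the exact structure of Figure 4.

```python
# Cross-level fusion sketch: align channels and resolution, then mix levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    def __init__(self, c_shallow, c_deep, c_out):
        super().__init__()
        self.reduce = nn.Conv2d(c_deep, c_shallow, 1, bias=False)   # align channels
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_shallow, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU(),
        )

    def forward(self, shallow, deep):
        deep = F.interpolate(self.reduce(deep), size=shallow.shape[-2:],
                             mode="nearest")                        # align resolution
        return self.fuse(torch.cat([shallow, deep], dim=1))         # cross-level mix

shallow = torch.randn(1, 64, 160, 128)   # early layer: edges and textures
deep = torch.randn(1, 256, 40, 32)       # late layer: semantics
print(CrossLevelFusion(64, 256, 128)(shallow, deep).shape)  # torch.Size([1, 128, 160, 128])
```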
Cross-entropy loss is standard in classification and widely adopted in salient object detection to contrast predicted saliency maps with ground truth. The formulation can be expressed as follows:
$$L_D = -\sum_{i=0}^{\mathrm{size}(X)} \left[ X_i \log(Y_i) + (1 - X_i)\log(1 - Y_i) \right]$$
where $X_i$ denotes the ground truth and $Y_i$ denotes the predicted saliency map. In DODT, the aggregate loss combines weighted contributions from the classification and regression sub-tasks. The classification loss uses binary cross-entropy (BCEL) to measure differences between predicted class probabilities and the ground truth, expressed as follows:
$$L_C = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log\big(P(y_i)\big) + (1 - y_i)\log\big(1 - P(y_i)\big) \right]$$
where $y_i$ is a binary label taking values of 0 or 1, $P(y_i)$ denotes the probability that the output belongs to the label, and $N$ represents the number of groups of objects predicted by the model. Meanwhile, the regression loss incorporates the bounding box regression loss (BBRL).
$$L_B = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{c^2}$$
where $\mathrm{IoU}$ measures the overlap between the detection box $b$ and the ground truth box $b^{gt}$, $\rho(\cdot,\cdot)$ denotes the Euclidean distance between the centers of the two boxes, and $c$ is the diagonal length of their minimal enclosing box. BBRL minimizes coordinate discrepancies between predicted and true bounding boxes. The three terms are combined via a weighted scheme to obtain the total loss:
$$L = W_C L_C + W_D L_D + W_B L_B$$
where $W_C$, $W_D$, and $W_B$ denote the weight coefficients of the respective loss terms.
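A compact sketch of how these three terms could be combined is given below, using binary cross-entropy for the saliency and classification terms and a DIoU-style box term matching the formula for $L_B$ above; the weight values and the helper names diou_loss and total_loss are illustrative.

```python
# Weighted total loss L = W_C*L_C + W_D*L_D + W_B*L_B (illustrative weights).
import torch
import torch.nn.functional as F

def diou_loss(pred, target):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    # squared distance between box centers (rho^2 in the formula)
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)
    # squared diagonal of the minimal enclosing box (c^2 in the formula)
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7
    return (1 - iou + rho2 / c2).mean()

def total_loss(sal_pred, sal_gt, cls_pred, cls_gt, box_pred, box_gt,
               w_c=1.0, w_d=1.0, w_b=2.0):
    # sal_pred and cls_pred are probabilities in [0, 1]
    l_d = F.binary_cross_entropy(sal_pred, sal_gt)     # saliency term L_D
    l_c = F.binary_cross_entropy(cls_pred, cls_gt)     # classification term L_C
    l_b = diou_loss(box_pred, box_gt)                  # box regression term L_B
    return w_c * l_c + w_d * l_d + w_b * l_b
```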
Given that the information extracted via the fusion of Conv-1 and Conv-2 primarily comprises edge details from images, this study proposes the concept of supervising the extracted edge information using ground truth edge data. Notably, the Laplacian operator—which is a widely adopted second-order derivative operator for edge detection—is defined in two-dimensional space as follows:
$$\Delta f = \frac{\partial^2 f}{\partial p^2} + \frac{\partial^2 f}{\partial q^2}$$
where $f$ denotes the image function and $(p, q)$ are the spatial coordinates.
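One simple way to obtain the supervising edge map is to convolve the ground-truth mask with a discrete Laplacian kernel, as sketched below; the thresholding step and the helper name laplacian_edges are assumptions.

```python
# Extract an edge map from the ground-truth mask with a discrete Laplacian.
import torch
import torch.nn.functional as F

def laplacian_edges(mask: torch.Tensor) -> torch.Tensor:
    """mask: (B, 1, H, W) binary ground truth; returns a binary edge map."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]], device=mask.device).view(1, 1, 3, 3)
    response = F.conv2d(mask.float(), kernel, padding=1)   # second-order derivative
    return (response.abs() > 0).float()                    # non-zero response marks edges

gt = torch.zeros(1, 1, 8, 8)
gt[..., 2:6, 2:6] = 1.0
print(laplacian_edges(gt)[0, 0])    # ones ring the boundary of the square
```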
The weighted fusion module employed in this paper can be formulated as follows:
$$F_O = \frac{\sum_i w_i \cdot F_{I_i}}{\sum_i w_i + \upsilon}$$
where $F_{I_i}$ denotes the $i$-th input feature, $w_i \ge 0$ is a trainable weight parameter, and $\upsilon$ represents a small constant introduced to enhance numerical stability.
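The sketch below implements this weighted fusion with trainable non-negative weights, applying a ReLU to keep each $w_i \ge 0$ and adding the small constant for numerical stability; the class name WeightedFusion and the default eps value are illustrative.

```python
# Learnable weighted fusion: F_O = sum_i(w_i * F_i) / (sum_i w_i + eps).
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):                 # feats: list of same-shape tensors
        w = torch.relu(self.w)                # enforce w_i >= 0
        fused = sum(wi * f for wi, f in zip(w, feats))
        return fused / (w.sum() + self.eps)   # small eps keeps the division stable

fuse = WeightedFusion(3)
feats = [torch.randn(1, 64, 40, 32) for _ in range(3)]
print(fuse(feats).shape)                      # torch.Size([1, 64, 40, 32])
```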

4. Experiments and Analysis

In this section, experimental investigations were performed on a UAV image dataset. The obtained results corroborated the validity of the theoretical analyses. These tests were executed on a server configured with 13th Gen Intel(R) Core(TM) i7-13700F CPU (Intel Corporation, Santa Clara, CA, USA) and NVIDIA GeForce RTX 4060 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). To ensure fair comparison, all simulation results presented in this work were generated under standardized configurations: we adopted a batch size of 1, utilized FP32 precision for model inference, developed the framework based on PyTorch 2.1.1, and explicitly included non-maximum suppression (NMS) processing in the detection pipeline. Certain datasets commonly utilized in object detection tasks are grounded in natural scene environments, where their sample categories and feature distributions diverge notably from those of UAV imagery. Put differently, these datasets fail to capture the distinct characteristics inherent to UAV-based object detection scenarios. To assess the performance of our proposed model in real-world UAV operational settings, we leveraged a collection of authentic UAV-captured images. This dataset encompasses diverse time periods, weather conditions, lighting scenarios, and complex backgrounds. Following sorting and annotation processes, a UAV image dataset comprising 15,000 images was compiled. In alignment with the detection requirements of UAVs in practical applications, these images were categorized into classes including background, freight car, van, bus, truck, and car; Figure 5 shows some examples of the dataset. As can be seen from the figure, the dataset contains UAV-captured images of various scenes under different time periods and lighting conditions. In the experiments of this section, 60% of the dataset was randomly selected as the training set, 20% as the validation set, and the remaining 20% as the test set. The hyperparameter settings for model training are shown in Table 1.
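For reference, the following sketch reproduces the 60/20/20 random split and the single-image FP32 inference loop with NMS described above; the dataset and model objects are placeholders, and the helper names split_dataset and infer are assumptions.

```python
# 60/20/20 random split and single-image FP32 inference with NMS.
import torch
from torch.utils.data import random_split, DataLoader
from torchvision.ops import nms

def split_dataset(dataset, seed=0):
    n = len(dataset)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    n_test = n - n_train - n_val
    return random_split(dataset, [n_train, n_val, n_test],
                        generator=torch.Generator().manual_seed(seed))

@torch.no_grad()
def infer(model, dataset, iou_thresh=0.45):
    model.eval().float()                        # FP32 inference
    loader = DataLoader(dataset, batch_size=1)  # batch size 1, as in the benchmark
    results = []
    for image, _ in loader:
        boxes, scores = model(image)            # assumed (N, 4) boxes and (N,) scores
        keep = nms(boxes, scores, iou_thresh)   # NMS included in the timed pipeline
        results.append((boxes[keep], scores[keep]))
    return results
```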

4.1. Training Process

During the training process, to illustrate the trade-off between precision and recall at varying thresholds, this study employed the P-R curve to depict the recognition rates of the proposed DODT for the different vehicle categories on the validation set. As shown in Figure 6, the model exhibited high detection accuracy across all categories, particularly for the “car” category, which reached 95.8%; overall, the model achieved an average accuracy of 80.5%. Subsequently, this section used the confusion matrix to assess the accuracy of the proposed model’s predictions, as shown in Figure 7. In this confusion matrix, each column denotes the predicted proportion of each category, while each row corresponds to the true proportion of each category within the dataset. Analysis of the confusion matrix reveals that the prediction accuracies for the categories “car,” “truck,” “bus,” “van,” and “freight car” are 92%, 62%, 89%, 57%, and 54%, respectively, confirming that DODT attains high accuracy for most categories on the UAV image dataset, with visually similar classes such as van and freight car remaining the most challenging.

4.2. Test Results

To highlight the effectiveness of the method proposed in this paper, this section compares it with several state-of-the-art approaches across multiple metrics, including mean Average Precision (mAP), Frames Per Second (FPS), and recognition results for different targets. The compared methods encompass YOLOv3 [28], YOLOv4 [29], YOLOv5-S (Ultralytics, v5.0) [30], YOLOv6-N (Meituan-Yolo, v6.0) [31], YOLOv8 (Ultralytics, v8.0) [32], Region-based Convolutional Neural Networks (R-CNN) [33], the Single Shot MultiBox Detector (SSD) [34], Real-Time Object Detection (RTD) [35], YOLOv7-Transformer (YOLOv7-TR), and the Real-Time Detection Transformer (RT-DETR). Table 2 presents the test outcomes of all methods under investigation, with the optimal result for each entry highlighted in bold, where Fcar denotes the freight car class.
Experimental results demonstrate that DODT achieves the highest detection accuracy, with a mean Average Precision (mAP) of 77.7%. This represents a 2.5% mAP improvement over the second-ranked method RTD, a 3.4% mAP advantage over the third-placed YOLOv8, and a 0.6% margin over RT-DETR (77.1%). Concurrently, DODT attains an inference speed of 89.7 Frames Per Second (FPS), which is faster than R-CNN by a factor of 9.08 and exceeds YOLOv4 by a factor of 1.96. While DODT is marginally slower than YOLOv6-N, YOLOv8, RTD, and RT-DETR, its speed remains at a practical level; despite being 13.4 FPS slower than YOLOv8 and 15.6 FPS slower than the fastest method, RT-DETR, it delivers superior detection accuracy. Furthermore, DODT maintains a compact model size of 13.7 million parameters, which is relatively small among all evaluated methods. It can therefore be concluded that DODT achieves a favorable trade-off between detection accuracy, inference speed, and model size, making it well suited for aerial object detection on computationally constrained platforms such as drones in smart city applications. The YOLOv7-Transformer achieves a competitive mAP of 75.1%, outperforming YOLOv8, largely because its Transformer-enhanced feature fusion improves small-target detection; however, its computational overhead limits real-time performance to 38 FPS, significantly slower than DODT and RT-DETR. RT-DETR pairs a high mAP with the fastest speed, while DODT maintains superior robustness in complex scenarios: it leads on background-similar targets such as van and performs strongly on freight car, validating its hierarchical fusion and context-aware attention for challenging UAV environments. DODT thus strikes a distinctive balance, prioritizing occlusion resilience and feature continuity where pure Transformer or CNN models falter. Comparison with the above methods verifies the effectiveness of DODT as a method integrating CNN and Transformer modules on the UAV image dataset. Furthermore, DODT achieves high detection accuracy across most object categories; for the car, bus, and van categories, it delivers mAP scores of 95.8%, 93.9%, and 64.9%, respectively. Truck and freight car are the only categories in which it is marginally outperformed, by RTD and YOLOv8, respectively. Compared with detectors with more parameters, including YOLOv3, YOLOv4, R-CNN, and YOLOv5-S, DODT demonstrates the highest accuracy overall. A closer look at the per-category results shows that the slight deficits on truck and freight car stem mainly from category-specific environmental challenges, including background similarity and occlusion; in such extreme cases, purely CNN-based methods may still hold a slight advantage. By contrast, DODT performs better on small objects such as car.
Figure 8 shows the object detection results of each method under varying lighting and scene conditions. Because the detection speeds of methods such as R-CNN, SSD, YOLOv3, and YOLOv4 are insufficient to meet the real-time ground object detection requirements of UAVs, we conducted a visual comparison of the remaining methods, namely YOLOv6-N, RTD, YOLOv8, and DODT, on the UAV dataset. In the figure, the first row displays the original images, and the subsequent rows show the detection results generated by each method. Visual analysis reveals that YOLOv6-N missed several objects. While RTD’s detection performance is acceptable, it exhibits target misclassification errors (e.g., misidentifying a “Bus” as a “Freight car”). YOLOv8 demonstrates generally good performance but fails to detect objects under occlusion, such as those obscured by trees. In contrast, DODT demonstrates robust detection across varying object scales, detecting targets with minimal omissions and outperforming the other methods in detection completeness; notably, it accurately detected and correctly classified occluded objects. To sum up, DODT achieves a 77.7% mAP on the UAV image dataset, surpassing state-of-the-art methods such as YOLOv8 and RTD. Its targeted improvement for small targets is further reflected in category-specific gains: the prediction accuracies for trucks and freight cars reach 62% and 54%, up from YOLOv8’s 50% and 52%, respectively. The hierarchical feature fusion module (Figure 4) and the Convolutional Transformer (Figure 3) enhance small-object detection in high-altitude scenarios, and Figure 8 provides visual evidence of robust performance for occluded and small targets. Overall, the experimental results indicate that DODT possesses strong multi-scale object detection capabilities, notable occlusion resilience, and robustness to confounding information.
To further validate the efficacy of DODT in target detection for smart city applications, we deployed this model for special event identification within a localized smart city setting, utilizing the dedicated dataset specifically designed for detecting anomalous events in urban environments (DAEUE-dataset). Representative results from this practical implementation are depicted in Figure 9. As shown in the figure, DODT effectively detects diverse urban special events across complex city environments, including packaged waste (DBLJ), exposed garbage (BLLJ), road occupation (ZDJY), material stockpiles (DWDL), improper umbrella usage (WGCS), and non-motor vehicle violations (FJDC), demonstrating robust recognition capabilities in challenging real-world scenarios. DODT achieves 76.2% mAP on the DAEUE-dataset for urban anomaly detection, excelling in small-object tasks (e.g., waste piles, vehicle violations) with 68.5% mAP for sub-32 × 32 anomalies and a 6.3% small-target mAP lead over YOLOv8. This stems from its Transformer-based attention mechanism, which captures contextual dependencies in cluttered urban settings to ensure reliable detection of targets blending with backgrounds.
We conducted evaluations of the proposed DODT method on the public dataset Microsoft Common Objects in Context (MS COCO) to assess its generalization capability across diverse scenarios. Unlike drone image datasets, COCO represents a highly challenging public benchmark characterized by dozens of distinct object categories. Figure 10 presents visualized instances of DODT results on the COCO dataset. Visual evidence confirms DODT’s robust performance in detecting small objects across varied scenarios, underpinned by its effective small-target feature learning. The method also maintains high precision in identifying specific object classes within challenging scenes characterized by backgrounds exhibiting high similarity to the target objects. In addition, these visualizations demonstrate that our method performs consistently well across diverse scenarios, indicating its robust generalization capability.
Table 3 presents a comparative analysis of DODT and other state-of-the-art models on the COCO dataset, listing the input resolutions of the different methods and comparing results under the two evaluation metrics mAP0.5 and mAP0.5:0.95. As shown in Row 5 of Table 3, our proposed DODT method achieves a detection accuracy of 38.9% (mAP0.5:0.95) on the COCO dataset, which is slightly lower than that of YOLOv10. However, when evaluated against advanced approaches including YOLOv8, SSD, and RTD, DODT demonstrates clear accuracy advantages, outperforming these methods by relative margins of 4.29%, 35.07%, and 15.77%, respectively, in terms of the mAP0.5:0.95 metric. These improvements substantiate the effectiveness of the proposed approach to some extent. To sum up, on UAV image data DODT demonstrates superior detection performance over these existing approaches, which is primarily attributed to the fact that UAV image datasets are predominantly composed of small objects. These experimental findings suggest, to a certain degree, that DODT is better suited than alternative methods to small object detection in UAV images, particularly under conditions involving background-similar contexts.

5. Conclusions

This paper introduces DODT, a real-time object detection framework specifically designed for autonomous UAV inspection in smart city environments. To address key challenges in UAV imagery analysis, namely the difficulty of detecting small objects and the low target–background contrast, DODT incorporates a hierarchical feature fusion mechanism based on the Transformer architecture. This design enhances the model’s capability for robust small object detection and improves performance under conditions of indistinct or discontinuous feature boundaries. Evaluated on a real-world UAV image dataset collected from operational smart city scenarios, DODT achieves a mAP of 77.7% while processing frames at 89.7 FPS. The framework also demonstrates satisfactory detection performance in identifying specific exceptional events relevant to smart city management. Furthermore, experimental results on standard benchmarks indicate that DODT achieves high-precision real-time detection across diverse experimental conditions, fulfilling the critical requirements for automated inspection tasks within the smart city domain.

Author Contributions

Conceptualization, B.Q. and M.H.; methodology, B.Z. and M.H.; software, B.Z. and M.H.; validation, B.Q., M.H. and H.S.; formal analysis, B.Q. and B.Z.; investigation, B.Z., R.M. and B.Q.; resources, R.M.; data curation, B.Z. and B.Q.; writing—original draft preparation, B.Z. and B.Q.; writing—review and editing, B.Z., B.Q. and R.M.; visualization, M.H. and H.S.; supervision, B.Q. and M.H.; project administration, R.M.; funding acquisition, R.M. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 12272104 and U22B2013).

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mora, L.; Gerli, P.; Batty, M.; Royall, E.B.; Carfi, N.; Coenegrachts, K.-F.; de Jong, M.; Facchina, M.; Janssen, M.; Meijer, A.; et al. Confronting the smart city governance challenge. Nat. Cities 2025, 2, 110–113. [Google Scholar] [CrossRef]
  2. Kaiser, Z.A.; Deb, A. Sustainable smart city and sustainable development goals (sdgs): A review. Reg. Sustain. 2025, 6, 100193. [Google Scholar] [CrossRef]
  3. Dahmane, W.M.; Ouchani, S.; Bouarfa, H. Smart cities services and solutions: A systematic review. Data Inf. Manag. 2025, 9, 100087. [Google Scholar] [CrossRef]
  4. Abu-Rayash, A.; Dincer, I. Development of an integrated model for environmentally and economically sustainable and smart cities. Sustain. Energy Technol. Assess. 2025, 73, 104096. [Google Scholar] [CrossRef]
  5. Behmanesh, H.; Brown, A. Improving the design and management of temporary events in public spaces by applying urban design criteria. J. Urban Manag. 2025; in press. [Google Scholar] [CrossRef]
  6. Fang, H.; Xia, M.; Zhou, G.; Chang, Y.; Yan, L. Infrared small uav target detection based on residual image prediction via global and local dilated residual networks. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  7. Bocheng, Z.; Mingying, H.; Zheng, L.; Wenyu, F.; Ze, Y.; Naiming, Q.; Shaohai, W. Graph-based multi-agent reinforcement learning for collaborative search and tracking of multiple uavs. Chin. J. Aeronaut. 2025, 38, 103214. [Google Scholar]
  8. Salameh, H.B.; Alhafnawi, M.; Masadeh, A.; Jararweh, Y. Federated reinforcement learning approach for detecting uncertain deceptive target using autonomous dual uav system. Inf. Process. Manag. 2023, 60, 103149. [Google Scholar] [CrossRef]
  9. Dong, Y.; Ma, Y.; Li, Y.; Li, Z. High-precision real-time uav target recognition based on improved yolov4. Comput. Commun. 2023, 206, 124–132. [Google Scholar] [CrossRef]
  10. Zhao, B.; Huo, M.; Li, Z.; Yu, Z.; Qi, N. Clustering-based hyper-heuristic algorithm for multi-region coverage path planning of heterogeneous uavs. Neurocomputing 2024, 610, 128528. [Google Scholar] [CrossRef]
  11. Zhao, B.; Huo, M.; Li, Z.; Yu, Z.; Qi, N. Graph-based multi-agent reinforcement learning for large-scale uavs swarm system control. Aerosp. Sci. Technol. 2024, 150, 109166. [Google Scholar] [CrossRef]
  12. Wang, C.; Tian, J.; Cao, J.; Wang, X. Deep learning-based uav detection in pulse-doppler radar. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  13. Ahmad, J.; Sajjad, M.; Eisma, J. Small unmanned aerial vehicle (uav)-based detection of seasonal micro-urban heat islands for diverse land uses. Int. J. Remote. Sens. 2025, 46, 119–147. [Google Scholar] [CrossRef]
  14. Zhai, Y.; Wang, L.; Yao, Y.; Jia, J.; Li, R.; Ren, Z.; He, X.; Ye, Z.; Zhang, X.; Chen, Y.; et al. Spatially continuous estimation of urban forest aboveground biomass with uav-lidar and multispectral scanning: An allometric model of forest structural diversity. Agric. For. Meteorol. 2025, 360, 110301. [Google Scholar] [CrossRef]
  15. Raju, M.R.; Mothku, S.K.; Somesula, M.K.; Chebrolu, S. Age and energy aware data collection scheme for urban flood monitoring in uav-assisted wireless sensor networks. Ad Hoc Netw. 2025, 168, 103704. [Google Scholar] [CrossRef]
  16. Chao, D.; Zhang, Y.; Ziye, J.; Yiyang, L.; Zhang, L.; Qihui, W. Three-dimension collision-free trajectory planning of uavs based on ads-b information in low-altitude urban airspace. Chin. J. Aeronaut. 2025, 38, 103170. [Google Scholar]
  17. Yang, J.; Qin, D.; Tang, H.; Tao, S.; Bie, H.; Ma, L. Dinov2-based uav visual self-localization in low-altitude urban environments. IEEE Robot. Autom. Lett. 2025, 10, 2080–2087. [Google Scholar] [CrossRef]
  18. Liu, J.; Wen, B.; Xiao, J.; Sun, M. Design of uav target detection network based on deep feature fusion and optimization with small targets in complex contexts. Neurocomputing 2025, 639, 130207. [Google Scholar] [CrossRef]
  19. Zhao, B.; Huo, M.; Yu, Z.; Qi, N.; Wang, J. Model-reference reinforcement learning for safe aerial recovery of unmanned aerial vehicles. Aerospace 2023, 11, 27. [Google Scholar] [CrossRef]
  20. Lu, S.; Guo, Y.; Long, J.; Liu, Z.; Wang, Z.; Li, Y. Dense small target detection algorithm for uav aerial imagery. Image Vis. Comput. 2025, 156, 105485. [Google Scholar] [CrossRef]
  21. Semenyuk, V.; Kurmashev, I.; Lupidi, A.; Alyoshin, D.; Kurmasheva, L.; Cantelli-Forti, A. Advances in uav detection: Integrating multi-sensor systems and ai for enhanced accuracy and efficiency. Int. J. Crit. Infrastruct. Prot. 2025, 49, 100744. [Google Scholar] [CrossRef]
  22. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. Mffsodnet: Multiscale feature fusion small object detection network for uav aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar] [CrossRef]
  23. Yu, X.; Yin, D.; Nie, C.; Ming, B.; Xu, H.; Liu, Y.; Bai, Y.; Shao, M.; Cheng, M.; Liu, Y.; et al. Maize tassel area dynamic monitoring based on near-ground and uav rgb images by u-net model. Comput. Electron. Agric. 2022, 203, 107477. [Google Scholar] [CrossRef]
  24. Shaodan, L.; Yue, Y.; Jiayi, L.; Xiaobin, L.; Jie, M.; Haiyong, W.; Zuxin, C.; Dapeng, Y. Application of uav-based imaging and deep learning in assessment of rice blast resistance. Rice Sci. 2023, 30, 652–660. [Google Scholar] [CrossRef]
  25. Zuo, G.; Zhou, K.; Wang, Q. Uav-to-uav small target detection method based on deep learning in complex scenes. IEEE Sensors J. 2024, 25, 3806–3820. [Google Scholar] [CrossRef]
  26. Tan, L.; Lv, X.; Lian, X.; Wang, G. Yolov4_drone: Uav image target detection based on an improved yolov4 algorithm. Comput. Electr. Eng. 2021, 93, 107261. [Google Scholar] [CrossRef]
  27. Masadeh, A.; Alhafnawi, M.; Salameh, H.A.B.; Musa, A.; Jararweh, Y. Reinforcement learning-based security/safety uav system for intrusion detection under dynamic and uncertain target movement. IEEE Trans. Eng. Manag. 2022, 71, 12498–12508. [Google Scholar] [CrossRef]
  28. Lawal, M.O. Tomato detection based on modified yolov3 framework. Sci. Rep. 2021, 11, 1447. [Google Scholar] [CrossRef]
  29. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  30. Zhou, Q.; Zhang, W.; Li, R.; Wang, J.; Zhen, S.; Niu, F. Improved yolov5-s object detection method for optical remote sensing images based on contextual transformer. J. Electron. Imaging 2022, 31, 043049. [Google Scholar] [CrossRef]
  31. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  32. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. Uav-yolov8: A small-object-detection model based on improved yolov8 for uav aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  33. Charisis, C.; Nuwayhid, S.; Argyropoulos, D. A novel mask r-cnn-based tracking pipeline for oyster mushroom cluster growth monitoring in time-lapse image datasets. Comput. Electron. Agric. 2025, 237, 110590. [Google Scholar] [CrossRef]
  34. Zhu, W.; Zhang, H.; Eastwood, J.; Qi, X.; Jia, J.; Cao, Y. Concrete crack detection using lightweight attention feature fusion single shot multibox detector. Knowl.-Based Syst. 2023, 261, 110216. [Google Scholar] [CrossRef]
  35. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-time object detection network in uav-vision based on cnn and transformer. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
Figure 1. Network architecture.
Figure 2. Structure of C2f.
Figure 3. Structure of convolutional transformer.
Figure 4. Hierarchical feature fusion extraction module.
Figure 5. Instance from the UAV image dataset.
Figure 6. Precision–recall curve of DODT on the dataset.
Figure 7. Confusion matrix of DODT on the dataset.
Figure 8. Test results of each method under varying lighting and scene conditions.
Figure 9. Instance of special event detection result in a smart city.
Figure 10. Examples of visual test results on the COCO dataset.
Table 1. Hyperparameters for DODT.

Hyperparameter     Value
Image size         640 × 512
Batch size         4
Epochs             300
Weight decay       0.0005
Momentum           0.91
Learning rate      0.01
Optimizer          SGD
Table 2. Comparison with other advanced methods on the UAV image dataset.

Method      Car (%)   Truck (%)   Bus (%)   Van (%)   Fcar (%)   Model Size (M)   mAP (%)   FPS
YOLOv3      90.2      49.8        85.9      57.9      56.1       24.3             66.4      52.4
YOLOv4      91.3      52.5        86.6      58.4      55.3       25.6             68.7      30.3
R-CNN       90.5      50.3        84.3      59.4      61.7       32.3             70.5      8.9
YOLOv5-S    92.3      65.2        88.4      60.7      58.1       17.2             68.4      87.2
YOLOv6-N    92.5      66.4        88.4      60.4      58.4       9.5              69.7      93.2
SSD         89.8      53.3        91.2      63.2      59.9       20.3             72.8      13.7
YOLOv7-TR   91.5      60.7        89.3      58.2      57.6       42.3             75.1      38.0
RT-DETR     94.2      63.5        92.1      61.8      59.3       34.8             77.1      105.3
YOLOv8      93.2      68.8        90.5      58.6      62.2       8.7              74.3      103.1
RTD         94.1      73.2        90.2      62.5      60.3       14.1             75.2      90.3
DODT        95.8      72.9        93.9      64.9      61.1       13.7             77.7      89.7
Table 3. Comparison with other advanced methods on the COCO dataset.

Method     Input Size   mAP0.5 (%)   mAP0.5:0.95 (%)
SSD        512 × 512    48.5         28.8
YOLOv8     608 × 608    52.3         37.3
RTD        608 × 608    54.3         33.6
YOLOv10    640 × 640    54.8         39.2
DODT       640 × 640    60.5         38.9
