Article

Lightweight Object Detector Based on Images Captured Using Unmanned Aerial Vehicle

1 CI Xbot School, Changzhou University, Changzhou 213164, China
2 School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213164, China
3 School of Safety Science and Engineering, Changzhou University, Changzhou 213164, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7482; https://doi.org/10.3390/app15137482
Submission received: 5 June 2025 / Revised: 26 June 2025 / Accepted: 27 June 2025 / Published: 3 July 2025

Abstract

This study addresses the flight endurance constraints that unmanned aerial vehicles (UAVs) face when carrying out filming tasks, the limited computational resources of the mini platforms they carry, and the need for fast decision making and response when processing image data in real time. To meet the need for a lightweight solution when UAVs perform filming tasks, an improved Yolo-CFS model based on Yolov8s is proposed. First, the Bottleneck in C2f is replaced by the FasterNet Block to achieve an overall lightweighting effect; second, to mitigate the model accuracy degradation caused by excessive lightweighting, this study introduces self-weight coordinate attention (SWCA) into the C2f-Faster module connected to each detection head, resulting in the C2f-Faster-SWCA module. The experimental results show that the number of parameters in the Yolo-CFS model is decreased by 17.4% with respect to the baseline on the Visdrone2019 dataset, while its average accuracy remains at 40.1%. In summary, the Yolo-CFS model reduces the number of parameters and the model complexity while maintaining accuracy, facilitating its application in mobile deployment scenarios.

1. Introduction

The technology of unmanned aerial vehicles (UAVs), characterized by low cost and high efficiency, has substantially widened the scope and boundaries of UAV applications. As UAV technology matures and gains widespread adoption, its fields of application are continually expanding and diversifying. Aerial photography, as one of the earliest and most sophisticated applications of UAVs, continues to lead the way in industrial innovation and development, offering unparalleled visual experiences and data support in sectors such as agricultural production [1], environmental protection [2], safety and security [3], and rural environments [4].
The application of UAVs in petrochemical and power grid inspection is becoming increasingly widespread [5], as the efficient, convenient, and flexible characteristics of UAVs have significantly improved the quality and efficiency of inspection tasks. In the petrochemical field, UAVs are capable of flying over complex terrain and reaching areas that are difficult to access, carrying out high-definition filming and the infrared scanning of key facilities such as oil pipelines and storage tanks. In power grid inspections, faced with vast transmission lines and dense substations, UAVs can swiftly identify potential faults, such as line aging and insulator breakage.
Against the backdrop of rapid advancements in UAV technology, UAV aerial photography target recognition models, as key technologies, are gradually penetrating various fields, including environmental monitoring, agricultural management, emergency rescue, and film and television production. Men et al. [6] made a model more lightweight by introducing the FasterNet structure. Yao et al. [7] introduced an SCConv module into C2f to design an improved C2f-SCConv module, reducing feature redundancy in the spatial and channel dimensions and lowering the computational load of the model. Weng et al. [8] improved the network by creating a module with cross-stage local-connection feature extraction and feature fusion capabilities. Ye et al. [9] reconstructed the Yolov8 backbone network by introducing Deformable Convolution v4 and Swin Transformer encoder structures to enhance its feature transfer and extraction capabilities and improve its sensitivity to small objects. Zhou et al. [10] combined an existing fast single-frame detection method with the spatio-temporal relationships of video sequences to design an efficient lightweight object detection model; however, the inference speed was only around 20 fps. Zhou and Liao [11] introduced a BiFormer attention mechanism module in the backbone and head networks to enhance the algorithm's ability to detect small objects. Although most studies have focused on improving recognition accuracy by optimizing the algorithm's structure and its feature extraction and fusion capabilities, they tend to ignore the uniqueness and environmental constraints of UAV applications, which introduces viability and deployment efficiency problems for the resulting models. Therefore, real-time, low-latency object detection algorithms are crucial. Traditional high-performance models are highly accurate, but their high computational complexity and long processing times make it difficult to meet the rapid response demands of UAVs [12]. To achieve this goal, it is imperative to explore more efficient algorithm designs, such as adopting lightweight network structures, optimizing the size and number of convolutional kernels, and introducing quantization and pruning techniques to reduce the number of model parameters and the computational load. This is expected to enable significant gains in processing speed and reduced latency while maintaining a certain level of accuracy. In addition to optimization at the technical level, the robustness and adaptability of UAV-based [13] aerial object recognition models in practical applications should also be considered. Furthermore, optimizing the deployment process of the model is crucial to ensure its stable and reliable operation under the resource-constrained conditions of UAVs. The main contributions of this study are as follows:
1. Replacing Bottleneck [14] with the FasterNet Block in the C2f module yields the C2f-Faster module, which uses PConv in place of the original convolution to reduce memory access and the number of model parameters. This allows the model to be lightweighted as a whole.
2. To maintain the overall performance of the model without a significant accuracy loss caused by excessive lightweighting, self-weight coordinate attention (SWCA), an attention module built on coordinate attention, is proposed. SWCA is combined with C2f-Faster to form the C2f-Faster-SWCA module, which augments the original attention weights with secondary weights to improve the performance of the model.
In this study, the relevant contents of model improvement are introduced in detail in Section 2. Then, the effectiveness of the improved content is verified using experiments, as outlined in Section 3. Finally, this study is summarized in Section 4.

2. Materials and Methods

2.1. Materials

2.1.1. Object Detection

The main challenge of small object detection is that small objects not only have low saliency relative to the background but also contain less information due to their small size, which makes their recognition and localization more difficult [15]. Small objects are often surrounded by complex backgrounds and are susceptible to interference from background noise. In addition, the relatively low number of small-object samples in datasets often leads to overfitting or underfitting during model training. To address these challenges, researchers have proposed a variety of methods, including increasing data samples, refining network structures, and using multiscale feature fusion. Representative traditional object detection algorithms include Viola–Jones [16], Histograms of Oriented Gradients [17], and the Deformable Part Model [18]. With the continuous development of deep learning technology, especially the broad application of Faster R-CNN [19] and models such as the Transformer [20] and the Yolo series [21,22], the performance of small object detection has improved significantly. Yolov8 is a member of the YOLO (You Only Look Once) family. Its core design concept is to perform rapid and accurate object detection directly on the input image through a single neural network. The backbone network of Yolov8 extracts multiscale feature representations from the input image and incorporates the C2f module, which borrows ideas from both ResNet and ELAN [23]: instead of the C3 module, its branching part retains the input-layer channels and integrates them with the features of the trunk, while stacked Bottleneck [24] modules progressively deepen the network. In the neck, Yolov8 employs a hybrid structure combining a PAN (Path Aggregation Network) and an FPN (Feature Pyramid Network) to effectively aggregate the multiscale features extracted by the backbone network. Finally, three detection heads of different sizes, corresponding to different scales of feature maps generated by the backbone network, are used to detect objects of different sizes within an image. During prediction, Yolov8 adopts Distribution Focal Loss (DFL) [25] as the classification loss function and CIoU (Complete Intersection over Union) [26] as the localization loss function.
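For reference in the modifications described in Section 2.2, the following is a minimal PyTorch sketch of the C2f block as described above (a 1 × 1 split convolution, stacked Bottlenecks whose outputs are all retained, and a final 1 × 1 fusion convolution). It follows the publicly documented Yolov8 design only loosely; kernel sizes, the hidden-channel ratio, and the shortcut setting are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard convolution + batch norm + SiLU, as used throughout Yolov8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=3)
        self.cv2 = ConvBNSiLU(c, c, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """C2f: split the features, run n Bottlenecks, concatenate every branch."""
    def __init__(self, c_in, c_out, n=1, shortcut=False):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, k=1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two halves
        for block in self.m:
            y.append(block(y[-1]))               # each Bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))     # fuse all retained branches

if __name__ == "__main__":
    print(C2f(64, 64, n=2)(torch.randn(1, 64, 80, 80)).shape)  # -> (1, 64, 80, 80)
```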

2.1.2. Fasternet Block

FasterNet exploits the similarities (i.e., redundancy) between feature maps to improve feature extraction efficiency and optimize the overall model [27]. Its core component, partial convolution (PConv), treats only a portion of the input feature map channels as representative of the entire feature map for spatial feature extraction, while keeping the remaining channels unchanged; these retained channels are then used in the subsequent pointwise convolution (PWConv) layers, which allows feature information to flow through all channels and ensures the completeness of the feature information. In traditional convolution, the convolution kernel slides over all channels of the entire input feature map, whereas PConv convolves only a subset of the input channels and passes the remaining channels directly to the output. In other words, PConv divides the channels of the input feature map into two parts: one part is used for the convolution calculation, and the other remains unchanged, which allows PConv to effectively extract spatial features while reducing computation and memory access. In the FasterNet Block, the PConv layer is usually followed by two pointwise convolution (PWConv) layers, forming an inverted residual block structure. This structure enables FasterNet to reduce computation and memory access while maintaining high-precision feature extraction capabilities. The structure of the PConv convolution is illustrated in Figure 1.
We assume that the input and output feature maps have the same number of channels, denoted as $c$; $w$ and $h$ are the width and height of the feature map, $k$ is the convolution kernel size, and $c_p$ is the number of channels on which the partial convolution operates. The FLOPs of PConv are as follows:
$h \times w \times k^2 \times c_p^2$    (1)
When the ratio $r = c_p / c$ is 1/4, the FLOPs of partial convolution are only 1/16 of those of a regular convolution, while the memory access of partial convolution is as follows:
$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$    (2)
Therefore, partial convolution requires only 1/4 of the memory access of normal convolution.
Figure 2 illustrates the overall design of the FasterNet Block module, which incorporates the concepts of partial convolution (PConv) and the inverted residual block. The PConv layer is immediately followed by two pointwise (1 × 1) convolutional layers, which form the inverted residual part for further processing of the features. Normalization and activation layers are used only after the intermediate convolutional layer, which helps to preserve the diversity of the features and prevents vanishing or exploding gradients during training.
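To make the description above concrete, the following is a minimal PyTorch sketch of partial convolution and the FasterNet Block under the assumptions stated in this section: PConv operates on the first $c_p = c/4$ channels, and the block places batch normalization and an activation only after the intermediate pointwise convolution. The expansion factor and activation choice are assumptions, not details taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv on the first c_p = c/ratio channels,
    while the remaining channels are passed through untouched."""
    def __init__(self, c, ratio=4):
        super().__init__()
        self.c_p = c // ratio
        self.conv = nn.Conv2d(self.c_p, self.c_p, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by two pointwise (1x1) convolutions, arranged as an
    inverted residual block with BN + activation only in the middle."""
    def __init__(self, c, expansion=2):
        super().__init__()
        hidden = c * expansion
        self.pconv = PConv(c)
        self.pw1 = nn.Conv2d(c, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU()
        self.pw2 = nn.Conv2d(hidden, c, 1, bias=False)

    def forward(self, x):
        y = self.pconv(x)
        y = self.pw2(self.act(self.bn(self.pw1(y))))
        return x + y  # residual connection

if __name__ == "__main__":
    print(FasterNetBlock(64)(torch.randn(1, 64, 80, 80)).shape)  # (1, 64, 80, 80)
```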

2.1.3. Coordinate Attention

In the realm of deep learning, traditional attention mechanisms, such as self-attention, have produced remarkable results in capturing dependencies within sequences or images. However, they often overlook the significance of location information. To address this shortcoming, Hou et al. [28] introduced coordinate attention, which further enhances the channel attention mechanism by decomposing it into two independent one-dimensional feature encoding processes that aggregate features along the spatial dimensions (height and width). This enables the model to capture long-range dependencies along one spatial direction while preserving precise positional information along the other. The coordinate attention structure is illustrated in Figure 3.
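A compact PyTorch sketch of coordinate attention as described above follows (1D average pooling along the height and width, a shared 1 × 1 encoding convolution on the concatenated descriptors, and two sigmoid-gated re-weighting branches); the reduction ratio and activation function are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along H and W separately, encode jointly,
    then re-weight the input with direction-aware attention maps."""
    def __init__(self, c, reduction=32):
        super().__init__()
        hidden = max(8, c // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(c, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(hidden, c, 1)
        self.conv_w = nn.Conv2d(hidden, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                               # (B, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)           # (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                       # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w)).permute(0, 1, 3, 2)   # (B, C, 1, W)
        return x * g_h * g_w

if __name__ == "__main__":
    print(CoordinateAttention(64)(torch.randn(1, 64, 40, 40)).shape)
```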

2.2. Methods

2.2.1. Improvements in the Yolov8 Model

Although the Yolov8s model exhibits outstanding comprehensive performance within the Yolov8 series, there is still potential for further improvement. In this study, we replace Bottleneck with the FasterNet Block within the C2f framework to reduce the model's overall parameter count and computational complexity, and we introduce the self-weight coordinate attention mechanism to propose C2f-Faster-SWCA (CFS), aiming to minimize computational overhead while maintaining model performance. The network architecture of Yolo-CFS is shown in Figure 4, and the implementation details of C2f-Faster and C2f-Faster-SWCA are described in Section 2.2.2, Section 2.2.3 and Section 2.2.4.

2.2.2. C2f Module in Conjunction with FasterNet

To better deploy the object detection model within a UAV's limited computational resources, we introduce the C2f-Faster module, which effectively reduces the number of parameters and memory accesses, thereby lowering the model's complexity. The core improvement of C2f-Faster lies in replacing the Bottleneck in C2f with the Faster Block from FasterNet, which significantly decreases the number of model parameters and memory accesses. Partial convolution (PConv), as a lightweight variant of convolution, enhances computational efficiency by operating on only part of the input channels, thereby avoiding the ineffective or repetitive computations that are prevalent in traditional convolution operations. In the FasterNet Block, the application of PConv not only streamlines the feature extraction process but also makes the network structure more compact, reducing unnecessary parameters and computations.
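Reusing the ConvBNSiLU and FasterNetBlock sketches given earlier, the C2f-Faster module can be sketched as the C2f layout with its Bottlenecks swapped for FasterNet Blocks; this is an illustrative composition consistent with the description above, not the authors' exact code.

```python
import torch
import torch.nn as nn

class C2fFaster(nn.Module):
    """C2f-Faster: the C2f layout with every Bottleneck replaced by a FasterNet Block
    (ConvBNSiLU and FasterNetBlock refer to the sketches defined earlier)."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, k=1)
        self.m = nn.ModuleList(FasterNetBlock(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split features into two halves
        for block in self.m:
            y.append(block(y[-1]))              # FasterNet Blocks instead of Bottlenecks
        return self.cv2(torch.cat(y, dim=1))    # concatenate and fuse, as in C2f
```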

2.2.3. Self-Weight Coordinate Attention

Coordinate attention (CA), as a lightweight attention mechanism, embeds positional information into channel attention to enhance the model's ability to capture long-distance dependencies. To improve the sensitivity of coordinate attention to spatial information and the accuracy of detecting important objects at different locations, this study proposes an improved self-weight coordinate attention (SWCA). SWCA improves attention performance by applying an additional convolution and sigmoid operation to the merged feature maps in the width and height directions. The additional convolutional layer transforms the pooled features to enhance the expressive power of the attention weights, captures dependencies between adjacent positions, and corrects the spatial information fragmentation caused by the pooling operations of the coordinate attention mechanism. The sigmoid activation function is then applied to the convolved feature map, compressing each element to a value between 0 and 1; this removes the scale differences of the original weights across channels and positions and ensures that attention intensities are comparable across spatial positions when generating attention weights. The resulting secondary attention weights calibrate the preliminary attention weights by additionally reflecting the importance of different positions, improving the attention weighting capacity of the model. The SWCA structural diagram is shown in Figure 5:
The input feature map $x_c$ is average-pooled along the horizontal and vertical directions to obtain feature maps $z_c^h$ and $z_c^w$, corresponding to the height and width, respectively. One of these captures long-range dependencies along one spatial direction while retaining precise positional information along the other, which helps the network to locate the objects of interest. The computational formulas are as follows:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$    (3)
$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$    (4)
The height and width feature maps are concatenated along the spatial dimension, and a convolution operation is performed to obtain a feature map containing global information in both the height and width dimensions, as shown in Equation (5):
$f = \delta\left(\mathrm{Conv}\left(\left[z_c^h, z_c^w\right]\right)\right)$    (5)
where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension, $\delta$ denotes a nonlinear activation function, $f \in \mathbb{R}^{C/r \times (H + W)}$ is an intermediate feature map that encodes spatial information both horizontally and vertically, and $r$ is a reduction ratio that controls the block size. Then, $f$ is split along the spatial dimension into two independent tensors, $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. Next, two 1 × 1 convolutions transform $f^h$ and $f^w$ into tensors with the same number of channels as the input $x$, as shown in the following:
$g^h = \sigma\left(\mathrm{Conv}\left(f^h\right)\right)$    (6)
$g^w = \sigma\left(\mathrm{Conv}\left(f^w\right)\right)$    (7)
where $\sigma$ is the sigmoid function, and the outputs $g^h$ and $g^w$ are used as the preliminary attention weights. After obtaining the feature map $f$, which contains global information with respect to both the height and width dimensions, an additional convolution and sigmoid operation are applied to obtain $f^{HW}$. This is then split along the spatial dimension into two independent tensors, $\hat{g}^h$ and $\hat{g}^w$, which serve as secondary weights for calibrating the preliminary weights:
$f^{HW} = \sigma\left(\mathrm{Conv}\left(f\right)\right)$    (8)
Finally, the preliminary and secondary weights are combined and applied to the original input, and the final output is shown in Equation (9):
$y_c(i, j) = x_c(i, j) \times \left(g_c^h(i) \times \hat{g}_c^h(i)\right) \times \left(g_c^w(j) \times \hat{g}_c^w(j)\right)$    (9)
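A minimal PyTorch sketch of SWCA following Equations (3)–(9) is given below: the coordinate-attention pooling yields the preliminary weights $g^h$ and $g^w$, an additional convolution and sigmoid on the shared feature map $f$ yield the secondary weights $\hat{g}^h$ and $\hat{g}^w$, and both sets of weights re-weight the input as in Equation (9). The kernel size of the additional convolution and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SWCA(nn.Module):
    """Self-weight coordinate attention (Equations (3)-(9)): coordinate-attention
    pooling and preliminary weights, plus an extra conv + sigmoid branch on the
    shared feature map f that produces secondary calibration weights."""
    def __init__(self, c, reduction=32):
        super().__init__()
        hidden = max(8, c // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # Eq. (3): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # Eq. (4): average over H
        self.conv1 = nn.Conv2d(c, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()                        # delta in Eq. (5)
        self.conv_h = nn.Conv2d(hidden, c, 1)            # Eq. (6)
        self.conv_w = nn.Conv2d(hidden, c, 1)            # Eq. (7)
        self.conv_sw = nn.Conv2d(hidden, c, 3, 1, 1)     # extra conv in Eq. (8)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                              # (B, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # Eq. (5)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        # preliminary attention weights, Eqs. (6)-(7)
        g_h = torch.sigmoid(self.conv_h(f_h))                       # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w)).permute(0, 1, 3, 2)   # (B, C, 1, W)
        # secondary (self-)weights from the extra conv + sigmoid, Eq. (8)
        f_hw = torch.sigmoid(self.conv_sw(f))
        s_h, s_w = torch.split(f_hw, [h, w], dim=2)
        s_w = s_w.permute(0, 1, 3, 2)                                # (B, C, 1, W)
        # Eq. (9): combine preliminary and secondary weights with the input
        return x * (g_h * s_h) * (g_w * s_w)

if __name__ == "__main__":
    print(SWCA(64)(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 40, 40)
```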

2.2.4. C2f Module Incorporating Lightweight Self-Weight Coordinate Attention

To balance the accuracy degradation caused by lightweighting against the increase in model complexity required to recover accuracy, this study designs a C2f-Faster-SWCA module by integrating the C2f-Faster module with the SWCA module. The module is introduced before each detection head to enhance the detection of key objects through the attention mechanism after feature fusion. The FasterNet blocks reduce the number of model parameters, while the self-weight coordinate attention (SWCA) module stabilizes the overall accuracy of the model. The structure of C2f-Faster-SWCA is shown in Figure 6:
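Reusing the C2fFaster and SWCA sketches above, one plausible composition of the C2f-Faster-SWCA module (the lightweight C2f variant followed by the attention block, placed before a detection head) is sketched below; the caption of Figure 6 suggests the SWCA could instead sit inside each Faster block, so this should be read as an illustrative arrangement rather than the authors' exact design.

```python
import torch.nn as nn

class C2fFasterSWCA(nn.Module):
    """C2f-Faster followed by self-weight coordinate attention, as used before each
    detection head after feature fusion (built from the sketches defined earlier)."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c2f_faster = C2fFaster(c_in, c_out, n=n)  # lightweight feature extraction
        self.swca = SWCA(c_out)                        # re-weight the fused features

    def forward(self, x):
        return self.swca(self.c2f_faster(x))
```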

3. Results

3.1. Dataset and Experimental Environment

The publicly available dataset used in this study was Visdrone2019 [29], which consists of 8599 high-quality images captured by UAVs. This dataset was carefully divided into three subsets: the training set contained 6471 images, the validation set contained 548 images, and the test set contained 1580 images.

3.2. Indicators for Model Evaluation

Commonly used optimizers for model training include SGD [30], Adam [31], RMSProp [32], etc. The performance of an optimizer affects both the convergence speed and stability of the training process. In the experiments, the SGD optimizer was utilized to expedite the model’s convergence. To ensure fairness, each experiment was conducted for 450 epochs. The experimental hardware and software environments are detailed in Table 1.
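For reference, a hedged sketch of an SGD training setup in PyTorch is given below; the learning rate, momentum, weight decay, and cosine schedule shown here are common YOLO-style defaults assumed for illustration, not values reported in the paper.

```python
import torch
import torch.nn as nn

# stand-in module; in practice this would be the Yolo-CFS detector
model = nn.Conv2d(3, 16, 3)

# SGD with momentum; lr, momentum, weight decay, and Nesterov are assumed
# YOLO-style defaults, not values reported in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=5e-4, nesterov=True)

# an assumed cosine decay over the 450-epoch budget used in the experiments
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=450)

for epoch in range(450):
    # ... forward/backward passes over the Visdrone2019 training set go here ...
    scheduler.step()
```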
When evaluating the model’s detection effectiveness for small objects, it is important to note that small objects tend to have a lower pixel percentage and present more complex background interference. In order to comprehensively and accurately measure the detection performance of the model, the mean average precision (mAP) is often adopted as the core evaluation metric. Specifically, mAP@50, which sets the intersection-over-union (IoU) threshold to 0.5, focuses on evaluating the model’s detection ability under a more relaxed matching criterion. However, this may not sufficiently reflect the model’s performance in dealing with small objects and complex scenes. Therefore, to more rigorously evaluate the model’s detection accuracy for small objects, we used mAP@50:95, which calculates the average of AP over ten IoU thresholds ranging from 0.5 to 0.95 (with a step size of 0.05). This metric better reflects the model’s ability to handle small objects and finely tuned detection tasks.
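The relationship between the two metrics can be made explicit with a short sketch: mAP@50:95 is simply the mean of the AP values computed at the ten IoU thresholds 0.50, 0.55, …, 0.95, whereas mAP@50 is the value at the first threshold alone. The per-threshold AP values below are made-up placeholders, not results from this study.

```python
import numpy as np

# the ten IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@50:95
iou_thresholds = np.arange(0.50, 1.00, 0.05)

# placeholder per-threshold AP values, for illustration only (not real results)
ap_per_threshold = np.array([0.40, 0.37, 0.34, 0.31, 0.28,
                             0.24, 0.19, 0.14, 0.08, 0.03])

map50 = ap_per_threshold[0]          # AP at the single IoU threshold of 0.5
map50_95 = ap_per_threshold.mean()   # average AP over all ten thresholds

print(f"mAP@50 = {map50:.3f}, mAP@50:95 = {map50_95:.3f}")
```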

3.3. Experiments and Analysis of Results

3.3.1. Comparative Experiments on Model Lightweighting

The Bottleneck in the C2f module was replaced with the FasterNet Block, and experiments were conducted to investigate the impact of the C2f-Faster module on each component of the model (Backbone and Neck). Each experiment was based on the original model, with the C2f-Faster module replacing C2f in a different part. The results of these experiments are presented in Table 2. Specifically, “C2f-Faster (Backbone)” indicates the replacement of C2f in the Backbone, “C2f-Faster (Neck)” indicates the replacement of C2f in the Neck, and “C2f-Faster (All)” refers to the replacement of C2f throughout the entire model.
In Yolov8 series models, the backbone network extracts multiscale features from the input image, whereas the neck network performs multiscale feature fusion. After lightweighting, the detection model may lose focus on the originally extracted features, reducing model performance and accuracy. As Table 2 shows, the impact of the lightweight module on the multiscale fusion part of the neck network is smaller than its impact on the multiscale feature extraction in the backbone network. Considering the overall model, we replaced every C2f module with the C2f-Faster module, taking into account both the module's effect on accuracy and the reduction in parameters, thereby balancing the conflict between improving parameter efficiency and maintaining accuracy. Compared with lightweighting only the backbone network, lightweighting the entire network achieved the same model accuracy while further reducing the number of parameters. Therefore, this study lightweighted the entire model.

3.3.2. Comprehensive Comparison Experiments

To demonstrate the ability of Yolo-CFS to maintain accuracy while reducing the number of parameters, comprehensive experiments were conducted using state-of-the-art object detection models as comparison models. Among these, models with parameter counts similar to that of Yolo-CFS were selected for accuracy comparisons, and models with accuracy similar to that of Yolo-CFS were selected for comparisons of parameter counts. Under the same experimental environment and model configuration, detection was performed on the Visdrone2019 dataset, and the results of the model comparison are detailed in Table 3.
The data show that, for the same input image size, when compared with high-accuracy models whose accuracy is similar to that of Yolo-CFS, such as YoloX-X and Swin-T with mAP50 values above 40%, the Yolo-CFS model shows a clear advantage: its parameter count is only 8.61 M, compared with 99.1 M for YoloX-X and 38.6 M for Swin-T. When compared with models of similar parameter counts, such as Yolov3-tiny and Yolov5s, which have 8.68 M and 7.03 M parameters and mAP50 values of 13.5 and 26.4, respectively, Yolo-CFS achieves an mAP50 of 40.0 with only 8.61 M parameters. Compared with the Yolov8s baseline, the average accuracy is still maintained at 40.0, while the number of parameters is reduced by 2.52 M.

3.3.3. Ablation Experiments

To verify the effectiveness of each proposed module within the overall model, ablation experiments were conducted using Yolov8s as the baseline. Specifically, we evaluated the lightweight C2f-Faster module, the C2f-CA and C2f-SWCA attention variants, and the C2f-Faster-SWCA module, with the final module combining the lightweight C2f-Faster and SWCA modules. The ablation results are shown in Table 4.
The data indicate that, after introducing the C2f-Faster module, the overall number of model parameters was reduced by 25.4%, but mAP50 decreased by 0.6%. Compared with introducing coordinate attention (C2f-CA), the C2f-SWCA module further increased mAP50 by 0.2%. The C2f-Faster-SWCA module mitigated the accuracy drop caused by the C2f-Faster module during overall lightweighting, and the improved model's accuracy matched that of the baseline model. Notably, compared with the C2f-Faster-only model, the number of parameters increased by only 4%, demonstrating that the overall model maintained its accuracy without a significant decrease during the lightweighting process.

4. Conclusions

Given the high demand for fast responses and low processing times in object detection tasks performed on the small platforms carried by UAVs, a small object detection model based on an optimized YOLOv8s architecture, called YOLO-CFS, was proposed. This model contains two key improvements over the original model: first, in the C2f module, the FasterNet Block is used instead of the Bottleneck module, effectively reducing the model's complexity and computational burden; second, to mitigate the potential accuracy loss caused by the lightweight design, this study fuses the self-weight coordinate attention (SWCA) module with the improved C2f-Faster module. Experimental verification carried out on the Visdrone2019 dataset showed that, compared with other mainstream models, the YOLO-CFS model achieves a significant reduction in the number of parameters while maintaining comparable accuracy. This result demonstrates the effectiveness and viability of the proposed improvement strategy, providing a more efficient and accurate solution for object detection tasks on the small platforms carried by UAVs.

5. Discussion

The VisDrone dataset mainly covers urban surveillance scenarios from the perspective of UAVs, which are relatively narrow in scope and lack diversity, limiting the generalization ability of the model in wider and more complex environments. To overcome this limitation, future research will focus on enriching and expanding the dataset, including the creation of dedicated datasets for different scenarios, such as (but not limited to) natural environments, industrial scenes, and traffic arteries, with the goal of enhancing the adaptability and robustness of the model across scenarios by increasing the diversity of the data used. Secondly, future work will focus on the frontier of cross-modal fusion. In complex scenarios, traditional single-modal detection methods often lack accuracy due to low resolution, sparse features, and susceptibility to background interference. Cross-modal fusion technology can significantly enhance the small-target detection performance of UAVs by enabling feature complementation, improving anti-jamming capability, and expanding feature dimensions; it allows information from different modalities (e.g., images, videos, audio, and text) to be integrated for a more comprehensive understanding and analysis of the data.

Author Contributions

D.C.: Methodology, formal analysis, and writing—review and editing. J.S.: Software, writing—original draft. J.Z.: Writing—review and editing. H.W.: Supervision, funding acquisition, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 61976028.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Visdrone2019: https://github.com/VisDrone/VisDrone-Dataset (accessed on 23 March 2024).

Acknowledgments

The authors would like to express their sincere gratitude to Changzhou University for generously providing the experimental venue. Special thanks also go to CI Xbot School, School of Computer Science and Artificial Intelligence, and School of Safety Science and Engineering for their invaluable support and assistance. Moreover, the authors are deeply indebted to all colleagues in the laboratory for their unwavering support and help throughout the research process.

Conflicts of Interest

The authors declare that they have no competing financial or non-financial interests related to the work submitted in this publication.

Abbreviations

The following abbreviations are used in this manuscript:
CFS: C2f-Faster self-weight coordinate attention;
CA: Coordinate attention;
SWCA: Self-weight coordinate attention.

References

  1. Stefas, N.; Bayram, H.; Isler, V. Vision-based UAV navigation in orchards. IFAC-PapersOnLine 2016, 49, 10–15. [Google Scholar] [CrossRef]
  2. Asadzadeh, S.; de Oliveira, W.J.; de Souza Filho, C.R. UAV-based remote sensing for the petroleum industry and environmental monitoring: State-of-the-art and perspectives. J. Pet. Sci. Eng. 2022, 208, 109633. [Google Scholar] [CrossRef]
  3. Cho, J.; Lim, G.; Biobaku, T.; Kim, S.; Parsaei, H. Safety and security management with unmanned aerial vehicle (UAV) in oil and gas industry. Procedia Manuf. 2015, 3, 1343–1349. [Google Scholar] [CrossRef]
  4. Lu, J.; Liu, Y.; Jiang, C.; Wu, W. Truck-drone joint delivery network for rural area: Optimization and implications. Transp. Policy 2025, 163, 273–284. [Google Scholar] [CrossRef]
  5. Yang, L. The development status and future trend of China’s civil UAV industry. China Secur. Prot. 2022, 12, 15–18. [Google Scholar]
  6. Men, D.; Tan, Q. Improved personnel detection of aerial images based on YOLOv8. Laser J. 2025, 46, 112–118. [Google Scholar]
  7. Yao, J.; Cheng, G.; Wan, F.; Zhu, D. Improved Lightweight Bearing Defect Detection Algorithm of YOLOv8. Comput. Eng. Appl. 2024, 60, 205–214. [Google Scholar]
  8. Weng, Z.; Liu, H.; Zheng, Z. CSD-YOLOv8s: Dense Sheep Small Target Detection Model Based on UAV Images. Smart Agric. 2024, 6, 42. [Google Scholar]
  9. Ye, D.; Jing, J.; Zhang, Z.; Li, H.; Wu, H.; Xie, L. MSH-YOLOv8: Mushroom Small Object Detection Method with Scale Reconstruction and Fusion. Smart Agric. 2024, 6, 139. [Google Scholar]
  10. Zhou, P.; Liu, G.; Wang, J.; Weng, Q.; Zhang, K.; Zhou, Z. Lightweight unmanned aerial vehicle video object detection based on spatial-temporal correlation. Int. J. Commun. Syst. 2022, 35, e5334. [Google Scholar] [CrossRef]
  11. Zhou, Y.; Liao, B. Foreign object detection in transmission lines based on improved YOLOv7 algorithm. J. North China Electr. Power Univ. 2024, 1–9. [Google Scholar]
  12. Liu, L.; Zhang, S.; Bai, Y.; Li, Y.; Zhang, C. Improved light-weight military aircraft detection algorithm of YOLOv8. J. Comput. Eng. Appl. 2024, 60, 114–125. [Google Scholar]
  13. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 2778–2788. [Google Scholar]
  14. Chen, J.; Ma, A.; Huang, L.; Li, H.; Zhang, H.; Huang, Y.; Zhu, T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612. [Google Scholar] [CrossRef]
  15. Wu, M.; Yun, L.; Chen, Z.; Zhong, T. Improved YOLOv5s small object detection algorithm in UAV view. J. Comput. Eng. Appl. 2024, 60, 191–199. [Google Scholar]
  16. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. I–I. [Google Scholar]
  17. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  18. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, 23–28 June 2008; pp. 1–8. [Google Scholar]
  19. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  20. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  23. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 649–667. [Google Scholar]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  26. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  27. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  29. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  30. Hardt, M.; Recht, B.; Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1225–1234. [Google Scholar]
  31. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  32. Feng, Y.; Li, Y. An overview of deep learning optimization methods and learning rate attenuation methods. Hans J. Data Min. 2018, 8, 186–200. [Google Scholar] [CrossRef]
Figure 1. Convolutional structure of PConv.
Figure 2. Structure of FasterNet Block.
Figure 3. Coordinate attention structure.
Figure 4. Overall structure of Yolo-CFS (Yolo-C2f-Faster self-weight coordinate attention).
Figure 5. Structure of self-weight coordinate attention.
Figure 6. Structure of C2f-Faster-SWCA (FasterNet-SWCA replaces Bottleneck).
Table 1. Hardware and software environments used for model training and evaluation.

Name | Detailed Information
Operating system | Ubuntu 16.04
Graphics card | NVIDIA RTX 2080ti
CUDA | 10.2
Deep learning framework | Pytorch 1.12.0
Language | Python 3.8
Table 2. Experimental comparison when introducing C2f-Faster into different parts of the model.

Model | mAP50 | mAP50:95 | GFLOPs | Params/M
Baseline | 39.9 | 23.8 | 28.5 | 11.13
C2f-Faster (Backbone) | 39.3 | 23.4 | 21.4 | 8.30
C2f-Faster (Neck) | 40.0 | 24.1 | 25.6 | 9.75
C2f-Faster (All) | 39.3 | 23.8 | 21.4 | 8.10
Table 3. Performance comparison of Yolo-CFS with other object detection models on the Visdrone2019 dataset.

Model | Image Size | Params/M | mAP50 | mAP50:95
Yolov3-tiny | 640 × 640 | 8.68 | 13.5 | 5.8
Yolov5s | 640 × 640 | 7.03 | 26.4 | 14.2
VA-Yolo | 640 × 640 | 6.56 | 23.6 | 12.6
PP-Yolo | 640 × 640 | 52.2 | 39.6 | 24.6
FasterRCNN ResNeXt101 | 640 × 640 |  | 40.2 | 22.6
YoloX-X | 640 × 640 | 99.1 | 43.2 | 25.8
Swin-T | 640 × 640 | 38.6 | 42.5 | 23.1
DDETR | 640 × 640 | 39.8 | 42.7 | 24.8
Yolov8s | 640 × 640 | 11.13 | 39.9 | 23.8
Yolo-CFS | 640 × 640 | 8.61 | 40.0 | 23.8
Table 4. Impact of individual modules (C2f-Faster and SWCA) on model performance relative to the baseline (Yolov8s).

Model | mAP50 | mAP50:95 | GFLOPs | Params/M
Baseline | 39.9 | 23.8 | 28.5 | 11.13
+C2f-Faster | 39.3 | 23.8 | 21.4 | 8.30
+C2f-CA | 40.4 | 24.4 | 28.4 | 11.15
+C2f-SWCA | 40.6 | 24.4 | 28.5 | 11.56
+C2f-Faster-SWCA | 40.0 | 23.8 | 21.3 | 8.61
