Research on a UAV-View Object-Detection Method Based on YOLOv7-Tiny
Abstract
1. Introduction
- Handling multi-scale objects and occlusion: introducing the Varifocal Loss function and replacing the feature-fusion network with a BiFPN structure significantly improves detection of small objects and of objects in densely distributed scenes. This makes the model more robust to changes in object scale and to occlusion, so that it can identify objects accurately in complex environments;
- Improved detection accuracy: the new Partial_C_Detect detection head, combined with Adaptive Kernel Convolution (AKConv), explicitly optimizes the recognition and localization of small objects while reducing false detections against complex backgrounds. AKConv also adapts its sampling shapes and convolution parameters to objects of different sizes and shapes;
- Balanced accuracy and computational cost: the dilation-wise residual (DWR) attention module strengthens the model's feature representation in dynamically changing, cluttered backgrounds. By using computational resources efficiently, the improved algorithm maintains high detection accuracy while running stably on resource-constrained platforms, making it well suited to complex UAV aerial views and preserving the model's applicability and scalability.
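To make the first contribution concrete, the Varifocal Loss from VarifocalNet can be sketched in a few lines. The function below is a minimal pure-Python rendering for a single prediction, with illustrative default weights (alpha = 0.75, gamma = 2.0); it is a sketch of the published formulation, not the authors' implementation:

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal Loss for one prediction.

    p: predicted IoU-aware classification score, in (0, 1)
    q: target score (IoU between the prediction and the ground-truth
       box for positives, 0 for negatives)

    Positives are weighted by the target q itself (so high-quality
    examples dominate training), while negatives are down-weighted
    focal-style by alpha * p**gamma.
    """
    if q > 0:  # positive example: asymmetric, not down-weighted
        return -q * (q * math.log(p) + (1 - q) * math.log(1 - p))
    # negative example: focal-style down-weighting of easy negatives
    return -alpha * (p ** gamma) * math.log(1 - p)

# A confident score on a matching positive incurs a small loss;
# the same confident score on a negative incurs a much larger one.
low = varifocal_loss(0.9, 0.9)
high = varifocal_loss(0.9, 0.0)
```

The asymmetry is the point: unlike plain Focal Loss, positives are never down-weighted, which is why the paper adopts it for small, sparse UAV targets.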
2. Related Work
3. UAV View Object-Detection Method Based on YOLOv7-Tiny
3.1. YOLOv7-Tiny Algorithm
3.2. The Improved YOLOv7-Tiny Algorithm
3.2.1. Varifocal Loss Function
3.2.2. BiFPN Feature Fusion Network
3.2.3. Partial_C_Detect Object-Detection Head
3.2.4. Adaptive Kernel Convolution (AKConv)
3.2.5. Dilation-Wise Residual (DWR) Attention Module
4. Experiments and Results
4.1. Experimental Environment and Hyperparameter Settings
4.2. Data Set
4.3. Performance Indicators
4.4. Experiment
4.4.1. Ablation Experiment
4.4.2. Comparative Experiment
4.4.3. Object-Detection Results
- Lightweight optimization: Techniques such as quantization and knowledge distillation can reduce the model’s computational and memory demands while maintaining high detection accuracy;
- Advanced feature fusion: Incorporating dynamic feature pyramids or context-aware attention modules could improve the model’s capability to detect small objects and overlapping targets more effectively;
- Robust data augmentation: Developing strategies tailored to extreme conditions, such as low-light environments or high-motion scenarios, could enhance the model’s generalization and robustness;
- Global context modeling: Integrating transformer-based modules could expand the model’s receptive field, improving its ability to capture global context in complex and dense scenes.
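Of the directions above, knowledge distillation is the most self-contained to illustrate. The sketch below shows the classic soft-target distillation loss (Hinton-style, with temperature T) in pure Python; it is a hypothetical illustration of the general technique, not part of this paper's method:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the temperature-softened teacher and
    student distributions, scaled by T**2 so gradients keep the same
    magnitude as the hard-label term when T changes."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student))
    return T * T * kl
```

In practice this term is mixed with the ordinary detection loss, letting a compact student (e.g. a YOLOv7-tiny variant) absorb the score distribution of a larger teacher without inheriting its compute cost.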
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Model | VarifocalLoss | BiFPN | Partial_C_Detect | AKConv | DWR | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) | F1-Score (%) | Params (MB) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv7-tiny | × | × | × | × | × | 35.3 | 18.3 | 45.0 | 38.7 | 41.6 | 22.96 | 13.3 |
| A | √ | × | × | × | × | 36.1 | 19.5 | 48.0 | 38.7 | 42.9 | 22.96 | 13.3 |
| B | √ | √ | × | × | × | 36.2 | 19.5 | 47.3 | 38.9 | 42.7 | 23.72 | 13.9 |
| C | √ | √ | √ | × | × | 37.7 | 20.4 | 48.2 | 39.1 | 43.2 | 25.76 | 14.9 |
| D | √ | √ | √ | √ | × | 37.9 | 20.8 | 48.4 | 39.3 | 43.4 | 23.28 | 13.2 |
| E | √ | √ | √ | √ | √ | 38.2 | 21.2 | 49.5 | 39.5 | 43.9 | 28.98 | 16.2 |
| Algorithm | mAP@0.5 (%) | Params (MB) | FLOPs (G) |
|---|---|---|---|
| YOLOv3-tiny | 19.9 | 33.51 | 13.0 |
| YOLOv4-tiny | 25.7 | 22.41 | 14.2 |
| YOLOv5s | 33.3 | 7.0 | 15.9 |
| YOLOv7-tiny | 35.3 | 22.96 | 13.9 |
| YOLOv6m | 31.7 | 34.24 | 82.0 |
| YOLOX | 25.5 | 9.0 | 26.8 |
| PP-YOLOE | 32.4 | 8.9 | 31.8 |
| Faster R-CNN | 22.8 | 136.9 | 180.0 |
| Swin Transformer | 31.6 | 50.0 | 58.0 |
| YOLOv8s | 37.9 | 10.59 | 28.7 |
| YOLOv8n | 31.4 | 2.86 | 8.2 |
| YOLOv8m | 35.9 | 24.65 | 78.7 |
| YOLOv9-c | 37.0 | 24.13 | 102.1 |
| YOLOv9s | 37.6 | 9.6 | 26.7 |
| YOLOv10n | 29.0 | 2.3 | 8.2 |
| BGF-YOLOv10 | 32.0 | 2.0 | 8.6 |
| Ours | 38.2 | 28.98 | 16.2 |
Citation
Miao, Y.; Wang, X.; Zhang, N.; Wang, K.; Shao, L.; Gao, Q. Research on a UAV-View Object-Detection Method Based on YOLOv7-Tiny. Appl. Sci. 2024, 14, 11929. https://doi.org/10.3390/app142411929