YOLOv7-UAV: An Unmanned Aerial Vehicle Image Object Detection Algorithm Based on Improved YOLOv7

Abstract: Detecting small objects in aerial images captured by unmanned aerial vehicles (UAVs) is challenging due to their complex backgrounds and the presence of densely arranged yet sparsely distributed small targets. In this paper, we propose a real-time small object detection algorithm called YOLOv7-UAV, which is specifically designed for UAV-captured aerial images. Our approach builds upon the YOLOv7 algorithm and introduces several improvements: (i) removal of the second downsampling layer and the deepest detection head to reduce the model's receptive field and preserve fine-grained feature information; (ii) introduction of the DpSPPF module, a spatial pyramid network that utilizes concatenated small-sized max-pooling layers and depth-wise separable convolutions to extract feature information across different scales more effectively; (iii) optimization of the K-means algorithm, leading to the development of the binary K-means anchor generation algorithm for anchor allocation; and (iv) utilization of the weighted normalized Gaussian Wasserstein distance (NWD) and intersection over union (IoU) as indicators for positive and negative sample assignments. The experimental results demonstrate that YOLOv7-UAV achieves a real-time detection speed that surpasses YOLOv7 by at least 27% while significantly reducing the number of parameters and GFLOPs to 8.3% and 73.3% of YOLOv7, respectively. Additionally, YOLOv7-UAV outperforms YOLOv7 with improvements in the mean average precision (mAP (0.5:0.95)) of 2.89% and 4.30% on the VisDrone2019 and TinyPerson datasets, respectively.


Introduction
With the decrease in the cost of drones, the civilian drone market has entered a period of rapid development. At the same time, target detection technology based on deep learning has made remarkable progress in recent years, which has brought drones and target detection technology closer together. The integration of the two can play an important role in many fields, such as crop detection [1], intelligent transportation [2], and search and rescue [3]. However, most target detection models are designed based on natural scene image datasets, and there are significant differences between natural scene images and drone aerial images. This makes it a meaningful and challenging task to design a target detection model specifically suited to the aerial drone perspective.
In practical application scenarios, real-time target detection on the unmanned aerial vehicle (UAV) aerial video stream places a high demand on the detection speed of the algorithm model. Furthermore, unlike natural scene images, aerial images contain a large number of small targets because of the high altitude of UAV flights, and there are fewer extractable features for these targets. In addition, the UAV's actual flight altitude often varies greatly, leading to drastic changes in object proportions and a low detection accuracy. Finally, complex scenes are often encountered during actual flight shooting, and there may be a large amount of occlusion between densely packed small targets, making them easily obscured by other objects or the background. In general, generic feature extractors [4][5][6] downsample the feature maps to reduce spatial redundancy and noise while learning high-dimensional features. However, this processing inevitably causes the representation of small objects to be lost. Additionally, in real-world scenarios, the background exhibits diversity and complexity, characterized by various textures and colors. Consequently, small objects tend to be easily confounded with these background elements, which makes them harder to detect. In summary, there is a need to design a real-time target detection model for UAV aerial photography that is suitable for dense small target scenarios in order to meet practical application requirements.
Object detection algorithms based on neural networks can generally be divided into two categories: two-stage detectors and one-stage detectors. Two-stage detection methods [7][8][9] first use region proposal networks (RPNs) to extract object regions, and then detection heads use region features as input for further classification and localization. In contrast, one-stage methods directly generate anchor priors on the feature map and then predict classification scores and coordinates. One-stage detectors have a higher computational efficiency but often lag behind in accuracy. In recent years, the YOLO series of detection methods has been widely used in object detection of UAV aerial images due to their fast inference speeds and good detection accuracies. YOLOv1 [10] was the first YOLO algorithm, and subsequent one-stage detection algorithms based on its improvements mainly include YOLOv2 [11], YOLOv3 [12], YOLOv4 [13], YOLOv5 [14], YOLOX [15], YOLOv6 [16], YOLOv7 [17], and YOLOv8 [18]. YOLO algorithms directly regress the coordinates and categories of objects, and this end-to-end detection approach significantly improves the detection speed without sacrificing much accuracy, which meets the basic requirements of real-time object detection for unmanned systems.
Previous improvement methods for target detection in UAV aerial images can be categorized into three types: (i) utilizing more shallow feature information, such as adding small target detection layers [19]; (ii) enhancing the feature extraction capability of the target detection network, such as improving the Neck network [20] or introducing attention mechanisms [21]; and (iii) increasing input feature information, such as generating higher resolution images [22], image copying [23], and image cropping [24,25].
Taking into consideration the aforementioned discussion, we propose a high-precision real-time algorithm, namely YOLOv7-UAV, for aerial image detection in unmanned aerial vehicles (UAVs). In summary, the contributions of this paper are as follows:
(1) We have optimized the overall architecture of the YOLOv7 model by removing the second downsampling layer and introducing an innovative approach that eliminates its final neck and detection head. This modification significantly enhances the efficiency with which the detection model uses shallow-level information.
(2) We present the DpSPPF module as an alternative to the SPPF module. It replaces the original max pooling layers with a concatenation of smaller-sized max pooling layers and depth-wise separable convolutions. This design choice enables a more detailed extraction of feature information at different scales.
(3) We propose the binary K-means anchor generation algorithm, which avoids the problem of local optimal solutions and increases the focus on sparse-sized targets by reasonably dividing the anchor generation range into intervals and assigning different numbers of anchors that need to be generated in each interval.
(4) Extensive experiments were conducted on both the VisDrone dataset and the TinyPerson dataset to validate the superiority of our proposed method over state-of-the-art real-time detection algorithms.

YOLOv7
YOLOv7 is one of the most advanced single-stage object detection algorithms that satisfies both real-time and high-precision requirements. YOLOv7 incorporates several trainable bag-of-freebies, which can significantly enhance the detection accuracy without increasing the inference cost. It uses the "extend" and "compound scaling" methods to improve the utilization of parameters and computational resources. YOLOv7 also incorporates improved re-parametrization modules and label assignment strategies. The YOLOv7 model is mainly composed of three parts: a backbone network (Backbone), a bottleneck layer network (Neck), and a detection network (Head). The backbone network includes standard convolutional layers, max pooling layers, Extended Efficient Layer Aggregation Networks (ELAN) modules, and SPPCSPC modules. The backbone network performs feature extraction, where the ELAN module increases the cardinality of newly added features using group convolution without altering the original gradient propagation path. It merges features from different groups by mixing and merging their cardinalities, which enhances the learned features from different feature maps and improves the usage of parameters and computations. The SPPCSPC module performs feature extraction through max-pooling with different pooling kernel sizes, which expands the model's receptive field. To fuse feature information on different scales, the neck uses three different sized feature maps extracted from the backbone for feature fusion. This part still uses the PANet [26] structure based on FPN, which adds channels from shallow to deep networks. The model's head can be viewed as the YOLOv7 classifier and regressor.
However, the YOLOv7 algorithm was not specifically designed for small-object datasets, so it cannot be directly applied to the detection of aerial images from unmanned aerial vehicles (UAVs).

Spatial Pyramid Pooling
Spatial pyramid pooling (SPP) was proposed by Kaiming He et al. [27]. It aggregates features of different sizes by using pooling layers of different scales and produces an output with a fixed size. In the YOLO series, YOLOv4 was the first to incorporate the SPP structure. YOLOv5 replaced the three parallel maximum pooling layers in SPP with three concatenated maximum pooling layers of smaller sizes to obtain Spatial Pyramid Pooling-Fast (SPPF). The impacts of SPPF and SPP on neural network output results are nearly identical, but SPPF has a faster processing speed. YOLOv7 uses the SPPCSPC module, which is a fusion of SPP and CSPNet [28] modules. Compared to the SPP module, the SPPCSPC module can extract richer feature information, but it has a higher number of parameters and a higher computational complexity.
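The relationship between SPP and SPPF can be checked numerically. The sketch below is a minimal pure-Python illustration, not code from any YOLO release: it assumes stride-1 max pooling whose padding never wins the maximum, kernel sizes of 5, 9, and 13 for the parallel SPP branches, and a hypothetical helper `max_pool_same` applied to a hypothetical 12 × 12 random feature map. Under these assumptions, applying a 5 × 5 pooling once, twice, and three times in series reproduces the three parallel outputs.

```python
import random

NEG_INF = float("-inf")

def max_pool_same(x, k):
    """Stride-1 max pooling on a 2-D list; out-of-range cells are ignored,
    which is equivalent to padding with -inf, so the output keeps the input size."""
    h, w, r = len(x), len(x[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            best = NEG_INF
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        best = max(best, x[ii][jj])
            row.append(best)
        out.append(row)
    return out

random.seed(0)
feat = [[random.random() for _ in range(12)] for _ in range(12)]

# SPP-style branches: three parallel poolings with kernels 5, 9 and 13.
spp = [max_pool_same(feat, k) for k in (5, 9, 13)]

# SPPF-style branches: the same 5x5 pooling applied one, two and three times in series.
p1 = max_pool_same(feat, 5)
p2 = max_pool_same(p1, 5)
p3 = max_pool_same(p2, 5)
sppf = [p1, p2, p3]

same_outputs = spp == sppf
```

In this setting the serial form visits 3 × 25 = 75 window positions per output element instead of 25 + 81 + 169 = 275, which is consistent with SPPF being faster while giving nearly identical results.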

Anchor Generation Algorithm
Anchors were first introduced in Faster R-CNN as pre-defined bounding boxes that are used to label regions in an image that may contain objects. They aid in the precise and efficient localization of targets for object detection algorithms. During detection, anchor-based object detection models adjust the size of anchors and filter them to obtain the final predicted boxes. In the past, there have been two main approaches to obtaining anchors: one involves manual design, while the other involves clustering algorithms such as K-means and K-means++. In the YOLO series, the anchor mechanism was first introduced in YOLOv2. YOLOv3, YOLOv4, YOLOv5, and YOLOv7 object detection models employ a genetic algorithm to refine the anchors generated by the K-means algorithm.

Bounding Box Regression Loss Function
The bounding box regression loss function is an important component in object detection tasks which measures the difference between predicted detection boxes and true boxes. In early object detection methods, the Mean Square Error (MSE) loss function was a common choice, which calculates the squared error between the predicted coordinates of the detection box and the true coordinates of the box. However, the MSE loss function is highly sensitive to outliers. To address this issue, Fast R-CNN introduced the Smooth L1 loss function, which uses a square function when the error is small and a linear function when the error is large, and is therefore more robust to outliers.
IoU Loss is a loss function based on intersection over union (IoU), which optimizes the model by minimizing the IoU distance between the detection box and the true box, thereby more directly considering the degree of overlap between the two boxes. GIoU Loss [29] is an improved version of IoU Loss, which not only considers the intersection and union of the two boxes but also considers the distance between their bounding boxes. Zhaohui Zheng et al. [30] proposed DIoU and CIoU. DIoU Loss is an improvement of GIoU Loss, which uses a more accurate distance metric in the calculation of the distance. CIoU Loss further considers the difference in aspect ratios on the basis of DIoU Loss. Compared to CIoU Loss, EIoU Loss [31] directly considers the difference in length and width, and SIoU Loss [32] adds considerations for the angle of the bounding box regression. Jinwang Wang et al. [33] pointed out that IoU is too sensitive to small object position deviations; thus, they designed an evaluation metric (NWD, normalized Gaussian Wasserstein distance) for small objects based on the Wasserstein distance.
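The sensitivity of IoU to small position deviations can be illustrated with a short calculation. The following sketch is only an illustration under assumed values: the box sizes (6 × 6 and 36 × 36 pixels), the one-pixel diagonal shift, and the helper names `iou` and `shifted_iou` are hypothetical and are not taken from the experiments in this paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def shifted_iou(side, shift):
    """IoU between a square ground truth and the same square shifted diagonally."""
    gt = (0.0, 0.0, float(side), float(side))
    pred = (float(shift), float(shift), float(side + shift), float(side + shift))
    return iou(gt, pred)

small = shifted_iou(6, 1)    # 25 / 47, roughly 0.53
large = shifted_iou(36, 1)   # 1225 / 1367, roughly 0.90
```

The same one-pixel error costs the small box almost half of its IoU but the large box only about a tenth, which is the behavior that motivates a metric such as the NWD for small objects.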

YOLOv7-UAV
YOLOv7 is one of the most advanced single-stage object detection models, which comprises seven distinct versions: YOLOv7-tiny, YOLOv7, YOLOv7-X, YOLOv7-W6, YOLOv7-E6, YOLOv7-D6, and YOLOv7-E6E. Considering the trade-off between detection accuracy and speed, we selected the YOLOv7 model as the foundation for constructing the YOLOv7-UAV network architecture.
The overall structure of the YOLOv7-UAV model is illustrated in Figure 1, which differs from YOLOv7 in four aspects. In the following four subsections, we introduce each of these four modifications in detail. It should be noted that, in order to ensure a fair comparison, we performed an overall scaling of the channel numbers on the modified model so that the compared models had similar GFLOPs. The scaling of channel numbers is accomplished through the approach described in Equation (1). We denote the scaling factor for the portion of YOLOv7 that is located before the second downsampling layer as W1, and the scaling factor for the remaining portion as W2. The GFLOP calculation formula for a convolutional neural network (CNN) is shown in Equation (2). Removing the second downsampling layer in the YOLOv7 model results in a two-fold increase in the height and width of the feature maps following the layer, leading to a significant rise in model GFLOPs. However, the model's GFLOPs can be substantially reduced by reducing the number of feature map channels using a scaling factor W. Additionally, since removing the second downsampling layer does not affect the size of the feature maps preceding it, different values of W can be assigned to the feature maps before and after the second downsampling layer. It is worth noting that the settings of W1 and W2 in this paper not only make the GFLOPs of the model similar before and after adjustment but also adhere to the ratio of GFLOPs between the second downsampling layer of the YOLOv7 model before and after the adjustment. Clearly, there are infinitely many combinations of W1 and W2 that satisfy this condition. However, due to experimental constraints, we only compared a few of them.
where C1 and C2 are the numbers of channels in a neural network layer before and after scaling, respectively.
where H and W represent the height and width of the output feature map, k denotes the size of the convolutional kernel, and Ci and Co correspond to the channel numbers of the input and output feature maps, respectively.
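The interplay between removing a downsampling layer and rescaling the channels can be made concrete with a small calculation. The sketch below rests on assumptions, since Equations (1) and (2) are not reproduced in this text: it takes Equation (1) to be C2 = W × C1 (rounded to an integer) and Equation (2) to be H × W × k² × Ci × Co for a single convolutional layer. The function names and the layer sizes (a 3 × 3 convolution with 128 input and 256 output channels on a 160 × 160 map) are hypothetical.

```python
def scale_channels(c1, w):
    """Assumed form of Equation (1): C2 = W * C1, rounded to an integer."""
    return int(round(c1 * w))

def conv_flops(h, w, k, c_in, c_out):
    """Assumed form of Equation (2): H * W * k^2 * C_i * C_o for one conv layer."""
    return h * w * k * k * c_in * c_out

# Hypothetical layer located after the second downsampling layer (640 -> 320 -> 160).
base = conv_flops(160, 160, 3, 128, 256)

# Removing the second downsampling layer doubles the height and width of the feature map.
no_downsample = conv_flops(320, 320, 3, 128, 256)

# Scaling both channel counts with W2 = 0.5 compensates for the larger feature map.
rescaled = conv_flops(320, 320, 3, scale_channels(128, 0.5), scale_channels(256, 0.5))
```

Under these assumptions the cost of the layer quadruples when the downsampling layer is removed and returns to its original value once both channel counts are halved, which illustrates why a factor such as W2 = 0.5 keeps the GFLOPs of the compared models similar.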

Reducing the Receptive Field of YOLOv7
We removed the second downsampling layer in YOLOv7 in order to reduce the receptive field and mitigate the loss of fine-grained feature information caused by downsampling.
Despite the fact that deep-level feature information is beneficial for object classification, there exists a semantic gap between feature information extracted from different layers, and the overly large receptive field of deep networks is not conducive to detecting small objects. Thus, we removed the third detection head of the YOLOv7 model to enhance its utilization of fine-grained feature information. We then adjusted the model's channel numbers by scaling them using W1 = 0.75 and W2 = 0.5.

Replacing SPPCSPC with DpSPPF
We believe that a large-sized maximum pooling layer will result in the loss of fine-grained feature information, which is detrimental to small object detection. Therefore, we propose the DpSPPF module, which replaces the maximum pooling layer (Kernel_size = 5) in the SPPF module with interconnected smaller depth-wise separable convolutions (Kernel_size = 3) and max pooling layers (Kernel_size = 3). The structure of the DpSPPF module is illustrated in Figure 2. Subsequently, we incorporated the DpSPPF module into the deepest layer of the backbone of the YOLOv7 model that had undergone the two modifications described above, in order to aggregate feature information at different scales.

Binary k-Means Anchor Generation Algorithm
In UAV-based image detection tasks, there exist targets of different sizes, and these targets are imbalanced in both the dataset and real-world scenarios. YOLOv7 initially generates anchors using the K-means algorithm and then applies the standard genetic algorithm to mutate these anchors based on their fitness, which is determined by the overlap between the generated anchors and the dimensions of all the targets in the training set. However, the K-means algorithm is highly influenced by initial points and outliers, which may result in the clustering results being only locally optimal. In addition, when the K-means algorithm is combined with the genetic algorithm in the anchor clustering process, it often focuses on samples with common sizes, while some samples with rare sizes may significantly deviate from the clustering results. To address these issues, we propose an improved anchor generation algorithm referred to as the "binary K-means anchor generation algorithm".
The binary K-means prior anchor generation algorithm first obtains K cluster centers on the dataset using K-means and the genetic algorithm. Based on the width and height of the cluster anchor with the largest area, the algorithm divides the target size distribution range of the dataset into three intervals in which anchors are generated, requiring each interval to generate at least one prior anchor. This helps the generated anchors to be closer to some rare sizes and reduces the probability of the result being only a locally optimal solution. In addition, the algorithm determines the number of prior anchors to be generated in each interval based on the ratio between the numbers of cluster centers contained in each interval, so that more attention can be paid to samples with common sizes during the process of prior anchor generation. Its steps are shown in Algorithm 1.
The selection of the K value affects the degree of attention paid by the target detection algorithm to targets of different sizes, so K needs to be chosen within an appropriate range. Through the experiments in Section 4.3.2, we found that when generating six anchors on the VisDrone2019 dataset, as long as the value of K is within [11, 19], the binary K-means prior anchor generation algorithm performs better than the K-means prior anchor generation algorithm. The default value of K for YOLOv7-UAV is 12, which was determined as the optimal value through testing on the VisDrone2019 dataset.
To illustrate more clearly the difference in clustering performance between the binary K-means prior anchor generation algorithm and the approach that sequentially uses K-means and the genetic algorithm, we present in Figure 3 a comparison of the two algorithms on the VisDrone2019 dataset and the TinyPerson dataset. From the figure, it can be observed that the anchors generated by the binary K-means prior box generation algorithm are more widely dispersed, yet they also place a greater emphasis on objects of common sizes.
3: K: The number of clusters in the first clustering step.
4: k: The number of anchors needed.
Output: anchor = (a_1, a_2, ..., a_k)
5: Consecutively use K-means and the genetic algorithm to obtain K cluster centers on T, listed in ascending order of area as A_1, ..., A_K.
6: N_1 ← 0, N_2 ← 0, N_3 ← 0. N_1, N_2, and N_3 represent the numbers of anchors allocated to each of the three intervals, initially set to 0.
7: b ← (A_K[0] + A_K[1])/6. It serves as the partition threshold for the three intervals in which anchors are generated.
8: for i = 1 to K do
9:   if
append(T[i])
end if
39: end for
40: Successively apply K-means and the genetic algorithm for clustering on T_1 (k = k_1), T_2 (k = k_2), and T_3 (k = k_3) to obtain the anchors.
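The overall flow of the binary K-means anchor generation can be sketched in a few lines of Python. This is a rough illustration under several stated assumptions and not the authors' implementation: the genetic refinement is omitted, plain Euclidean K-means is used, boxes are assigned to the three intervals by their mean side length using the thresholds b and 2b, and the remaining anchors are handed out greedily in proportion to the number of first-pass cluster centers per interval. All function names and the synthetic box sizes are hypothetical.

```python
import random

def kmeans(points, k, iters=30, seed=0):
    """Plain Lloyd's K-means on (w, h) pairs with Euclidean distance; centers sorted by area."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w, h in points:
            nearest = min(range(k), key=lambda c: (w - centers[c][0]) ** 2 + (h - centers[c][1]) ** 2)
            groups[nearest].append((w, h))
        # An empty cluster keeps its previous center.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return sorted(centers, key=lambda c: c[0] * c[1])

def interval_of(box, b):
    """ASSUMED rule: bucket a box by mean side length into [0, b), [b, 2b) or [2b, inf)."""
    return min(int(((box[0] + box[1]) / 2) // b), 2)

def allocate(center_counts, k):
    """ASSUMED rule: one anchor per interval, the rest in proportion to the center counts."""
    alloc = [1, 1, 1]
    for _ in range(k - 3):
        j = max(range(3), key=lambda i: center_counts[i] / alloc[i])
        alloc[j] += 1
    return alloc

def binary_kmeans_anchors(boxes, K=12, k=6):
    centers = kmeans(boxes, K)                    # first pass (genetic refinement omitted)
    b = (centers[-1][0] + centers[-1][1]) / 6     # threshold from the largest-area center
    counts = [0, 0, 0]
    for c in centers:
        counts[interval_of(c, b)] += 1
    subsets = [[], [], []]
    for box in boxes:
        subsets[interval_of(box, b)].append(box)
    anchors = []
    for subset, k_i in zip(subsets, allocate(counts, k)):
        if subset:                                # skip an interval that holds no boxes
            anchors += kmeans(subset, min(k_i, len(subset)))
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Synthetic (w, h) sizes: many small targets, some medium ones, a few rare large ones.
rng = random.Random(1)
boxes = [(rng.uniform(4, 30), rng.uniform(4, 30)) for _ in range(300)]
boxes += [(rng.uniform(35, 65), rng.uniform(35, 65)) for _ in range(40)]
boxes += [(rng.uniform(80, 120), rng.uniform(80, 120)) for _ in range(10)]
anchors = binary_kmeans_anchors(boxes, K=12, k=6)
```

Because every interval receives at least one anchor, the rare large boxes in this toy example obtain an anchor of their own, while the proportional allocation still places most anchors where the first-pass cluster centers are concentrated.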

NWD and Positive/Negative Sample Allocation Strategy
YOLOv7 determines the number of positive samples required (k) for each ground truth object by summing the top 10 IoU scores. The model then selects the top k samples with the smallest cost (the cost is the sum of the classification loss and the regression loss, added in a ratio of 1:3) for each ground truth object as positive samples. Due to the excessive sensitivity of the IoU to size deviations of small objects, this approach leads to an insufficient number of positive samples being assigned to small target ground truths during the training process of object detection networks.
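The effect of this rule on small targets can be seen in a toy calculation. The sketch below is a simplified illustration rather than the YOLOv7 source code: the function name `dynamic_k`, the clamp to at least one positive sample, and the two lists of candidate IoU scores are assumptions chosen for demonstration.

```python
def dynamic_k(ious, top=10):
    """Number of positive samples for one ground truth: the sum of its `top`
    largest IoU scores, truncated to an integer and clamped to at least 1 (assumed)."""
    return max(1, int(sum(sorted(ious, reverse=True)[:top])))

# Hypothetical IoU scores between one ground truth and its ten best candidates.
large_object_ious = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]
small_object_ious = [0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.00]

k_large = dynamic_k(large_object_ious)   # sum = 6.75 -> 6 positives
k_small = dynamic_k(small_object_ious)   # sum = 1.55 -> 1 positive
```

With these illustrative scores the large object receives six positive samples and the small object only one, which mirrors the shortage of positive samples for small ground truths described above.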
The normalized Wasserstein distance (NWD) is a novel method for evaluating small object detection, which models the bounding boxes as two-dimensional Gaussian distributions and measures the similarity between predicted and ground truth objects, regardless of their overlap. The NWD is less affected by the scale of objects, making it particularly suitable for evaluating small objects. In YOLOv7's positive/negative sample allocation strategy, we use a weighted combination of the NWD and IoU metrics instead of the IoU metric alone. The IoU loss is retained, as it is more suitable for medium- to large-sized objects. The computation process of the NWD is shown in Equation (3). After experimental tuning, we set NWD:IoU = 0.5:0.5 for the TinyPerson dataset, and NWD:IoU = 0:1.0 for the VisDrone2019 dataset.
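A minimal sketch of the weighted metric is given here for illustration. It assumes the NWD definition of [33], in which each box (cx, cy, w, h) is modeled as a Gaussian with mean (cx, cy) and standard deviations (w/2, h/2), and it assumes that the weighting is a simple linear combination of the NWD and IoU scores; the function names and example boxes are hypothetical.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein distance for boxes given as (cx, cy, w, h)."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    w2_squared = (xa - xb) ** 2 + (ya - yb) ** 2 + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    return math.exp(-math.sqrt(w2_squared) / c)

def iou_cxcywh(box_a, box_b):
    """IoU for boxes given as (cx, cy, w, h)."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    inter_w = max(0.0, min(xa + wa / 2, xb + wb / 2) - max(xa - wa / 2, xb - wb / 2))
    inter_h = max(0.0, min(ya + ha / 2, yb + hb / 2) - max(ya - ha / 2, yb - hb / 2))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def mixed_similarity(box_a, box_b, nwd_weight=0.5, c=12.8):
    """ASSUMED linear mix: nwd_weight * NWD + (1 - nwd_weight) * IoU."""
    return nwd_weight * nwd(box_a, box_b, c) + (1.0 - nwd_weight) * iou_cxcywh(box_a, box_b)

# Two 6 x 6 boxes whose centers are 7 pixels apart, so they do not overlap.
gt = (10.0, 10.0, 6.0, 6.0)
pred = (17.0, 10.0, 6.0, 6.0)

iou_value = iou_cxcywh(gt, pred)                          # 0.0: IoU cannot rank this candidate
nwd_value = nwd(gt, pred)                                 # exp(-7 / 12.8), about 0.58
tiny_person_score = mixed_similarity(gt, pred, 0.5)       # NWD:IoU = 0.5:0.5
visdrone_score = mixed_similarity(gt, pred, 0.0)          # NWD:IoU = 0:1.0, i.e. plain IoU
```

In this example the IoU is zero although the prediction is only a few pixels away, whereas the NWD still assigns a graded similarity, so a nearby candidate can contribute to the positive sample count of a small target.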
$$\mathrm{NWD}(N_a, N_b) = \exp\left(-\frac{\sqrt{W_2^2(N_a, N_b)}}{C}\right) \quad (3)$$

where C is a constant related to the dataset (we adopted the same setting of C = 12.8 as in the original paper [33]), $W_2^2(N_a, N_b)$ is a distance metric which is computed using Equation (4), and $N_a$ and $N_b$ are Gaussian distributions modeled by $B_a = (x_a, y_a, w_a, h_a)$ and $B_b = (x_b, y_b, w_b, h_b)$.

$$W_2^2(N_a, N_b) = \left\| \left[x_a, y_a, \frac{w_a}{2}, \frac{h_a}{2}\right]^{T} - \left[x_b, y_b, \frac{w_b}{2}, \frac{h_b}{2}\right]^{T} \right\|_2^2 \quad (4)$$

The VisDrone2019 [34] dataset consists of a large number of annotated images captured by drones, with a total of 7019 images divided into 10 classes. The training and validation sets contain 6471 and 548 images, respectively. This dataset mainly contains small- and medium-sized targets.
The TinyPerson [35] dataset consists of 1610 images with a total of 72,651 annotated bounding boxes, mainly focusing on small objects. The images in this dataset were mainly captured by unmanned aerial vehicles and are categorized into two groups, namely sea_person and earth_person. The training and testing sets of the dataset comprise 794 and 816 images, respectively. There are a few annotation boxes in TinyPerson that can be ignored, including densely packed crowds that are difficult to separate, ambiguous regions, and shadow regions in water. These annotation boxes were replaced by the mean value of the image region in [35], and we simply ignore them.

Evaluation Metrics
We evaluated the performance of the object detection algorithm using four metrics, namely mean average precision (mAP), GFLOPs, frames per second (FPS), and the number of parameters. The mAP is computed using Equation (5); mAP0.5 denotes the mAP calculated at an IoU threshold of 0.5, and mAP(0.5:0.95) denotes the mAP averaged across 10 IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. GFLOPs quantify the computational complexity of the model, the number of parameters measures the size of the model, and FPS represents the actual inference speed of the model.
where N represents the total number of categories, P represents precision, and R represents recall.
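As a rough illustration of how such a metric can be computed, the sketch below evaluates the average precision of one class as the area under its precision-recall curve and then averages over classes. It is a generic formulation under stated assumptions (all-point interpolation with a monotone precision envelope), not necessarily the exact evaluation code used in the experiments; the function names and the toy precision-recall values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve of one class.
    `recalls` must be sorted in ascending order; precision is first replaced
    by its monotone non-increasing envelope (all-point interpolation)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values over the N categories."""
    return sum(per_class_ap) / len(per_class_ap)

# Toy example with two classes.
ap_class_0 = average_precision([0.5, 1.0], [1.0, 0.5])   # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
ap_class_1 = average_precision([0.5], [0.5])             # 0.5 * 0.5 = 0.25
map_value = mean_average_precision([ap_class_0, ap_class_1])

# IoU thresholds over which the mAP is averaged for mAP(0.5:0.95).
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

Repeating the computation at each of the ten thresholds in `iou_thresholds` and averaging the results would give the mAP(0.5:0.95) value, while a single evaluation at 0.5 corresponds to mAP0.5.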
In the experiment conducted on the TinyPerson dataset, we set the number of epochs, batch size, and image input dimensions to 150, 4, and 960 × 960, respectively. In the experiment conducted on the VisDrone2019 dataset, we set the number of epochs, batch size, and image input dimensions to 300, 4, and 640 × 640, respectively. The iteration numbers of the K-means algorithm and genetic algorithm were set to 30 and 1000, respectively. The object detection model was evaluated for FPS on the corresponding test set of the dataset on which it was trained.

Ablation Experiments
We have divided the construction process of YOLOv7-UAV into four steps. In step 1, we removed the second downsampling layer and the deepest detection head of YOLOv7. Then, we reduced the number of channels in the model using Equation (1) with weights W1 = 0.75 and W2 = 0.5. In step 2, we introduced the DpSPPF module. In step 3, we utilized the binary K-means anchor generation algorithm. In step 4, we used the weighted NWD and IoU instead of the IoU in the positive and negative sample allocation strategy. In the tables of each subsection in this section, the experimental settings highlighted in bold serve as the baseline for the subsequent subsection.
Tables 1 and 2 illustrate the changes in the performance of the object detection model during the construction process of YOLOv7-UAV on the VisDrone2019 and TinyPerson datasets, respectively. The results shown in these two tables indicate that each step of improvement we made to YOLOv7 was effective. In the following four sections, we will further introduce the ablation experiments conducted for each step and compare YOLOv7-UAV with other advanced object detection algorithms. Due to the presence of a large number of intermediate-sized objects in VisDrone2019, we did not apply step 4 to the models trained on VisDrone2019.
We present the impact of changes to the model architecture on the detection model in Tables 3 and 4. In order to facilitate a fair comparison, we ensured that the experimental setups had similar GFLOP levels by adjusting the overall channel numbers of the models. The comparative results of the first and third rows, as well as those of the second and fourth rows in both tables, demonstrate that removing the detection head located at the deepest position of YOLOv7 not only significantly increases the detection speed of the model, but also greatly improves the accuracy of detecting objects in unmanned aerial vehicle (UAV) images. The comparative results from the fifth to the seventh rows in both tables indicate that, in terms of both the detection speed and accuracy, the DpSPPF module achieves the best performance compared to the popular SPPF and SPPCSPC modules. The experimental setup in the seventh row of both tables corresponds to step 2 as described in Tables 1 and 2.

The Impact of the K Value in the Binary k-Means Anchor Generation Algorithm
We conducted an empirical study on the impact of the choice of the K value in the first iteration of the binary K-means anchor generation algorithm on the performance of the object detection algorithm. The results, as shown in Table 5, indicate that as the value of K increases, the range of the generated anchors tends to become larger. Furthermore, the binary K-means anchor generation algorithm performs better when the value of K falls within the range of [11, 19]. Specifically, the results in the table also indicate that the object detection algorithm achieves the highest mAP (0.5:0.95) with K = 12 (i.e., step 3 in Tables 1 and 2).
In Table 6, we present the impact of the binary K-means anchor generation algorithm on several popular object detection models. For the object detection algorithms with nine anchors, we simply set K to 18 by taking 12 × K1/K2 = 12 × 9/6 = 18. As demonstrated in this table, all detection algorithms exhibited better performances on both datasets, indicating the excellent generalization ability of the binary K-means anchor generation algorithm.
Table 7 presents the impact of using different weights of NWD and IoU in the positive-negative sample allocation strategy on the detection performance. The results in the table indicate that on the TinyPerson dataset, setting NWD:IoU = 0.5:0.5 (i.e., step 4 in Table 2) yields the best performance, while the VisDrone dataset, which contains a large number of medium-sized objects, does not require the use of the NWD.

Algorithm Comparison
The detection performance comparison of YOLOv7-UAV and other state-of-the-art real-time object detection algorithms on a UAV dataset is presented in Table 8. The algorithm TPH-YOLOv5 in the table is specifically designed for detecting small targets in UAV imagery, while YOLOv8m-p2 is a version of YOLOv8 specifically designed for small object detection. The results indicate that YOLOv7-UAV outperforms its counterparts in both detection speed and accuracy.
Figure 4 illustrates the detection results of YOLOv7 and YOLOv7-UAV. Both models were trained on the training set of the VisDrone2019 dataset. It can be observed that YOLOv7-UAV has a lower false negative rate and generally higher confidence levels compared to YOLOv7.

Conclusions
Detecting a large number of small targets with diverse shooting angles in unmanned aerial vehicle (UAV) images remains a significant challenge for existing detection algorithms. This paper proposes an algorithm, YOLOv7-UAV, which can detect objects in UAV images in real time. Firstly, the algorithm reduces the loss of feature information and improves the model's utilization efficiency of fine-grained feature information by removing the second downsampling layer and the deepest detection head of the YOLOv7 model. Then, the algorithm replaces the maximum pooling layer in the SPPF module with concatenated smaller depth-wise separable convolution and maximum pooling layers, optimizing its ability to extract fine-grained feature information while retaining the ability to aggregate multi-scale feature information. Subsequently, the paper proposes a binary K-means anchor generation algorithm that reasonably divides the anchor box generation interval and retains a focus on common sizes to obtain better anchors. Finally, YOLOv7-UAV introduces the weighted NWD and IoU as evaluation metrics in the label assignment strategy on the TinyPerson dataset. Results on the VisDrone2019 and TinyPerson datasets demonstrate that YOLOv7-UAV outperforms most popular real-time detection algorithms in terms of the detection accuracy, detection speed, and memory consumption.
Despite the excellent performance of our proposed method on the unmanned aerial vehicle object detection datasets, there are still some limitations. Specifically, the effectiveness of YOLOv7-UAV on low-power platforms, such as embedded devices, requires further testing and optimization. Moreover, the determination of the K value in the binary K-means anchor generation algorithm is solely based on experimental results, without a sufficient analysis of the factors influencing K. In real-world unmanned aerial vehicle tasks, adverse conditions such as fog and darkness may be encountered, which our method has not been specifically optimized for. Additionally, unmanned aerial vehicles can be equipped with camera systems that have a larger field of view (FOV), which introduces radial and barrel distortions that significantly impact the detection of small targets, an aspect that our proposed algorithm has not specifically addressed.
In future work, we plan to optimize the performance of YOLOv7-UAV on low-power platforms by employing model compression techniques such as model pruning and distillation. Moreover, we aim to improve the process of generating anchors using the binary K-means algorithm by directly determining an appropriate K value based on the distribution of data within the dataset and the receptive fields of the target detection model. We plan to incorporate advanced generative networks or construct more abundant datasets to enhance the performance of the YOLOv7-UAV algorithm in challenging environments such as those with dense fog or low light conditions. Additionally, we intend to investigate the rectification issues of wide-angle cameras and devise targeted data augmentation methods to enhance the detection performance of object detection algorithms on images captured with a larger FOV.
Figure 3. Comparison of clustering results between the binary K-means anchor generation algorithm and the approach that sequentially uses K-means and a genetic algorithm: (a) clustering results on VisDrone2019; (b) clustering results on TinyPerson. ('k' and 'K' represent the number of anchors to be generated and the number of clusters in the first clustering step of the binary K-means prior box generation algorithm, respectively. The graph's coordinate points indicate the size of the targets, with a darker shade of blue indicating a higher number of targets.)

T_1, T_2, T_3 represent the sets of object bounding box sizes within the three intervals in which anchors are generated.
20: for i = 1 to n do
21:

Figure 4. The detection results of YOLOv7 and YOLOv7-UAV (the sample images are all from the testing set of the VisDrone2019 dataset).

Table 1. Impact of each step evaluated on VisDrone2019.

Table 2. Impact of each step evaluated on TinyPerson.

Table 3. Impact of changes in the YOLOv7 architecture evaluated on VisDrone2019 ('dsp' refers to the second downsampling layer).

Table 4. Impact of changes in the YOLOv7 architecture evaluated on TinyPerson ('dsp' refers to the second downsampling layer).

Table 5. The impact of the K value in the binary K-means anchor generation algorithm on VisDrone2019.

Table 6. Impact of the binary K-means anchor generation algorithm on several detection models.

Table 7. Impact of using different weights of NWD and IoU.

Table 8. Performance of the YOLOv7-UAV algorithm and other object detection algorithms.