1. Introduction
The accurate detection of objects within complex natural environments is critical for a variety of applications, including environmental monitoring and wildlife conservation. However, the task is challenging due to the intricacies of the natural world, such as varying weather conditions, dense foliage, and the presence of camouflage. Additionally, the detection of small or tiny objects within these environments requires sophisticated technology and algorithms, as well as significant computational resources. Despite advancements in automated systems, the reliable detection of objects in complex natural settings remains an open research problem. The emergence of deep neural networks such as R-CNN [
1], YOLO [
2] and ViT [
3] has provided solutions for reconnaissance purposes. These end-to-end detection algorithms eliminate the need for handcrafted feature extraction, making real-time detection in complex environments faster. However, the effectiveness of these algorithms in actual natural environments, which differ significantly from controlled experimental environments, remains a challenge. The following challenges remain when detecting in complex natural environments:
(1) In complex environments, the target’s color blends with the background, making it difficult for the algorithm to extract salient color features, so the target appears blurred in the final image. (2) Poor lighting conditions and environmental factors such as haze and shadows further complicate the imaging process, creating a complex background. (3) Occlusion of the target by natural elements such as woods, mountains, and buildings impairs the algorithm’s ability to extract the target’s edge features, leaving the target partially obscured. (4) It is also worth noting that, in real situations, targets in complex environments tend to be small and difficult to capture, resulting in fewer available samples for analysis.
Nowadays, different improvements are proposed in different complex environments. Utilizing YOLO, Mate et al. [
4] employed thermal images to improve object detection performance in challenging conditions such as bad weather, nighttime, and densely packed scenes. However, it is difficult to collect stable and usable thermal images in most complex environments, especially when shooting conditions are limited. Another approach to reducing the impact of weather-specific information is to build complex image-restoration networks: Liu et al. [
5] proposed a Differentiable Image Processing (DIP) module to target adverse weather conditions and predicted the parameters of the DIP module with a small convolutional neural network (CNN-PP); Hnewa et al. [
6] proposed a domain-adaptive training framework to face the detection problems in foggy and snow scenarios; Huang et al. [
7] similarly introduced a Dual Subnet (DSNet) to solve the problems of target detection in foggy environments.
However, the pipelines used differ in low-light situations, so these methods struggle to handle more complex scenes. For low-light situations, a number of approaches have been proposed: Sasagawa et al. [
8] proposed a transfer-learning-based detection method for low-light conditions that allows target detection below 1 lux; however, prior domain knowledge is required, and its repeatability across different scenarios is limited. Yuxuan et al. [
9], inspired by Receptive Fields (RFs), solved the problem of detecting targets with unclear features to some extent by preserving low-dimensional spatial features, but the computational requirements are higher. Peng et al. [
10] proposed the NLE-YOLO model, which improves feature extraction capability and the receptive field through modular improvements based on YOLOV5 and performs well on low-light detection. However, these methods are not applicable in all scenarios. Zhao et al. [
11] proposed DIOU, which solves the problem of face occlusion detection. Eran et al. [
12] proposed a Soft-IOU layer that can achieve better performance in extremely dense artificial scenes, and Chen et al. [
13] proposed bounding boxes with rotation parameters, which better detect rotated and densely packed objects. At the same time, Pixels-IOU [13] is introduced to improve the detection of tilted objects. Chi et al. [
14] designed a mask-guided module to leverage the head information to enhance the ability to detect occluded pedestrians. Li et al. [
15] used an occlusion-guided multi-task network (OGMN) that addresses the feature obfuscation problem for occluded targets. Cheng et al. [
16] proposed a joint network with an image enhancement subnet to address occlusion and a detection subnet to detect marine objects.
However, detection in naturally occluded scenes remains a cutting-edge challenge. Qi et al. [
17] devised a contrast training strategy and a multi-association detector, with the result that the model can be trained with fewer shots and can detect unlabeled categories; Zhu et al. [
18] combined semantic relations with visual information for better results with few-shot object detections; Ren et al. [
19] built a two-stage meta-learning model based on YOLO to avoid the detection errors of a single YOLO model in few-shot detection. However, the generalization of the above models to different scenarios is still incomplete.
Despite the numerous proposals aimed at addressing poor texture features, low illumination, occlusion, and a lack of samples in complex situations, there is still a paucity of natural-scene models that perform robustly in all complex situations. More notably, when target detection algorithms are deployed at the edge in complex environments, the available hardware makes it difficult to meet high computational requirements. Therefore, we have to choose the model with the best generalization ability and low computational requirements, and propose adaptive improvements to it. We therefore focus on one-stage object detection and endeavor to enhance the YOLO model, which ranks among the most popular and easiest-to-improve models for tackling these practical problems.
In the YOLO series, YOLOV5 has distinguished itself as an efficient model in real-time scenes in the wild and other scenes [
20,
21,
22]. More recently, YOLOV8 has also shown strong performance [
23]. (The YOLOV6 and YOLOV7 models are not considered here, because they were not proposed by the same team as YOLOV8.) However, after examining the YOLOV8 framework in detail, we find that its improvements are not fully applicable to the practical scenario of weak-target detection in complex scenes with limited computing power and high real-time requirements. For example, although the anchor-free design does not rely on predefined anchors, it also means that the model confronts a larger solution space, which easily produces more false positives, reducing precision and detection speed. Moreover, replacing the C3 module with the deeper C2f module in YOLOV8 and using the Task-Aligned Assigner’s positive and negative sample allocation strategy consume more computational resources and reduce inference speed. It has also been shown that YOLOV8’s generalization on some wild scenarios is not as impressive as YOLOV5’s [
24,
25,
26]. However, it is worth noting that the differences in mAP, Precision, and other metrics between the versions were relatively small, indicating that both versions can achieve high object detection accuracy in different challenging environments.
In summary, the latest YOLOV8 model differs from its predecessor, the YOLOV5 model, only in several improvement strategies; YOLOV8 shows better performance on public datasets, but this does not prove improved generalization on arbitrary custom data. More importantly, comparing the two at the n and s scales, the V5 model has only two-thirds of the parameters of the V8 model and one-half of its model complexity; thus, YOLOV5 better meets the requirements of our edge deployment in complex scenarios. Since the YOLO framework is known for being easy to improve, we decided to make adaptive improvements on the YOLOV5 framework to address weak-object detection in complex real-world scenarios.
For adaptive improvement, the attention mechanism is the first that springs to mind, as it has been widely demonstrated and applied across various types of scenarios. In common detectors, there are usually three network structures that can be improved: the head, the neck, and the backbone. The head and the neck often act as a whole; to avoid confusion, they are collectively referred to below as the head network. Our improvements are therefore carried out in both directions: the head and the backbone. Adaptive improvements in these areas help to improve the model’s detection ability under conditions such as weak features, ambiguous backgrounds, and weak targets. To surmount the problem of inadequate data, we utilize a combination of copy–paste enhancement and image enhancement techniques. Copy–paste enhancement, with translation, scaling, and rotation, has been proven accurate and efficient in similar scenarios [
27,
28,
29]. We have also created a dataset of people in natural scenes, including mountains, mountain forests, and plains scenes.
All of the improvements currently proposed in the community claim to reduce computational requirements while increasing detection accuracy. It is therefore insufficient to analyse the advantages and disadvantages of these improvements through subjective analysis alone. Experiments must be conducted in real-scenario environments, and the impacts of the different improvements on the model must be compared in order to explore how useful these improvements are for real scenarios. Objectively speaking, some improvements turn out to offer no advantage or even show disadvantages. However, we still explain why the more inferior strategies fall short, to provide ideas for researchers working in more complex scenarios. The novelty of our work lies in questioning the current hot improvement methods and giving more realistic adaptive improvements. Perhaps the improvements we provide do not satisfy the task in all complex scenarios, but for similar scenarios, our work has value and relevance.
Our contribution can be distilled into the following points:
A natural scene dataset, featuring people in mountain, mountain forest, and plains scenes, has been contributed through the combination of copy–paste and image enhancement techniques.
A framework is proposed for enhancing the capability of detectors in extracting weak objects in complex environments, which can be adapted to integrate with popular detectors.
Nine improved methods based on YOLOV5 are employed to compare and analyze the advantages and disadvantages of these strategies for weak target detection in complex environments, thereby guiding researchers in similar work.
2. Methods
Our research aims to improve the accuracy of human detection in complex environments and to investigate whether the most popular improvements generalize sufficiently to complex natural environments. To face these challenges, we started by expanding our dataset through copy–paste and image enhancement methods, resulting in a more diverse and comprehensive dataset. Firstly, we collected and annotated only 1100 images, some of which contained only a single human object. We therefore applied copy–paste enhancement so that images containing only a single target contain at least three targets; the original single-target images are also retained, leaving us with 1825 images, which not only expands the dataset but also improves its coverage. Secondly, we randomly selected and combined augmentations for each image, drawing probabilistically from twenty enhancement methods, bringing the dataset to 3650 images. Thirdly, we extracted 1000 images from the open-source datasets Visdrone and OA to further expand the dataset’s size and diversity. Finally, we contributed a natural-scene human-detection dataset covering mountain, mountain forest, and plains scenes with 5650 images.
We then chose YOLOV5s as the baseline model for our research due to its adaptability and low computational consumption. For this most popular one-stage detector, three kinds of improvements are widely proposed and used: adding an attention mechanism, improving the head network, and improving the backbone network. To make the detector perform better in complex environments and to explore whether these improvements “actually” work, we propose improvements in the same three directions: adding an attention mechanism, replacing the head network, and replacing the backbone network. It is worth noting that we also used YOLOV8s for comparison during the baseline experiments and later compared and discussed it against the YOLOV5s model.
We propose adding an attention mechanism to our model, which extracts local information for efficient feature extraction; this helps to improve detection accuracy while introducing few parameters. We also try replacing the head network to fully extract target features, especially for weak and tiny targets. Finally, we try replacing the backbone network to strengthen its ability to extract target features, making the detector more effective with sparse textures and blurred backgrounds. By comparing the different options in the improvement strategy, we discuss and analyze which improvements are worth using on the YOLOV5s framework and which ones make the model redundant. The following
Figure 1 illustrates the framework of ideas in this paper:
2.1. Review of YOLOV5
YOLOV5 follows the grid concept of previous models: the input image is divided into grids, and each grid cell may predict one or more objects. During training, anchor boxes are regressed toward the grid cells containing ground-truth objects; the differences in width, height, and center coordinates between the anchor box and the ground-truth box form the localization loss, and binary cross-entropy serves as the confidence loss. The target detection problem is thereby greatly simplified into a regression and classification problem.
Currently, the sixth release of YOLOV5 provides a total of five network scales: YOLOV5n, YOLOV5s, YOLOV5m, YOLOV5l, and YOLOV5x. YOLOV5n is the smallest variant in the series and has been deployed in large numbers on embedded platforms for real-time detection tasks. YOLOV5s is slightly larger and demonstrates fast detection speed on devices with limited computational resources. The network structure of YOLOV5s is shown in
Figure 2.
As the depth of the network increases, the AP accuracy also increases, requiring more computational resources. The YOLOV5s model is described here.
YOLOV5, which is widely applied in surveillance recognition, vehicle driving, and similar tasks, still needs improvements of different degrees to detect small and weak targets accurately in real, complex natural environments. In this section, we first take YOLOV5s as the basis and analyze the more advanced network modules; we then propose improvement schemes in three parts: the attention mechanism, the head network, and the backbone network.
2.2. Improved Detection Methods Based on the Attention Mechanism
SENet is a network structure proposed by Hu et al. [
30]; when inserted into popular detection networks such as ResNet and VGG16, it has achieved improved results. YOLOV5 uses CSPDarknet53 as the backbone network to transform the original input image into multilayer feature maps. Adding the SE structure into the backbone network can suppress invalid channel information, reduce the false judgment rate, and improve detection accuracy. The SE mechanism consists of two main parts: Squeeze and Excitation. Squeeze refers to the compression of global information into a single-channel descriptor, i.e., using global average pooling over each channel to compress a W × H × C feature map containing global information into a 1 × 1 × C feature vector z. The Squeeze calculation is defined as follows:
z_c = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} x_c(i, j),
where c stands for the ordinal number of the element in z, and x_c is the c-th channel of the feature map.
The Excitation module, in order to utilize the information aggregated by Squeeze, obtains multi-channel information through two fully connected layers. The first fully connected layer compresses the C channels and applies a ReLU; the second fully connected layer restores the number of channels to C and applies a Sigmoid to obtain the weight vector s, which is used to weight the feature map. The Excitation calculation is defined as follows:
s = σ(W_2 · δ(W_1 · z)),
where δ stands for the ReLU function and σ stands for the Sigmoid function. Finally, the Scale operation multiplies the resulting attention weights with the corresponding channels of the feature map:
X̃_c = s_c · x_c,
where X̃ is the output of the SE block, i.e., the channel features with their mutual dependencies encoded.
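As an illustration of the Squeeze, Excitation, and Scale steps described above, a minimal PyTorch-style sketch of a standard SE block is given below (module and variable names are ours; this is a sketch, not the exact implementation used in our network):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: channel attention with reduction ratio r
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # W x H x C -> 1 x 1 x C
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // r),          # compress the C channels
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),          # restore the C channels
            nn.Sigmoid(),                                # channel weights s in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)                   # Squeeze: global average pooling
        s = self.excitation(z).view(b, c, 1, 1)          # Excitation: two FC layers
        return x * s                                     # Scale: reweight each channel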
The features learned by the shallow layers of the network are not yet rich. In order to obtain good results, we therefore propose a new backbone structure by adding two SE modules to layers 6 and 9 of the original backbone network. The network structure is shown in
Figure 3:
CBAM [
31] is a network module that can infer attention along both the channel and spatial dimensions. Compared to SENet, which focuses only on channel attention, the CBAM module achieves better results. In YOLOV5, the C3 module (a Cross Stage Partial bottleneck with three convolutions) is used to extract image features; it enhances feature transfer and reuse by partitioning the feature map into multiple branches and cross-connecting them between different layers.
Figure 4 illustrates the improvement strategy proposed in this paper:
In CBAM, the channel attention is computed as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))),
where F stands for the feature map, MLP stands for a multilayer perceptron, and σ stands for the sigmoid function.
Spatial attention takes the feature map output by channel attention as input and performs mean pooling and maximum pooling in the channel dimension separately:
M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])),
where 7 × 7 stands for the size of the convolution kernel f, which has been proven to perform better than a 3 × 3 kernel.
As a plug-and-play attention module, CBAM is added into the C3 module so that attention compensation is performed each time the block is applied, weighting the attention on targets in the feature map along both dimensions and attenuating the attention on irrelevant features.
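For clarity, the channel and spatial attention described above can be sketched in PyTorch as follows (a simplified illustration of a standard CBAM module with our own naming, not the exact code inserted into C3):

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))       # MLP(AvgPool(F))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))        # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)                       # mean pooling over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)                      # max pooling over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # 7x7 conv + sigmoid

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # channel attention first
        return x * self.sa(x)   # then spatial attention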
SOCA [
32] is a mechanism that considers the correlation between high-level features in deeper networks. SOCA consists of two parts: covariance normalization and a channel attention mechanism. The SOCA mechanism enables the network to focus more on locally beneficial information, i.e., it has a super-resolution-like capability. A covariance matrix is used in SOCA to describe the correlation between the C channels: the feature map X is reshaped into a C × N matrix (N being the number of feature positions, W × H), its covariance Σ is computed, and Σ is decomposed as Σ = V D V^T, where V stands for the matrix of eigenvectors and D stands for the diagonal matrix of eigenvalues. By normalizing the eigenvalues, the covariance statistics are normalized, and the weighted feature map is then calculated as X̃ = w ⊗ X, where ⊗ stands for channel-wise multiplication and w is the attention weight derived from the normalized covariance.
However, since SOCA is not capable of extracting features in shallow networks, this paper adds the SOCA module to the 23rd layer of the head network. The specific structure is shown in
Figure 5:
2.3. Improved Detection Methods Based on the Head Network
In YOLOX [
33], a dual-head structure (decoupled head) was designed for the classification and localization tasks in object detection. The decoupled head makes separate predictions for classification and localization: the classification branch is concerned with which target class the features in the network most resemble, while the localization branch is concerned with the coordinates of the location where the target is actually located. In the YOLO series of algorithms, by contrast, the detection head has always been coupled. By introducing separate branches in the detection head, target boxes and categories are predicted in two channels, which greatly improves detection speed. The classification head maps the output logits z to class scores over the C categories, where z stands for the logits of the output and C stands for the number of categories.
The regression head is used to predict the position of the bounding box for each anchor. The bounding box is defined by the center coordinates (x, y), the width w, and the height h. The regression objective is to predict the offsets of these values relative to the anchor point, (dx, dy, dw, dh).
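To make this concrete, the following is a minimal sketch of a decoupled head with separate classification, regression, and confidence branches (a simplified illustration with our own naming; it is not the exact head proposed in this paper):

import torch.nn as nn

class DecoupledHead(nn.Module):
    # Simplified decoupled head: a shared stem, then separate classification
    # and regression branches (one such head per feature level in a real detector).
    def __init__(self, in_channels, num_classes, num_anchors=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv2d(in_channels, num_anchors * num_classes, 1),   # class logits z
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv2d(in_channels, num_anchors * 4, 1),             # (dx, dy, dw, dh)
        )
        self.obj_branch = nn.Conv2d(in_channels, num_anchors, 1)    # objectness / confidence

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x), self.obj_branch(x)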
On the basis of the decoupled detection head, and considering that fully connected layers are more suitable for the classification task while convolutional layers are more suitable for the localization task, this paper proposes a more concise and efficient detection head. The design structure is shown in
Figure 6:
Adaptive Spatial Feature Fusion [
34] is a pyramid feature fusion strategy that solves the problem of inconsistency between features of different scales inside the feature pyramid by adaptively learning the fusion weights of each feature map. For each level, ASFF computes the fused feature map as a weighted sum of the feature maps rescaled from the other levels, where the spatial fusion weights w are predicted by a weighting network G, and σ stands for the sigmoid used to normalize them.
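A simplified sketch of this adaptive weighted fusion is shown below, assuming the three feature levels have already been resized to a common resolution and channel count, and normalizing the per-pixel weights with a softmax so they sum to one (an illustrative reduction of ASFF, not the full implementation):

import torch
import torch.nn as nn

class SimpleASFF(nn.Module):
    # Adaptive spatial feature fusion over three levels that are assumed to be
    # already resized to the same spatial size and channel count.
    def __init__(self, channels):
        super().__init__()
        self.weight_conv = nn.Conv2d(3 * channels, 3, kernel_size=1)  # weighting network G

    def forward(self, x0, x1, x2):
        w = self.weight_conv(torch.cat([x0, x1, x2], dim=1))  # per-pixel fusion logits
        w = torch.softmax(w, dim=1)                            # normalize so weights sum to 1
        return w[:, 0:1] * x0 + w[:, 1:2] * x1 + w[:, 2:3] * x2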
This strategy can be used by any single-stage detector, so the improved model of the head structure proposed in this paper is shown in
Figure 7:
BIFPN [
35] is a repeatedly stacked, weighted, bi-directional feature pyramid network that uses residual links to enhance feature representation. Unlike ASFF, the feature map is processed through a bottleneck structure containing several convolutional layers that reduce its dimensionality. Subsequently, the upper feature maps are fused with the bottom-level feature maps that have been processed by the bottleneck structure, using a normalized weighted sum:
P_out = Conv((w_1 · P_low + w_2 · P_up) / (w_1 + w_2 + ε)),
where Conv stands for convolution, P_up stands for the upper feature map, and w_i stands for the learnable fusion weights.
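The fast normalized weighted fusion at the core of BIFPN can be sketched for the two-input case as follows (an illustrative simplification; layer and variable names are ours):

import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # BiFPN-style fast normalized fusion of two feature maps of equal shape:
    # out = Conv((w1 * p_low + w2 * p_up) / (w1 + w2 + eps))
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))     # learnable fusion weights
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, p_low, p_up):
        w = torch.relu(self.w)                   # keep weights non-negative
        fused = (w[0] * p_low + w[1] * p_up) / (w.sum() + self.eps)
        return self.conv(fused)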
The number of BIFPN feature fusion strategies to use is not set artificially and was obtained in the original paper using a parametric grid search. In this paper, only one BIFPN network is added to replace the PANET strategy in the original model. The specific structure is shown in
Figure 8:
2.4. Improved Detection Methods Based on the Backbone Network
A YOLOV5 backbone network based on DarkNet can achieve excellent performance under normal circumstances, but it does not fully satisfy real-world needs. ConvNext [
36] is based on ResNet50 and, modeled after the structural choices of the Swin Transformer, is a network composed entirely of convolutional structures whose accuracy and scalability can compete with the Transformer. The biggest innovation in its backbone is the use of depth-wise convolution (i.e., the number of convolution kernels equals the number of input channels) together with larger kernel sizes. Each convolution kernel processes its own channel independently, extracting features specifically relevant to that channel; this selective focus on local spatial details within each channel loosely mirrors the way self-attention dynamically weighs the importance of different parts of the input. The calculation of depth-wise convolution is given as follows:
y_c(i, j) = Σ_{m,n} k_c(m, n) · x_c(i + m, j + n),
where c stands for the channel index of the input feature map, k_c stands for the convolution kernel for that channel, and (i, j) stands for the position in the feature map.
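For reference, a ConvNext-style block built around this depth-wise convolution can be sketched in PyTorch as follows (an illustrative version of the block components discussed below, including the LN and Layer Scale layers; not the exact blocks used in our backbone):

import torch
import torch.nn as nn

class ConvNextBlock(nn.Module):
    # ConvNeXt-style block: 7x7 depth-wise conv (groups = channels), LayerNorm,
    # an inverted bottleneck of point-wise layers, LayerScale, and a residual link.
    def __init__(self, dim, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                      # normalizes each sample's features
        self.pwconv1 = nn.Linear(dim, 4 * dim)             # point-wise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)             # point-wise projection
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # LayerScale

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                          # NCHW -> NHWC for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = self.gamma * x
        return shortcut + x.permute(0, 3, 1, 2)            # back to NCHW, residual add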
Drawing on the ConvNext-T model, the YOLOV5 backbone network is replaced with four ConvNext structural blocks. Among them, the LN layer is a normalization layer which, unlike BN normalization, is not limited by the number of samples and can normalize the different features of a single sample. The Layer Scale layer is used for scaling and shifting the input. Referring to the above network construction, our proposed improved structure is shown in
Figure 9:
MobileNet [
37] is a lightweight deep neural network that uses depth-wise separable convolution instead of standard convolution, which dramatically reduces the computational effort and the number of model parameters. Following the network structure proposed in the original MobileNet article, ordinary convolution and depth-wise separable convolution are combined to form a convolution block.
Unlike plain depth-wise convolution, depth-wise separable convolution follows the depth-wise step with a point-wise convolution (i.e., a 1 × 1 convolution kernel) whose purpose is to combine the C feature maps generated in the previous step into new feature maps:
y_k(i, j) = Σ_{c=1}^{C} w_{k,c} · x̂_c(i, j),
where x̂ stands for the set of feature maps obtained after applying depth-wise convolution independently to each channel. It is precisely because the spatial filtering and the channel mixing are factorized in this way that the number of parameters is reduced.
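A minimal sketch of such a depth-wise separable convolution block, together with a parameter-count comparison against a standard convolution, is given below (an illustration with our own naming, not the exact MobileNet block used here):

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # MobileNet-style block: a per-channel (depth-wise) 3x3 conv followed by a
    # 1x1 point-wise conv that mixes the C intermediate maps into new channels.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for a 3x3 conv with 64 input and 64 output channels (bias ignored):
# standard conv: 3*3*64*64 = 36,864 weights; separable: 3*3*64 + 64*64 = 4,672 weights.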
In this paper, the MobileNet Block is stacked three times in the backbone network, and the outputs are sent to the feature pyramid for processing; the structure is shown in
Figure 10:
Slimneck [
38] observed that depth-wise separable convolution (DSC) reduces the computational effort, but its feature extraction and fusion capabilities are much lower than those of standard convolution, so a new method, GSConv, was introduced to make the output of DSC as close as possible to that of standard convolution by shuffling the features from the standard convolution and mixing them into the output of the DSC. GSConv further enhances the feature representation by integrating a self-attention module into the convolution operation, computed as
Attention(Q, K, V) = softmax(Q K^T / √d_k) · V,
where Q, K, and V individually stand for the Query, Key, and Value, and d_k stands for the dimension of the key, which is used to scale the dot product for more stable gradients when performing the softmax.
However, if it is used at every stage in the network structure, it increases the number of model parameters and reduces the inference speed. The original article similarly mentions adding this structure to the neck network. This part of the improvement is classified here in the backbone network for the sake of uniform analysis. The structure proposed in this paper is shown in
Figure 11:
3. Experiments and Results
Before validating the improved models, the scientific validity of the custom dataset needs to be verified. Similarly, we need to test the baseline model on our custom dataset to ensure its viability on these data.
3.1. Datasets
Our task is to detect human targets in real scenarios covering mountain, snow, and mountain forest scenes. Even recognition by the human eye is difficult due to restricted conditions such as limited data collection, haze, and blurred targets. Therefore, the priority is to enhance the data and expand the dataset.
Complex environments present the realities of scarce target images, limited acquisition conditions, and poor imaging quality. Therefore, the images must be processed and enhanced.
We have collected about 1100 human target images, but the number of positive samples is relatively limited, which makes it difficult to meet the model training requirements. Kisantal et al. [
39] proposed to increase the number of weak targets in each image by using a “copy and paste” method, applying a stochastic transformation to the targets before they are pasted to a new location so as to produce appropriate variations in target size and rotation angle. Before pasting, we extract the targets based on the annotations. Then, we randomly take out one or two targets and add them to the images containing fewer than three targets, while also retaining the original images; in the end, we obtain 1825 images. When pasting a target, a check is made as to whether its new position overlaps with the existing targets in the image, in order to avoid cross-occlusion. With this method, the newly generated images contain several more weak targets than the originals. The rendering is shown in
Figure 12.
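A simplified sketch of this copy–paste procedure is given below (function names and the scale range are illustrative, not the exact script used to build our dataset): targets are cropped from annotated images, randomly rescaled, and pasted at positions that do not overlap existing boxes.

import random
import cv2

def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def copy_paste(img, boxes, patches, max_new=2, max_tries=20):
    # img: HxWx3 image; boxes: list of [x1, y1, x2, y2]; patches: cropped target images
    h, w = img.shape[:2]
    for patch in random.sample(patches, k=min(max_new, len(patches))):
        scale = random.uniform(0.8, 1.2)                       # random size variation
        patch = cv2.resize(patch, None, fx=scale, fy=scale)
        ph, pw = patch.shape[:2]
        if ph >= h or pw >= w:
            continue
        for _ in range(max_tries):                             # reject overlapping placements
            x1, y1 = random.randint(0, w - pw), random.randint(0, h - ph)
            new_box = [x1, y1, x1 + pw, y1 + ph]
            if all(iou(new_box, b) < 0.05 for b in boxes):
                img[y1:y1 + ph, x1:x1 + pw] = patch            # paste the target
                boxes.append(new_box)
                break
    return img, boxes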
Then, we deployed mosaic data enhancement, random affine transformation, hybrid data enhancement, and random HSV image enhancement. Here, we used the Albumentations library in Python, employing twenty such augmentations and selecting among them at random. With these methods, the number of images is further extended to 3650. The enhanced images are shown in
Figure 13.
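An illustrative Albumentations pipeline of the kind used here is sketched below; the specific transforms and probabilities are examples, not the exact twenty augmentations employed:

import albumentations as A

# image (H x W x 3 array), bboxes (YOLO-format boxes), and class_labels come from an annotated sample
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=15, p=0.5),
        A.HueSaturationValue(p=0.3),                   # HSV-style color jitter
        A.RandomBrightnessContrast(p=0.3),
        A.OneOf([A.MotionBlur(p=1.0), A.GaussianBlur(p=1.0)], p=0.2),
        A.RandomFog(p=0.1),                            # simulate haze
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)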
Using only the custom dataset not only reduces the generalization ability of the model but also leads to overfitting during training, so it is necessary to add some target detection data from similar complex scenarios. The final custom dataset consists of the following parts:
3.2. Baseline and Evaluation Methods
The experiments on the custom dataset in this paper are conducted using YOLOV5s as the experimental baseline. The results of each metric during the training and validation process are shown in
Figure 14:
The YOLOV5s pre-trained model is used for transfer learning, and the loss function is optimized with stochastic gradient descent. The initial learning rate is 0.01 and gradually decreases during training, while the momentum is 0.9. The original YOLOV5s model is trained with these settings.
The prediction training loss continues to decrease, and the category training loss stabilizes quickly, mainly because this is single-category detection. The confidence loss obj_loss shows a slight increase after 50 epochs, indicating that the model’s prediction ability improves on correctly predicted cases while its performance decreases on incorrectly predicted ones. All other metrics are normal and stable, proving that YOLOV5s can serve as the benchmark model for this experiment.
In this study, the evaluation metrics include the number of parameters, mAP_0.5, mAP_0.5:0.95, precision, recall, fps, and FLOPs. The number of parameters denotes the model size, fps denotes the number of frames per second processed by the model (higher is better), and FLOPs denotes the computational resources needed to run the model. The other metrics are calculated as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN), AP = ∫_0^1 P(R) dR, mAP = (1/n) Σ_{i=1}^{n} AP_i,
where TP (True Positives) stands for the number of correctly detected targets; FP (False Positives) stands for the number of background regions or other objects incorrectly detected as targets; FN (False Negatives) stands for the actual number of targets that the model failed to detect; n stands for the number of object classes (here, n takes 1); 0.5 stands for an IoU threshold setting of 0.5, and 0.5:0.95 stands for an IoU setting range of 0.5 to 0.95. IoU is a metric that measures the extent to which the predicted bounding box overlaps the true bounding box.
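A minimal sketch of how these metrics can be computed from detection counts is shown below (illustrative helper functions, not the evaluation code used in our experiments):

def precision_recall(tp, fp, fn):
    # tp: correctly detected targets; fp: background detected as targets; fn: missed targets
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return precision, recall

def mean_ap(ap_per_class):
    # mAP averages the per-class average precision; here n = 1 (single "person" class),
    # so mAP_0.5 is simply the AP of the person class at an IoU threshold of 0.5.
    return sum(ap_per_class) / len(ap_per_class)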
On the basis of the baseline experiments, we also fine-tuned and tested YOLOV8s with the same hyperparameters to facilitate the subsequent discussion.
As can be seen in
Table 1, YOLOV8s does show better performance. However, as discussed before, YOLOV8s does not show a dominant advantage. While its larger parameter count gives stronger feature extraction and, thus, improved mAP and Precision, the optimized inference speed and smaller parameter count of YOLOV5s still make it the preferred choice for edge-computing devices. It is not possible to conclude here which model is better; with sufficient computing power, the choice of YOLOV8s is unquestionable. The purpose of our study is not to compare the two models but to find suitable improvement strategies for models like YOLOV5 that are better suited to deployed inference. With the experimental results of the model the community considers the most SOTA as a reference, the following improvement strategies can be judged more fairly.
3.3. Experimental Results and Analysis
3.3.1. Experiments Based on the Attention Mechanism
After testing the standard model, we experimented with the improved models presented in Section 2.2. The training results are shown in the following table.
As can be seen in
Table 2, the model with the added SE mechanism improves mAP_0.5 by 0.8% over the baseline, mAP_0.5:0.95 by 1.9%, and precision by 0.8%, making it the most metric-enhancing of the attention models. Although the model with the SE mechanism has more parameters than the original model, the SE mechanism makes the model more parallelizable and therefore reduces its complexity. The improvement of mAP at different IoU thresholds indicates that the improved model predicts target locations more accurately. At the same time, however, the overall value of mAP_0.5:0.95 is low: when the IoU threshold is set very high, the demands on the position of the detection box become very strict, and since a publicly available dataset was added, it is not possible to attribute this solely to poor accuracy of YOLOV5s at high IoU threshold settings. The model with the CBAM mechanism exceeds the other models in parameters and model complexity, requiring more computational resources, which leads to a decrease in performance. Even though the CBAM module is able to improve detection in complex backgrounds, the decrease in mAP and recall shows that it is difficult for CBAM to demonstrate its ability to detect objects, especially when complex background images dominate the dataset. The model with the SOCA mechanism improves detection speed by 2.5%, but its precision decreases, indicating that the SOCA mechanism outperforms the other models in terms of complexity, but that its super-resolution capability performs poorly on the low-resolution data in this experimental dataset. This makes it difficult for the SOCA mechanism to detect correlations between features when the target is not sufficiently salient.
The following
Figure 15 shows the ground truth of surveillance video data in a dense urban environment, where the green boxes represent the annotation boxes. It can be seen that there are obvious labeling omissions, with only five targets labeled in the figure.
The images are detected with the benchmark model and the improved model, and the results are shown in
Figure 16. The baseline model and the models with the CBAM and SOCA mechanisms show a certain generalization ability and can detect human targets beyond the labeled ones, and the models with attention mechanisms have higher prediction confidence and more accurate localization. This happens mainly because the model does not obtain enough generalization ability from the insufficient amount of data in this scenario, and the omitted labels have a large impact on the model, which may lead to serious missed detections.
3.3.2. Improved Experiments Based on the Head Network
In this section, experiments are conducted on the models presented in
Section 2.3.
Table 3 shows each parameter of the training process:
From the table, it can be seen that the improved model using the ASFF head network shows varying degrees of degradation in all metrics, which is due to the complex structure of the ASFF head network and its suitability for tasks with targets of widely varying scales; the potential of the pyramidal feature representation therefore cannot be fully exploited on the dataset presented in this paper. In tiny-target detection tasks, multiscale feature fusion requires capturing finer feature details, and the multiscale features do not match weak target representations well enough, leading to increased interference from background noise. The models using the BIFPN and decoupled head networks show improvements in all metrics, with the decoupled-head model improving the most important metrics, mAP_0.5 and mAP_0.5:0.95, by 23.68% and 31.8%, respectively. More notably, similar to the decoupled head included in YOLOV8, its improvement on YOLOV5s reaches 0.872, 0.538, and 87.5% in mAP_0.5, mAP_0.5:0.95, and precision, respectively, achieving improvements similar to BIFPN while showing better performance with fewer parameters.
The previous section doesn’t show the performance improvements from the attention mechanism due to insignificant differences in real dataset results.
Figure 17 presents a more challenging graph:
The above figure shows a real blurred scene with 8 targets. It can be seen that the scene is very fuzzy, especially for the targets that occlude each other. Compared to the real image shown in the first experiment, this image is more blurred, with more background texture and interference behind the targets, which makes detection more difficult. The results are shown in
Figure 18:
The figure above shows the detections of the baseline and the improved models on these data. Seven targets were successfully detected by the baseline, but one target was missed, which indicates that the conf-thres and IoU-thres parameters of the baseline need to be adjusted: increasing the confidence threshold and decreasing the IoU threshold between prediction boxes from the default values of 0.25 and 0.45 can better handle the recognition task in such scenarios. The model using the BIFPN head showed some misclassification, suggesting that some background features resemble the target features in this scene and that the feature fusion module used did not improve detection. The model using ASFF showed a missed detection because the multi-scale features of ASFF are difficult to match with small targets, and the resulting background noise easily produces false judgments. Similarly, BIFPN uses bidirectional feature fusion, and the noise in the complex background may be incorrectly mixed with the target features during fusion, resulting in misjudgments. The improved network using the decoupled detection head, on the other hand, is able to detect all eight targets in the figure and performs better than the baseline.
3.3.3. Improved Experiments Based on the Backbone Network
In the previous section, we proposed improved models based on the backbone network, and
Table 4 shows the improved model training parameters:
As can be seen from the table, the final results differ greatly because of the different structure of each improved backbone network. The improved model using the ConvNext backbone is far more complex than the baseline due to the inclusion of four layers of backbone blocks with a parameter count of more than 27.8M, resulting in a network structure that far exceeds that of the baseline model. This improvement therefore does not match the lightweight design of YOLOV5s well, leading to the extraction of overly complex features that are difficult to process. The model using the MobileNet backbone, which is typically used for lightweighting large models, showed a 10.2% improvement in mAP_0.5 and a 64.3% improvement in detection speed in this experiment, illustrating MobileNet’s strength in accelerating the model and improving detection quality in that metric. At the same time, however, its insufficient feature extraction capability keeps this improvement from enhancing overall detection performance, and the remaining metrics degrade to some extent. The model using the Slimneck backbone improved by 20.4%, 18.2%, 7.4%, and 33.3% in mAP_0.5, mAP_0.5:0.95, detection precision, and recall, respectively, compared to the original model, which shows that the Slimneck backbone greatly improves the model’s weak-target detection. We also note that, with a number of parameters similar to YOLOV8s, the Slimneck-based improvement is superior in all performance metrics.
Figure 19 shows the detection results of each model in the above real scenario:
The above figure shows the detection results of the improved models. The improved models with ConvNext and MobileNet produce misidentifications, which also indicates that these two popular improvement methods are not applicable to YOLOV5s: their feature extraction and feature recognition capabilities do not match. From one point of view, the misjudged target is very similar in features to the correct targets, and the amount of data in such scenarios needs to be increased to improve the recognition ability of the model. In contrast, the model using the Slimneck backbone network did not misclassify, and the eighth target was detected; such results represent a large improvement over the baseline model.
4. Conclusions
The research presented in this paper has proposed a way to face the challenge of object detection in complex natural environments, a critical issue in security and surveillance. We first focused on single-stage algorithms, because they are good enough for demanding hardware and real-time requirements. Secondly, we chose the YOLO framework, because it is the most popular in the industry and easy to deploy. Finally, we chose the YOLOV5s model because of its lightweight design and greater room for improvement. By employing an enhanced version of the YOLOv5 algorithm, this study has not only demonstrated a significant improvement based on the decoupled head and Slimneck within a diverse and challenging setting, but also revealed the drawbacks that arise from blindly making improvements to the model. In fact, YOLOV5s is lighter, several of the improved strategies do not drastically increase the number of model parameters, and its detection performance even exceeds that of YOLOV8s.
One of the pivotal contributions of this work is the innovative approach to dataset augmentation. By combining image enhancement techniques with copy–paste methods, we have not only enriched our dataset but also expanded it to include other scenarios, thereby creating a robust and diverse dataset of human targets. This comprehensive dataset has been instrumental in training the YOLOv5 model to better detect and respond to the intricacies of natural environments.
Furthermore, integrating multiple adaptive improvement strategies into the YOLOv5 algorithm has yielded remarkable results. Our modifications, tailored to the specific demands of the detection task and based on BIFPN, the decoupled head, and Slimneck, have significantly enhanced the model’s performance across various indicators. The rigorous testing conducted on our expanded dataset has validated the effectiveness of these improvements, leading to a model that is not only accurate but also highly adaptable to deployment and real-time requirements. Meanwhile, the results proved that some of the other most popular improvements are not suitable for our task.
Through experimental comparisons, we discuss whether the hottest improvement options in the current community work in real complex scenarios. The addition of attention mechanisms such as CBAM and SOCA does not improve YOLOV5s much. Improvements based on the head network, such as ASFF, despite having stronger feature fusion capabilities, do not match the feature extraction capabilities of the YOLOV5s backbone, resulting in performance degradation and misclassification during recognition. Experiments also show that backbone networks such as ConvNext and MobileNet should not be blindly swapped into the YOLO framework; doing so not only increases the computational demand but also decreases detection performance.
In conclusion, this paper has made strides in addressing the detection of tiny objects in complex environments, providing valuable insights for further research. However, it is important to note the existing limitations: Firstly, the model’s training parameters and hyperparameters may not be optimal and require extensive experimentation to find the best settings. Secondly, the learning strategy used is conventional and does not specifically cater to the challenges of few samples. Exploring alternative strategies, such as active learning and supervised deep correlation tracking, could help address data scarcity issues. Lastly, the dataset augmentation technique, which involves copying and pasting objects, does not consider the context between the target and the new background, resulting in unrealistic synthetic images. Incorporating context and using seamless cloning techniques such as Poisson blending could improve the naturalness of the augmented data. By addressing these areas, future research can aim to achieve more robust and natural results in the challenging domain of target detection.