Article

YOLO-SSFS: A Method Combining SPD-Conv/STDL/IM-FPN/SIoU for Outdoor Small Target Vehicle Detection

1 School of Mechanical Engineering, Jiangsu University of Technology, Changzhou 213000, China
2 School of Automobile and Traffic Engineering, Jiangsu University of Technology, Changzhou 213000, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3744; https://doi.org/10.3390/electronics12183744
Submission received: 16 August 2023 / Revised: 31 August 2023 / Accepted: 3 September 2023 / Published: 5 September 2023

Abstract:
As a vital part of autonomous driving, vehicle detection, especially for outdoor small target vehicles, has attracted great attention from researchers during recent years. To ameliorate the difficulty in accurately identifying outdoor small vehicle targets in dense environments, this paper proposes a new structure named YOLO-SSFS, in which SPD-Conv, a small target detection layer (STDL), the Improved Feature Pyramid Network structure (IM-FPN), and the SCYLLA-IoU (SIoU) loss function are introduced. Firstly, the multi-scale fusion module of the original algorithm is improved by adding a detection layer for smaller targets. This detection layer preserves shallow semantic information, which helps to refine the algorithm’s detection accuracy for small targets. Then, a new Convolutional Neural Network (CNN) building block named SPD-Conv is constructed to replace the pooling layers and convolutional layers in the YOLOv5 algorithm, which reduces information loss, preserves the original fine-grained details of the image, and improves the learning ability. Afterwards, a new FPN structure is created to retain more information conducive to small target detection during the feature fusion process so as to enhance the robustness of the method. Finally, to speed up the convergence of the loss function, the SIoU loss function is introduced to replace Complete-IoU (CIoU) in the original algorithm. To verify the effectiveness of the improved algorithm, we conduct a series of experiments on the VisDrone dataset and perform comparative analyses of the results. The results obtained demonstrate that compared with the original YOLOv5, the proposed model performs better in small target detection. The mean average precision (mAP) is 83.07%, which is 7.63% higher than that for YOLOv5, while the detection speed reaches 52 frames per second (FPS), meeting the requirements for real-time detection.

1. Introduction

Artificial intelligence (AI) technology is developing rapidly and making great contributions to intelligent transportation systems and autonomous driving. In autonomous driving technology, accurate vehicle identification is a critical module that provides timely data support for the whole system. Therefore, it is highly practical to design an accurate and real-time vehicle detection algorithm.
Vehicle detection is often deployed in complex environments, so the precision of target recognition is not ideal due to the varying sizes of detection objects. Traditional image target detection algorithms are usually divided into the following steps: candidate region selection, feature extraction, and classification. These algorithms perform well in detection accuracy, but their generalization ability is weak. Different targets require correspondingly designed feature extraction methods, and the detection ability depends heavily on these hand-crafted structures, which limits real-time vehicle detection efficiency.
With the unremitting efforts of scholars in recent years, vehicle detection algorithms based on deep learning have increased in number and provide better detection performance than the original algorithms. In this paper, to compensate for the shortcomings of traditional methods, we put forward an improved algorithm based on YOLOv5 as the main framework to realize small target detection in dense conditions. This algorithm aims to simplify the detection process while quickly and accurately detecting targets, reducing the false or missed detections. The external perception system can accurately identify vehicles and assist the system by making appropriate judgments during the autonomous driving process, which can optimize driving strategies and further improve the energy recovery efficiency of new energy vehicles.
Therefore, we take YOLOv5 as the detector; it incorporates the advantages of the earlier versions of YOLO, improving detection speed and accuracy. However, for different scenarios, the existing algorithm still has several disadvantages, such as missed and false detections of small vehicle targets, which need to be addressed.
To deal with the shortcomings of YOLOv5, we set the following optimization goals: improving the sensitivity of the algorithm to small vehicle targets, reducing the probability of misjudging vehicle targets during detection, and enabling the algorithm to achieve rapid and accurate detection for autonomous driving. In this paper, we adopt YOLOv5s, the smallest model in the YOLOv5 family, as the basis for modification and optimization to create a new structure named YOLO-SSFS. The contributions are as follows:
(1) To enhance YOLOv5s’s detection ability for small targets on complex roads, we choose shallow layers with smaller receptive fields for improvement. We assume that increasing the scale of feature detection and changing the feature fusion mode can greatly enhance the sensitivity of the network for small target detection.
(2) To improve YOLOv5s’s ability to detect dense targets on complex roads, we insert a new convolution module, SPD-Conv, into the algorithm structure, which can effectively reduce the loss of fine-grained information and avoid false or missed detections caused by low pixel counts.
(3) In order to retain more shallow and deep semantic information during the feature fusion stage, we draw on the core idea of the Bi-directional Feature Pyramid Network (BiFPN) structure, which adds cross-scale connecting lines to allow small target information in the shallow layer to participate in the feature fusion, improving the algorithm’s detection ability for small outdoor vehicle targets.
(4) The initial loss function CIoU of YOLOv5 is replaced by the SIoU loss function, which speeds up the rate of convergence and detection speed.

2. Related Works

2.1. Development of Object Detection Algorithms

Classical detection algorithms utilize sliding windows with different sizes to select the possible regions of the detection target on the image and then adopt manually designed features to extract features from these regions, such as Scale Invariant Feature Transform (SIFT) [1], Histogram of Oriented Gradient (HOG) [2], Deformable Part Model (DPM) [3], Local Binary Pattern (LBP) [4], etc. Subsequently, the selected features are transferred into Support Vector Machine (SVM) [5] or Adaptive Boosting (Adaboost) [6] and other classifiers for classification and output. Due to the influence of the sliding window size and the manually constructed feature extraction method, the algorithms are not applicable in different scenes, which leads to window redundancy, poor generalization, and other problems.
At the turn of this century, deep learning algorithms inherited from machine learning emerged. In 1990, Yann et al. released an innovative deep learning network known as LeNet [7], which was treated as the first applicable convolutional neural network model. They laid a foundation for developing subsequent convolutional neural networks. In 2012, AlexNet [8] emerged as the winner of the large-scale visual identification competition, revealing the priority of neural networks in object recognition. He et al. [9] put forward Spatial Pyramid Pooling Networks (SPPNet), which employed spatial pyramid pooling layers to extract region-specific features from the image feature map using spatial relationships, which substantially improves the generalization of the algorithm. In 2014, R. Girshick et al. [10] introduced Region-Based Convolutional Neural Networks (R-CNN), a deep learning-based target recognition algorithm that utilized CNNs for automatic feature extraction from images. Subsequently, these authors made improvements to the R-CNN and proposed the Fast R-CNN [11] algorithm in 2015. It integrated the advantages of SPPNet and R-CNN, introducing the RoI Pooling layer into the network, and adopted a multitask loss function to optimize the model and improve the accuracy of the algorithm. Faster R-CNN [12] was launched later and adopted a regional proposal network (RPN) to replace the selective search, aiming at recommending candidate regions. Based on Faster R-CNN, many two-stage target detection algorithms were created. Dai et al. [13] produced Region-based Fully Convolutional Networks (RFCN), in which a location-sensitive score map, location-sensitive area pooling, and location-sensitive regression were introduced to update the model classification and detection accuracy. Aiming at the defects of Region of Interest (RoI) Pooling in Faster R-CNN, He et al. [14] proposed Mask R-CNN, which used RoI Align instead of RoI Pooling to maintain consistent feature map size and improve algorithm performance. Also, in 2015, YOLO [15] came into being as a one-stage algorithm in which object detection can be simplified into a regression problem, and the category probability and location parameter of the target can be obtained through regression models directly. Liu et al. [16] proposed Single Shot Detection (SSD) that introduced the primary frame mechanism to return target borders of various sizes and utilized feature maps of multiple shapes to detect targets of different pixel sizes. Soon, YOLOv2 [17] was also built, using Darknet-19 for feature extraction in the backbone for the first time. Then, YOLOv3 [18] was established, which introduced Darknet53 [19] into the backbone to enhance algorithm detection accuracy. At the beginning of the design of two-stage algorithms, most of them focused on the accuracy of detection, and their numerous parameters resulted in the disadvantage of slow recognition speed. The first-stage algorithm simplifies the detection process but inevitably leads to a decrease in detection accuracy. Fortunately, the creation of YOLOv4 [20] and YOLOv5 facilitated the target detection algorithm in both detection accuracy and detection speed.

2.2. Achievements Related to Small Target Detection

Small target detection is a challenge that attracts the attention of many researchers. Kisantal et al. [21] copied and pasted small targets that were difficult to detect into the image and changed their attitude angles. By oversampling, they improved the detection accuracy of small targets, addressing the issue of their low representation in images. Zhao et al. [22] used a long short-term memory (LSTM) network to reconstruct the FPN architecture and fused it with SSD to establish a new feature fusion network called Memory SSD (MSSD). The algorithm achieved reasonable experimental results on the Pascal VOC dataset. Nayan et al. [23] achieved real-time detection using the algorithm, shifting the focus of the algorithm to previously overlooked small targets. This algorithm mainly uses upsampling and skip connections to extract features from different scales, providing a solution to existing problems. Zhou et al. [24] employed an enhanced version of the K-Means algorithm to develop an a priori anchor box that is well-suited for the unique shape characteristics of ship targets and optimized the Non-Maximum Suppression (NMS) algorithm to eliminate ship candidate boxes in overlapping areas, thus avoiding missed detections caused by relatively close ships. Cai et al. [25] introduced the Cascade classifier to adjust the Intersection over Union (IoU) threshold and proposed the Cascade R-CNN, which was able to mitigate noise interference in detection boxes. This provided a new approach to solving small target detection problems. Bai et al. [26] proposed a generator to improve the resolution of existing images, allowing small and blurry objects to be detected and ultimately achieving more accurate detection results. Li et al. [27] built a generative adversarial network based on the difference between small and large targets. Inspired by the idea of learning a random-noise-to-image mapping from traditional generative adversarial networks, they employed the network to learn the mapping between small target and large target features. Lim et al. [28] incorporated multi-scale features and contextual information from different levels, while also implementing an attention-based object detection method that could accurately identify small targets in complex images. This approach resulted in improved detection accuracy, and the fusion of features from multiple levels further enhanced the effectiveness of the algorithm. Xu et al. [29] applied Focal Loss to replace the loss function in the SSD network, which could reduce the influence of easily separable sample loss on the total loss and improve the model’s accuracy. Based on YOLOv3-Tiny, Sri et al. [30] raised the network’s learning ability by integrating spatial pyramid pooling and feature stitching, facilitating speed and accuracy in vehicle detection. Lin et al. [32] enhanced the Rectified Linear Unit (ReLU) activation function and incorporated a local response normalization layer in the conventional Convolutional Neural Network. They also utilized the maximum superposition method, refined segmentation and correction of the original activation function, and successfully achieved precise image recognition. Liu et al. [31] proposed an exponentially learnable power function softmax pooling layer which can improve the detection rate. Saponara et al. [33] adopted a Sparse Autoencoder (SAE) algorithm to reconstruct small fingerprint images.
The SAE is fine-tuned and optimized with L2 and sparsity regularization, thereby improving the efficiency of architecture learning and the detection accuracy. To enhance the ability to detect small targets, lightweight models such as SqueezeNet [34], MobileNet [35], and ShuffleNet [36] were developed and are commonly applied in deep learning networks. These methods provide some solutions for small target detection, but researching solutions for detecting objects against complex backgrounds remains of practical significance. We need to further develop faster and more accurate algorithms to meet practical needs.

3. Improved YOLOv5 Network

3.1. Introduction of YOLOv5

YOLO (You Only Look Once) uses a convolutional neural network to predict target locations and categories. As a typical one-stage algorithm, it processes the entire image with a single neural network [37] and then divides it into regions to evaluate the probability of candidate objects in each region. Compared with multi-stage detection algorithms, YOLOv5 can ensure real-time detection while maintaining excellent accuracy [38].
The YOLOv5 model was developed from the YOLOv3 model. Unlike earlier iterations of YOLO, YOLOv5 is a collection of code clusters, mainly composed of four model variants. The fundamental structures of these four variants are the same, but the network depth and width parameters vary, resulting in different network performance. The YOLOv5 model is built from the following four components: Input, Backbone, Neck, and Output. The overall model is displayed in Figure 1.
The input of YOLOv5 adopts the Mosaic data enhancement mode and uses random cropping, scaling, combination, and other methods to compose new training images. The primary purpose is to enrich the dataset and enhance the robustness of the algorithm. Meanwhile, the model also uses adaptive picture scaling to reduce computation and speed up inference.
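As a rough illustration of the Mosaic idea (not the exact YOLOv5 implementation), the sketch below tiles four source images into one training canvas; the function name, the simple 2 × 2 layout, and the omission of bounding-box remapping are our simplifications.

```python
import numpy as np

def mosaic_4(images, out_size=640):
    """Minimal Mosaic sketch: tile four images into one training sample.
    `images` is a list of four H x W x 3 uint8 arrays, each assumed to be
    at least out_size x out_size; the real YOLOv5 pipeline also remaps the
    bounding boxes, which is omitted here for brevity."""
    # Random split point so the four tiles have varying sizes.
    xc = np.random.randint(out_size // 4, 3 * out_size // 4)
    yc = np.random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey padding

    regions = [(0, 0, xc, yc), (xc, 0, out_size, yc),
               (0, yc, xc, out_size), (xc, yc, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Naively crop the top-left corner of each source image to fit its tile.
        canvas[y1:y2, x1:x2] = img[:h, :w]
    return canvas
```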
YOLOv5 implements downsampling in the backbone and concentrates the image information into the channel dimension. In addition, CSPBottleNeck is replaced by a new module in the backbone, called the C3 module, to enhance the algorithm’s learning ability and detection accuracy and thereby realize a more lightweight model.
The Neck component of YOLOv5 follows the structure of YOLOv4 and utilizes a combination of FPN and PAN network structures. In transmitting the semantic information of detection objects, FPN mainly transfers semantic information from deep to shallow feature maps. In contrast, PAN mainly transfers positioning information from shallow feature maps to deep feature maps. FPN interacts with PAN to realize parameter aggregation across different detection layers, significantly strengthening feature fusion.
In the output, the bounding box regression function for YOLOv5 is CIoU. In the detection process, the main considerations are the overlapping areas between the bounding boxes, center point distance, and aspect ratio, which greatly reduces the interference of occlusion overlap on the precision of target detection. The evaluation of the algorithm for prediction box regression has also become more accurate. The sigmoid activation function, characterized by continuity and smoothness, is used in the detection layer and is convenient for derivation.

3.2. SPD-Conv

In the target detection process, the detection precision for small objects is often very low compared with that for normal objects. In the entire frame, the proportion of pixels belonging to small targets is relatively low, and the information available for model learning is minimal. In addition, small targets are often accompanied by larger ones, which tend to dominate the learning procedure during detection, making small targets undetectable. In this situation, the reliability of convolutional neural networks deteriorates markedly. In the early layers of a convolutional neural network, the image resolution is suitable for learning large targets, and a large amount of redundant information can be filtered out by strided convolution and pooling, so the features learned by the model are good. However, there is much less redundant information when the number of image pixels is low or the detection target is small. In this case, strided convolution and pooling lead to the loss of fine-grained information, resulting in insufficient learning of small targets, which is an important reason for the low efficiency of small target detection.
To solve the problems caused by strided convolution and pooling, we add a new module called SPD-Conv [40] to YOLOv5, which replaces the strided convolution and pooling layers in the original convolutional network. The SPD-Conv module comprises two parts: a space-to-depth (SPD) layer and a non-strided convolution layer. The SPD layer downsamples the feature map while retaining all of the information in the channel dimension, and it is followed by a non-strided convolution that uses learnable parameters to reduce the number of channels. We illustrate this with the following example.
We assume a general input feature map $X$ of size $W \times W \times D_1$, which is sliced into a sequence of sub-feature maps with sampling stride $P$:

$$\begin{aligned} f_{0,0} &= X[0{:}W{:}P,\ 0{:}W{:}P], \quad f_{1,0} = X[1{:}W{:}P,\ 0{:}W{:}P], \quad \ldots, \quad f_{P-1,0} = X[P-1{:}W{:}P,\ 0{:}W{:}P],\\ f_{0,1} &= X[0{:}W{:}P,\ 1{:}W{:}P], \quad f_{1,1} = X[1{:}W{:}P,\ 1{:}W{:}P], \quad \ldots, \quad f_{P-1,1} = X[P-1{:}W{:}P,\ 1{:}W{:}P],\\ &\ \,\vdots\\ f_{0,P-1} &= X[0{:}W{:}P,\ P-1{:}W{:}P], \quad \ldots, \quad f_{P-1,P-1} = X[P-1{:}W{:}P,\ P-1{:}W{:}P]. \end{aligned}$$

Taking any intermediate feature map $X$ as an example, each sub-map $f_{x,y}$ consists of the entries $X(i, j)$ for which $i + x$ and $j + y$ are divisible by $P$; therefore, each sub-map downsamples the feature map $X$ by a fixed factor. Figure 2 gives an example for $P = 2$, where we obtain the sub-feature sequence

$$f_{0,0} = X[0{:}W{:}P,\ 0{:}W{:}P], \quad f_{1,0} = X[1{:}W{:}P,\ 0{:}W{:}P], \quad f_{0,1} = X[0{:}W{:}P,\ 1{:}W{:}P], \quad f_{1,1} = X[1{:}W{:}P,\ 1{:}W{:}P].$$

Each sub-map has shape $\left(\frac{W}{2}, \frac{W}{2}, D_1\right)$; that is, the downsampling factor is 2.

Afterwards, the divided sub-feature maps are concatenated along the channel dimension to form a new feature map $X'$, whose spatial dimensions are reduced by the scaling factor $P$ and whose channel dimension is expanded by a factor of $P^2$. That is to say, after the operation above, the feature map $X(W, W, D_1)$ is transformed into an intermediate feature map $X'\left(\frac{W}{P}, \frac{W}{P}, P^2 D_1\right)$.

After completing the SPD transformation of the feature map, we add a non-strided convolution layer with $D_2$ filters (i.e., stride = 1), where $D_2 < P^2 D_1$, which converts $X'\left(\frac{W}{P}, \frac{W}{P}, P^2 D_1\right)$ into $X''\left(\frac{W}{P}, \frac{W}{P}, D_2\right)$. The non-strided convolution is used to retain the discriminative feature information to the greatest extent possible. By contrast, when a filter with an odd stride is used, such as a stride of 3, the feature map is scaled down but each pixel is sampled only once; when the filter has an even stride, such as a stride of 2, the sampling is imbalanced, with even and odd rows (columns) sampled unequally often.

In summary, using a filter with a stride greater than 1 leads to the loss of fine-grained information in the layer. Although on the surface it also transforms the original feature map $X(W, W, D_1)$ into $X''\left(\frac{W}{P}, \frac{W}{P}, D_2\right)$, the intermediate feature map $X'\left(\frac{W}{P}, \frac{W}{P}, P^2 D_1\right)$ never exists, which often leads to the omission of small target information. The new structure obtained after importing the SPD layer is shown in Figure 3.
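A minimal PyTorch sketch of this SPD layer followed by a non-strided convolution is given below; the class name, channel sizes, and activation choice are our assumptions rather than the official SPD-Conv code.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: space-to-depth slicing followed by a stride-1 conv."""
    def __init__(self, c1, c2, scale=2):
        super().__init__()
        self.scale = scale
        # Non-strided convolution: reduces P^2 * c1 channels to c2 without
        # discarding spatial samples.
        self.conv = nn.Conv2d(c1 * scale * scale, c2, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        p = self.scale
        # Space-to-depth: gather the P x P sub-maps f_{x,y} onto the channel axis,
        # turning (B, C, W, W) into (B, P^2*C, W/P, W/P) with no information loss.
        subs = [x[..., i::p, j::p] for i in range(p) for j in range(p)]
        x = torch.cat(subs, dim=1)
        return self.act(self.bn(self.conv(x)))

# Example: a 640 x 640 feature map with 32 channels is downsampled to 320 x 320.
y = SPDConv(32, 64)(torch.randn(1, 32, 640, 640))
print(y.shape)  # torch.Size([1, 64, 320, 320])
```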

3.3. Small Target Detection Layer (STDL)

One of the important features of the YOLOv5 series of algorithms is that the detection results are obtained by integrating multiple scales. There is no strict requirement on the input size; 640 × 640 images are typically used. The algorithm uses a 20 × 20 feature detection layer to detect large targets. The 20 × 20 map is then upsampled by a factor of two to obtain a 40 × 40 feature layer for detecting medium targets, and the 40 × 40 map is upsampled again to obtain an 80 × 80 feature layer for detecting small targets. YOLOv5 thus detects targets of different sizes on three scales, which overcomes the shortcomings of previous single-scale target detection algorithms and effectively improves detection accuracy.
In complex road situations, many targets occupy a small fraction of the image or video frame because they are more distant. The 80 × 80 scale used by the original YOLOv5 often misses targets with very few pixels due to the limitation of the receptive field. For these targets, we need a finer feature detection layer. Therefore, we add a 160 × 160 feature layer to focus on smaller targets, which compensates for the insensitivity of the original algorithm to small targets, as summarized in the sketch after this paragraph. At the same time, the feature fusion method of the new algorithm is changed accordingly: by combining information from the 80 × 80 feature layer with the newly added 160 × 160 feature layer, the algorithm can detect small targets in the image with greater accuracy. Through this improvement, the detection ability of the algorithm is further enhanced, which can greatly reduce the missed detection rate of small target vehicles. We introduce a P2 layer that keeps more shallow semantic information in feature fusion to retain the location information of the detected targets, ultimately improving the detection accuracy for small target vehicles. Examples of the output feature maps at the four detection scales are shown in Figure 4.
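The correspondence between a 640 × 640 input and the four detection scales can be summarized as follows; the stride of 4 assigned to the new P2 head is our assumption, consistent with the 160 × 160 feature layer described above.

```python
input_size = 640
# Detection heads and their downsampling strides:
# P2 (new, smallest targets), then the original P3, P4, P5 heads.
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for name, s in strides.items():
    side = input_size // s
    print(f"{name}: stride {s:>2} -> {side} x {side} feature map")
# P2: stride  4 -> 160 x 160 feature map
# P3: stride  8 -> 80 x 80 feature map
# P4: stride 16 -> 40 x 40 feature map
# P5: stride 32 -> 20 x 20 feature map
```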

3.4. Improved Feature Pyramid Network (IM-FPN)

The traditional FPN structure is only composed of a single top-down information flow, but the PANet network adds a bottom-up enhanced information flow on top of the FPN, effectively preserving richer shallow features. BiFPN makes further improvements to address the shortcomings of PANet, as displayed in Figure 5c. In the original network structure, BiFPN carries out feature fusion from layers 3 to 7, and it is believed that if a node has only one input edge, its contribution to the network is relatively small. Therefore, for the sake of minimizing the computational complexity of the model structure, certain feature fusion nodes in layers 3 and 7 of the original network are deleted. At the same time, an extra edge is added to the remaining layers to realize a cross-scale connection, and the extracted features are directly fused with features of the corresponding size in the bottom-up path, which retains the shallow semantic information required for detecting small targets without causing the loss of deep semantic information. Our method adds a new detection scale for small object detection, which brings the second (P2) layer into the fusion. In the fusion process, the model then retains too much shallow semantic information, resulting in a severe loss of deep semantic information in the network.
Although retaining too much shallow information leads to the loss of deep information, the location information of small targets often resides in the shallow layers, so the shallow information must be kept. Therefore, we developed an idea to solve this problem. The new structure is inspired by BiFPN: two cross-scale connection lines are added to fuse more deep and shallow semantic information and thereby improve the ability to learn features. Take the fusion feature at Level 3 as an example, where $P_3^{out}$ is the output feature of the third layer in the bottom-up path. According to the structural diagram in Figure 5d, the fusion feature is calculated with Formula (1) as follows:

$$P_3^{out} = Conv\!\left(\frac{w_1 \cdot P_3^{td} + w_2 \cdot P_3^{in} + w_3 \cdot Resize\!\left(P_2^{out}\right)}{w_1 + w_2 + w_3 + \epsilon}\right) \tag{1}$$

where $Conv(\cdot)$ represents the convolutional operation for feature processing, $w_i$ represents the learnable weights, $Resize(\cdot)$ represents the upsampling or downsampling operation, and $\epsilon$ is a small constant that prevents numerical instability.
Adding cross-scale connecting lines ensures that more complete features can be obtained during the information fusion process without increasing computational costs. The comparison of feature maps before and after adding IM-FPN is shown in Figure 6.
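A minimal sketch of the weighted fusion in Formula (1) is shown below, assuming all inputs have already been resized to a common resolution; the module interface and weight normalization follow the BiFPN fast-fusion idea rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse n same-channel feature maps with learnable, normalized weights."""
    def __init__(self, n_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of tensors already resized to a common spatial size,
        # e.g. [P3_td, P3_in, Resize(P2_out)] for the Level-3 node in Formula (1).
        w = F.relu(self.w)                # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # normalize: w_i / (sum_j w_j + eps)
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Example: three 256-channel maps at 80 x 80 are fused into one 80 x 80 map.
f = [torch.randn(1, 256, 80, 80) for _ in range(3)]
out = WeightedFusion(3, 256)(f)
print(out.shape)  # torch.Size([1, 256, 80, 80])
```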

3.5. New Loss Function of Boundary Box—SIoU

The loss function measures the difference between the predicted values and the ground-truth values of the model and provides the correct direction for training. In the basic YOLOv5s model, the CIoU loss function is used as the bounding box loss function. It can be defined by the formula:

$$CIoU = IoU - \left(\frac{\rho^2(b, b^{gt})}{c^2} + \alpha\nu\right)$$

where $\rho^2(b, b^{gt})$ refers to the squared distance between the centroids of the predicted and ground-truth bounding boxes, $c$ refers to the diagonal of the minimum enclosing area, and $\alpha\nu$ refers to the impact factor of the penalty term.

As an excellent loss function, CIoU considers the influence of shape on the loss, especially the additional loss introduced by the width and height of the boxes, but it ignores the influence of angle on the results. Therefore, we introduce SIoU instead of CIoU to accelerate the convergence of the distance between the predicted and ground-truth boxes, which reduces the inference time and computational cost. Figure 7 shows how the angle cost contributes to the loss function. SIoU considers not only the cost of the geometric parameters but also the angle cost between the predicted box and the ground-truth box. The angle cost is calculated as follows:

$$\Lambda = 1 - 2\sin^2\!\left(\arcsin(x) - \frac{\pi}{4}\right)$$

where $x = \sin\alpha = \frac{c_h}{\sigma}$, $\sigma = \sqrt{\left(b_{cx}^{gt} - b_{cx}\right)^2 + \left(b_{cy}^{gt} - b_{cy}\right)^2}$, and $c_h = \max\left(b_{cy}^{gt}, b_{cy}\right) - \min\left(b_{cy}^{gt}, b_{cy}\right)$.

The definition formula for the distance cost of SIoU is

$$\Delta = 2 - e^{-\gamma\rho_x} - e^{-\gamma\rho_y}$$

where $\gamma = 2 - \Lambda$, $\rho_x = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_w}\right)^2$, $\rho_y = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_h}\right)^2$, and the contribution of the distance cost is governed by the angle $\alpha$ through $\gamma$.

The formula for the shape cost is as follows:

$$\Omega = \left(1 - e^{-\omega_w}\right)^{\theta} + \left(1 - e^{-\omega_h}\right)^{\theta}$$

where $\omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}$, $\omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})}$, and $\theta$ controls the degree of attention paid to the shape cost. According to [41], $\theta$ should lie between 2 and 6, so in this article we select the median value $\theta = 4$.

The IoU loss is expressed by the following formula:

$$IoU = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|}$$

The final calculation formula of SIoU is given by

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2}$$
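Putting the angle, distance, shape, and IoU terms together, a sketch of the SIoU loss might look like the following; boxes are assumed to be in (cx, cy, w, h) format, and the code follows the formulas above rather than any particular library implementation.

```python
import math
import torch

def siou_loss(pred, target, eps=1e-7, theta=4):
    """Sketch of the SIoU loss for boxes in (cx, cy, w, h) format, shape (N, 4)."""
    # Corner coordinates and IoU.
    p_x1, p_y1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    p_x2, p_y2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    t_x1, t_y1 = target[:, 0] - target[:, 2] / 2, target[:, 1] - target[:, 3] / 2
    t_x2, t_y2 = target[:, 0] + target[:, 2] / 2, target[:, 1] + target[:, 3] / 2
    inter = (torch.min(p_x2, t_x2) - torch.max(p_x1, t_x1)).clamp(0) * \
            (torch.min(p_y2, t_y2) - torch.max(p_y1, t_y1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter + eps
    iou = inter / union

    # Angle cost: Lambda = 1 - 2 * sin^2(arcsin(x) - pi/4).
    dx = target[:, 0] - pred[:, 0]                       # center offsets
    dy = target[:, 1] - pred[:, 1]
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps
    x = torch.abs(dy) / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(x.clamp(-1, 1)) - math.pi / 4) ** 2

    # Distance cost, scaled by the enclosing box (cw, ch), with gamma = 2 - Lambda.
    cw = torch.max(p_x2, t_x2) - torch.min(p_x1, t_x1) + eps
    ch = torch.max(p_y2, t_y2) - torch.min(p_y1, t_y1) + eps
    gamma = 2 - angle
    dist = (1 - torch.exp(-gamma * (dx / cw) ** 2)) + \
           (1 - torch.exp(-gamma * (dy / ch) ** 2))

    # Shape cost with attention parameter theta = 4.
    ww = torch.abs(pred[:, 2] - target[:, 2]) / torch.max(pred[:, 2], target[:, 2])
    wh = torch.abs(pred[:, 3] - target[:, 3]) / torch.max(pred[:, 3], target[:, 3])
    shape = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta

    return 1 - iou + (dist + shape) / 2
```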
The overall network structure is shown in Figure 8. Although this measure results in a certain increase in computation, it can achieve better results for improving the network framework for datasets with dense targets and simple features.

4. Experimental Results

4.1. Experimental Dataset and Experimental Settings

We select the open-source VisDrone2019 dataset for object detection in this experiment. The VisDrone2019 dataset was collected by a research team at Tianjin University and consists of 288 video clips and more than 10,000 static images. The images were gathered in 14 Chinese cities that are far apart from each other and were taken in both urban and rural road environments. The dataset is carefully labeled to identify pedestrians, bicycles, and vehicles on the road. We mainly select the vehicle (car) tags as training and testing samples from the dataset, giving a total of 8178 pictures. The training, validation, and test sets are divided according to the ratio 8:1:1, whereby the training set and the test set are strictly independent. At the same time, the Mosaic data augmentation method is adopted to enrich the dataset and improve the stability of the algorithm. Figure 9 shows some typical pictures from the dataset, and Table 1 shows the data distribution.
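An 8:1:1 split of the 8178 selected images can be reproduced with a short script such as the one below; the function and the fixed random seed are illustrative assumptions, and the exact counts may differ from Table 1 by a few images depending on rounding.

```python
import random

def split_dataset(image_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split image IDs into train/val/test subsets in the given ratio."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(8178))
print(len(train), len(val), len(test))  # 6542 817 819 (rounding differs slightly from Table 1)
```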
The server’s hardware environment in this experiment is as follows: an AMD Ryzen 7 5800H with Radeon Graphics CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 3060 laptop GPU with 6 GB of video memory. The server runs Windows 11 and Python 3.7, and GPU acceleration is provided by CUDA 11.3 and cuDNN 8.2.1.
The experimental settings basically follow the officially recommended parameters of YOLOv5. The image size is set to 640 × 640 and the batch size to 8. To increase the diversity of the training samples and improve the generalization ability, YOLOv5 uses Mosaic data enhancement. The initial learning rate is lr0 = 0.01; the cosine annealing strategy is used to update the learning rate, with lrf = 0.1; the SGD optimizer is used, with the momentum factor set to 0.937. The experiment comprises 300 training epochs in total. As shown in Figure 10, the algorithm consists of two parts: data processing and recognition/classification. The specific steps are shown in the flowchart.
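For reference, these settings correspond to a configuration along the following lines; the dictionary is an illustrative summary, not the actual YOLOv5 hyperparameter file.

```python
# Illustrative summary of the training configuration described above.
train_cfg = {
    "img_size": 640,        # input resolution (640 x 640)
    "batch_size": 8,
    "epochs": 300,
    "optimizer": "SGD",
    "momentum": 0.937,
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.1,             # final LR fraction for cosine annealing
    "augmentation": "mosaic",
}
```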

4.2. Evaluation Metrics

In the experiments, the model is trained and verified on the VisDrone2019 dataset. In the evaluation of model performance, average precision (AP) and mean average precision (mAP) are generally used as indicators of model accuracy; they comprehensively consider precision (P) and recall (R). The speed of model detection is measured by the number of parameters and the frames per second (FPS). Precision (P), recall (R), AP, and mAP are expressed as
$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,\mathrm{d}R$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP(i)$$

where $TP$ (true positives) represents the number of targets detected correctly, $FP$ (false positives) is the number of targets detected incorrectly, $FN$ (false negatives) stands for the number of targets not detected, and $n$ (classes) indicates the number of categories to be classified. $AP$ indicates the average precision for a single target class.
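A compact sketch of these metrics is given below; it assumes the per-class precision-recall points are already available and uses simple trapezoidal integration for AP, which differs in detail from the VOC/COCO interpolation schemes.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp)            # P = TP / (TP + FP)
    r = tp / (tp + fn)            # R = TP / (TP + FN)
    return p, r

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (trapezoidal sketch)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Example with made-up counts for a single class ("car"):
p, r = precision_recall(tp=845, fp=155, fn=326)
print(round(p, 3), round(r, 3))
```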
FPS is the number of frames processed per second; a higher value means smoother detection. Typically, real-time target detection requirements can be met when the FPS is above 50. The calculation formula is as follows:

$$FPS = \frac{1000}{\delta + \beta + NMS}$$

where $\delta$ is the image preprocessing time, $\beta$ denotes the inference time, and $NMS$ represents the post-processing time, all measured in milliseconds.
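In practice, measuring FPS amounts to timing the three stages per image, as in the sketch below; the timing values are placeholders chosen only to land near the 52 FPS reported in this paper.

```python
def fps(preprocess_ms, inference_ms, nms_ms):
    """FPS = 1000 / (preprocessing + inference + post-processing), times in ms."""
    return 1000.0 / (preprocess_ms + inference_ms + nms_ms)

# Placeholder timings roughly consistent with ~52 FPS.
print(round(fps(2.0, 15.0, 2.2), 1))  # ~52.1
```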

4.3. Experiment Results

4.3.1. Experimental Analysis of Introducing the SPD-Conv Model

To demonstrate the performance of SPD-Conv in detecting target vehicles with small pixel proportions in images, we introduced the SPD-Conv module into the YOLOv5 algorithm, with other parts of the algorithm remaining unchanged. We conducted an experimental comparison on the VisDrone2019 dataset, and the data are recorded in Table 2.
The results in Table 2 clearly indicate that when the SPD-Conv component is inserted, although the parameter quantity is increased, its precision, recall, and mAP are all improved, and the detection accuracy is increased by 2.96%. Taken together, these results demonstrate the feasibility of introducing this module.

4.3.2. Experimental Analysis of Importing Small Target Detection Layer (STDL)

From the perspective of expanding the detection scales of the algorithm, this experiment adds a detection layer for detecting smaller targets. Correspondingly, the feature fusion part of the algorithm is adapted, while the other factors are kept unchanged. The same experimental parameters were used to verify whether the improved method can further optimize the algorithm.
According to the experimental data in Table 3, it can be observed that the effect of the extra small target detection layer is more prominent. With only a small amount of computation added, the detection accuracy increased by about 6.71%. Although FPS decreases in the detection process, the real-time detection task can be guaranteed. Therefore, the improved algorithm is effective in actual measurement.

4.3.3. Experimental Analysis of Importing Improved FPN (IM-FPN)

Due to the addition of a small target detection layer, the channel length of the feature fusion network increased, resulting in information loss. Therefore, an improved FPN structure was added to the original network. In this experiment, we introduced this new structure to verify its effectiveness in improving the detection of small targets.
The data in Table 4 indicate that the addition of the improved FPN structure alone results in the mAP reaching 76.40%. When the number of parameters increases, the FPS decreases somewhat, but its accuracy does not increase significantly, only 0.96%. It can be concluded from the above results that the optimization of the FPN structure has improved the accuracy of detection, but the overall effect is not obvious. Therefore, we designed more experiments and found similar results.

4.3.4. Experimental Analysis of Importing Improved SIoU

In order to speed up the detection rate, we substituted the CIoU loss function in the original algorithm with SIoU. In order to further verify its impact on the detection effect, we changed the loss function in the basic algorithm, and the experimental results are shown in Table 5.
From the above experimental results, it can be seen that replacing the loss function not only allows the algorithm to perform better in detection accuracy, but also improves the detection speed.
After the above experiments, we found that there was no overfitting phenomenon during any of the training processes. However, in case of overfitting, it can be mitigated using the following approaches: (1) rescreen the data to remove duplicate data; (2) use data augmentation methods to enrich the dataset; or (3) reduce the complexity of the model by appropriately decreasing the number of layers, nodes, or convolutional kernels, thereby avoiding the risk of overfitting.

4.3.5. Ablation Experiment

The four improvement methods proposed in this paper are adding the SPD-Conv module, adding a small target detection layer, improving the structure of the FPN and importing the SIoU loss function. To verify the effectiveness of these four improved methods, the ablation experiment was designed as follows.
To test the impact of adding each improvement method to the original YOLOv5s algorithm, we added them in sequence and evaluated their effectiveness.
Table 6 shows that the proposed improvements in this paper led to varying degrees of improvement in detection accuracy compared with unimproved algorithms on the VisDrone dataset. The ablation experiment shows that gradually introducing each improvement further enhances the algorithm’s sensitivity to small target vehicles. It can be inferred from the above experimental results that improving the FPN structure alone cannot significantly improve the effect. However, when combined with the other improvement methods, the detection of vehicle targets is considerably enhanced. It can be observed from the above experimental data that the introduction of the improved FPN structure retains the deep-level semantic information through cross-scale connectors so that it can participate in the final feature fusion and make up for the defect of too much shallow information. However, when the improved structure was introduced alone into the basic YOLOv5 algorithm, only P3 to P5 layers were used for feature fusion, and the semantic information was not excessively retained or ignored. In this case, adding cross-scale connection lines was of little significance, so the improvement was not evident. When the improved algorithm is transformed from a three-scale to four-scale structure, the function of this cross-scale connection line is obvious and the improvement is significant.
In multiple experiments, we obtained similar results, so the new structure YOLO-SSFS proposed in this article is beneficial for improving the detection effect. The detection accuracy increased by 7.63%, and the FPS reached 52 frames/s, guaranteeing detection accuracy and real-time performance to a certain extent. The experiment demonstrates that the algorithm proposed in this paper is capable of effectively detecting vehicle targets in various scenarios.

4.3.6. Contrast Experiment

To further verify the superiority of the YOLO-SSFS model, the original YOLOv5s model, the two-stage network Faster R-CNN, the one-stage anchor-free network FCOS, YOLOv3, and YOLOv4 were trained in the same experimental environment. With all the experimental data converging, we summarize the results in Table 7.
It can be seen from the table that there is a certain gap between the FPS of Faster R-CNN and FCOS algorithms and the other algorithms, both of which cannot meet the real-time detection requirements. In the comparison of the YOLO series algorithms, YOLOv3 is superior in Precision and Recall, but it is inferior to YOLO-SSFS in mAP by 2.65%. At the same time, although YOLO-SSFS results in a certain reduction in detection speed compared with YOLOv5s, it is still higher than that of the YOLOv3 and YOLOv4 algorithms, with a detection speed of 1.5 ms (52FPS) per image. Finally, YOLO-SSFS only differs from YOLOv5s in having more parameters but its detection accuracy is the greatest among these algorithms. The variation in mAP values of YOLO series algorithms during training is shown in Figure 11.
Figure 12 reveals the detection performance of YOLO series algorithms in different weather environments. Under the same environmental conditions, the new structure YOLO-SSFS can detect more distant vehicle small targets, greatly improving detection accuracy.

4.3.7. Generalization Experiment

In order to further verify the generalization of the algorithm and its ability to run correctly on different datasets, we conducted further experiments on the UA-DETRAC dataset. This dataset was filmed at 24 different locations in Beijing and Tianjin, with a large number of small target vehicles. The implementation results are shown in Table 8.
Table 8 shows the results of our algorithm trained on UA-DETRAC. After the improvement in the algorithm, both precision and recall improved to a certain extent, and its mAP has also increased by 3.18%, which proves that our algorithm has stable detection performance with different datasets and strong generalization ability.
Figure 13 compares the results between the YOLOv5 algorithm and the YOLO-SSFS algorithm. It is clear that YOLO-SSFS can detect smaller vehicle targets, further verifying the effectiveness of the proposed method.

5. Conclusions and Future Works

To accurately identify small vehicle targets in congested road conditions, this paper advances an optimized structure called YOLO-SSFS. In the newly designed network architecture, to retain the discriminative feature information, the SPD-Conv modules are adopted at suitable positions in the algorithm structure, which helps to preserve identification features as far as possible. In addition, an extra detection scale with a finer feature map (smaller receptive field) is added to YOLOv5 to detect small targets, making the algorithm more sensitive to them. In order to involve more shallow semantic information in the final feature fusion, a new structure is constructed by adding cross-scale connecting lines to fuse more information. Finally, we adopt SIoU to improve the detection speed and reduce computational costs. The images used for experimental training are extracted from the VisDrone vehicle dataset. The results indicate that compared with the original YOLOv5, YOLOv3, YOLOv4, Faster R-CNN, and FCOS network models, the mAP of the proposed model is increased by 7.63%, 2.65%, 4.76%, 26.07%, and 25.47%, respectively. Meanwhile, the detection speed of the optimized model is 1.5 ms (52 FPS) per image, which satisfies real-time requirements. Finally, to verify the generalization of the algorithm, we conducted tests on the UA-DETRAC dataset, for which the detection performance was also excellent. The different experimental results demonstrate that YOLO-SSFS can effectively realize real-time detection of small target vehicles.
However, the improved method only focuses on small target detection of vehicles under complex road conditions and fails to consider other factors that may affect the safety of autonomous driving, such as non-motorized vehicles and pedestrians. To tackle these limitations, in future studies we will collect more datasets of pedestrians and non-motorized vehicles and study the differences between different target types to enhance the generalization of the algorithm. In addition, the introduced modules require more computational cost; therefore, developing a lightweight version of the model will be addressed in future work.

Author Contributions

Methodology, Z.G.; validation, K.Z.; writing—original draft, Z.G.; writing—review and editing, K.Z. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (19KJB620001).

Data Availability Statement

The study used open data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999. [Google Scholar]
  2. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  3. Felzenszwalb, P.F.; Girshick, R.B.; Mcallester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Softw. Eng. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
  4. Guo, Z.H.; Zhang, L.; Zhang, D. A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process. 2012, 19, 1657–1663. [Google Scholar]
  5. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  6. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the Conference on Learning Theory, Nashville, TN, USA, 6–9 July 1997. [Google Scholar]
  7. Lecun, Y.; Bottou, L. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  8. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [Google Scholar] [CrossRef]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 23–28 June 2014. [Google Scholar]
  11. Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  13. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks; Curran Associates Inc.: Red Hook, NY, USA, 2016. [Google Scholar]
  14. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Müller, S.; Hutter, F. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  20. Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  21. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
  22. Zhao, T.; Liu, J.Y.; Shen, Q. An improved multi-gated feature pyramid network. Acta Opt. Sin. 2019, 39, 235–244. [Google Scholar]
  23. Nayan, A.A.; Saha, J.; Mozumder, A.N. Real Time Detection of Small Objects. Int. J. Innov. Technol. Explor. Eng. 2020, 9, 837. [Google Scholar]
  24. Zhou, Y.; Cai, Z.; Zhu, Y.; Yan, J. Automatic ship detection in SAR Image based on Multi-scale Faster R-CNN. J. Phys. Conf. Ser. 2020, 1550, 042006. [Google Scholar] [CrossRef]
  25. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  26. Bai, Y.; Zhang, Y.; Ding, M. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  27. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Lim, J.S.; Astrid, M.; Yoon, H.J.; Lee, S.I. Small Object Detection using Context and Attention. In Proceedings of the 2021 international Conference on Artificial intelligence in information and Communication (ICAIIC), Jeju Island, Republic of Korea, 13–16 April 2021. [Google Scholar]
  29. Xu, H.; Yang, D.; Jiang, Q. Improvement of lightweight vehicle detection network based on SSD. Comput. Eng. Appl. 2022, 58, 209–217. [Google Scholar]
  30. Sri, J.S.; Esther, R.P. LittleYOLO-SPP: A Delicate Real-Time Vehicle Detection Algorithm. Optik 2020, 225, 165818. [Google Scholar]
  31. Liu, M.; Wang, J.; Dong, G.G.; Yi, W.M. Weakly labeled sound event detection based on improved pooling layer. J. Signal Process. 2021, 37, 1907–1913. [Google Scholar] [CrossRef]
  32. Lin, G.; Shen, W. Research on convolutional neural network based on improved Relu piecewise activation function. Procedia Comput. Sci. 2018, 131, 977–984. [Google Scholar] [CrossRef]
  33. Saponara, S.; Elhanashi, A.; Gagliardi, A. Reconstruct fingerprint images using deep learning and sparse autoencoder algorithms. In Real-Time Image Processing and Deep Learning 2021; SPIE: Bellingham, WA, USA, 2021; Volume 11736. [Google Scholar]
  34. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  35. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  36. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  37. Abdusalomov, A.B.; Mukhiddinov, M.; Kutlimuratov, A.; Whangbo, T.K. Improved Real-Time Fire Warning System Based on Advanced Technologies for Visually Impaired People. Sensors 2022, 22, 7305. [Google Scholar] [CrossRef]
  38. Norkobil Saydirasulovich, S.; Abdusalomov, A.; Jamil, M.K.; Nasimov, R.; Kozhamzharova, D.; Cho, Y.-I. A YOLOv6-Based Improved Fire Detection Approach for Smart City Environments. Sensors 2023, 23, 3161. [Google Scholar] [CrossRef] [PubMed]
  39. Mukhiddinov, M.; Abdusalomov, A.B.; Cho, J. A Wildfire Smoke Detection System Using Unmanned Aerial Vehicle Images Based on the Optimized YOLOv5. Sensors 2022, 22, 9384. [Google Scholar] [CrossRef]
  40. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2022. [Google Scholar]
  41. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  42. Long, S.; Song, X.F.; Zhang, S.; Zhang, Q.L. Improved YOLOv5s aerial image vehicle detection research. Laser J. 2022, 43, 22–29. [Google Scholar] [CrossRef]
  43. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2019, 111, 257–276. [Google Scholar] [CrossRef]
Figure 1. The overall structure of YOLOv5 [39].
Figure 2. Illustration of SPD-Conv when P = 2.
Figure 3. New Conv structure after adding SPD layer.
Figure 4. Feature maps from different scales: (a) original picture; (b1) 160 × 160 feature map; (b2) 80 × 80 feature map; (b3) 40 × 40 feature map; (b4) 20 × 20 feature map.
Figure 5. Feature pyramid network design.
Figure 6. Variation in the feature fusion map of the Concat module after adding cross-scale lines.
Figure 7. Scheme for calculating angle costs.
Figure 8. Diagram of the YOLO-SSFS structure.
Figure 9. Typical images of the VisDrone dataset.
Figure 10. Flowchart of the proposed vehicle small target detection method.
Figure 11. Variation in mAP during the training process.
Figure 12. These images show different algorithms working under different lighting conditions: (a) sunny, (b) cloudy, and (c) evening. It can be seen that the algorithm is suitable for detection under different conditions.
Figure 13. These images show different algorithms working under different lighting conditions: (a) daytime, and (b) night.
Table 1. Dataset partition.

Category | Train | Val | Test | Total
Quantity | 6543 | 818 | 817 | 8178
Table 2. Experimental analysis of introducing the SPD-Conv module.

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
YOLOv5s | 7.01 | 84.50 | 67.44 | 75.44 | 57
YOLOv5s-SPD | 8.56 | 85.90 | 68.88 | 78.40 | 55
Table 3. Experimental analysis of introducing the Small Target Detection Layer (STDL).

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
YOLOv5s | 7.01 | 84.50 | 67.44 | 75.44 | 57
YOLOv5s (STDL) | 7.30 | 84.30 | 75.40 | 82.10 | 53
Table 4. Experimental analysis of introducing the Improved FPN (IM-FPN).

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
YOLOv5s | 7.01 | 84.50 | 67.44 | 75.44 | 57
YOLOv5s (IM-FPN) | 7.06 | 83.76 | 68.41 | 76.40 | 57
Table 5. Experimental analysis of introducing SIoU.

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
YOLOv5s | 7.01 | 84.50 | 67.44 | 75.44 | 57
YOLOv5s (SIoU) | 7.01 | 84.56 | 68.31 | 78.37 | 59
Table 6. Results of the ablation experiment.

SPD-Conv | STDL | IM-FPN | SIoU | mAP@0.5 (%) | FPS
 | | | | 75.44 | 57
√ | | | | 78.40 (+2.96) | 55
√ | √ | | | 80.25 (+1.85) | 54
√ | √ | √ | | 81.41 (+1.16) | 55
√ | √ | √ | √ | 83.07 (+1.66) | 52

“√” means the introduction of this module; the modules are introduced cumulatively, one row at a time.
Table 7. Results of the contrast experiment.

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
Faster R-CNN [42] | 60.52 | 82.20 | 53.77 | 57.00 | 33
FCOS [43] | 50.96 | / | / | 57.60 | 16
YOLOv3 | 61.52 | 84.77 | 76.54 | 80.42 | 46
YOLOv4 | 63.94 | 83.25 | 70.75 | 78.31 | 34
YOLOv5s | 7.01 | 84.50 | 67.44 | 75.44 | 57
YOLO-SSFS | 8.97 | 84.10 | 76.16 | 83.07 | 52
Table 8. Results of the generalization experiment.

Methods | Params (10^6) | Precision (%) | Recall (%) | mAP (%) | FPS
YOLOv5s | 7.01 | 86.91 | 67.98 | 85.72 | 57
YOLO-SSFS | 8.97 | 87.42 | 69.18 | 88.90 | 52

