Sensors
  • Article
  • Open Access

26 April 2021

Small Object Detection in Traffic Scenes Based on Attention Feature Fusion

Faculty of Vehicle Engineering and Mechanics, School of Automotive Engineering, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.

Abstract

There are many small objects in traffic scenes, but due to their low resolution and limited information, their detection is still a challenge. Small object detection is very important for the understanding of traffic scene environments. To improve the detection accuracy of small objects in traffic scenes, we propose a small object detection method based on attention feature fusion. First, a multi-scale channel attention block (MS-CAB) is designed, which uses local and global scales to aggregate the effective information of the feature maps. Based on this block, an attention feature fusion block (AFFB) is proposed, which can better integrate contextual information from different layers. Finally, the AFFB is used to replace the linear fusion module in the object detection network to obtain the final network structure. The experimental results show that, compared to the benchmark model YOLOv5s, this method achieves a higher mean Average Precision (mAP) while maintaining real-time performance. It increases the mAP of all objects by 0.9 percentage points on the validation set of the traffic scene dataset BDD100K and, at the same time, increases the mAP of small objects by 3.5 percentage points.

1. Introduction

In traffic scenes, the visual perception technology of intelligent vehicles can help automatic driving systems to perceive complex environments accurately and in time, which is a requirement for avoiding collisions and for safe driving. With the rapid development of computer vision technology, vehicle visual perception is increasingly being adopted in the field of automatic driving. For example, object detection based on deep learning has played a very important role in the field of automatic driving.
Object detection involves the delineation of the bounding box of an object to be detected in the given image, and then the determination of the class that the object in the box belongs to. Due to their large amount of calculations, redundant marker boxes, and poor robustness of manual features, traditional object detection algorithms are currently being replaced by their deep learning counterparts. Lightweight real-time object detection models, such as the “you only look once” (YOLO) algorithm [1,2,3], the single shot multibox detector (SSD) algorithm [4], Light-Head R-CNN [5], and ThunderNet [6], have already demonstrated good detection effects in actual application scenarios.
At present, the prevailing deep learning-based object detection algorithms, such as YOLOv5 [7], treat each region of the whole feature map equally by default, that is, each region has the same contribution to the final detection result. This means that they do not weigh the convolution features extracted from the network according to their position and importance. However, compared with simple ordinary scenes, there are usually more complex and rich semantic features around the object to be detected in actual traffic scenes. If the features of the object area are weighted according to their importance, the objects to be detected can be better positioned in the feature map and the detection accuracy and generalization ability of the model can be improved.
Furthermore, in traffic scenes, there are many small objects in the distance. These objects offer limited feature information due to their relatively small size, which makes detection more difficult. Research on small object detection includes the deconvolutional single shot detector (DSSD) [8], scale normalization for image pyramids (SNIP) [9], and the high-resolution detection network (HRDNet) [10]. The DSSD algorithm improves detection performance on small objects mainly by using a better feature extraction network and adding context information. The SNIP algorithm is a training scheme that selectively back-propagates the gradients of object instances of different sizes as a function of the image scale to better detect small objects. The HRDNet algorithm feeds high-resolution input into a shallow network to preserve more positional information while feeding low-resolution input into a deep network to extract more semantics. By extracting various features from high to low resolutions, the algorithm improves the detection performance on small objects while maintaining the detection performance on medium and large objects. These algorithms each have their own advantages and limitations. Improving the detection of small objects in traffic scenes is one of the current research hotspots in the field of visual perception for autonomous vehicles. The YOLOv5 model is a milestone object detection method that achieves a good balance between accuracy and speed, but it still leaves room for improvement in small object detection in traffic scenes.
In response to the above problems, in this paper, we first propose an MS-CAB to alleviate the problems caused by scale changes to small object detection. This block effectively improves the feature inconsistency between objects at different scales, and at the same time, focuses attention on the objects in the area that need to be focused on, which reduces the unnecessary shallow feature information of the background. In other studies [11,12], the attention mechanism also considers the scale, such as by aggregating contextual information through convolution kernels of different sizes or from the feature pyramid inside the attention module. The MS-CAB proposed here aggregates contextual information along the channel dimensions of the feature map. It can not only focus on large objects that are distributed globally, but also deal with small objects that are distributed more locally. This block helps the model to detect and identify objects with extreme size differences.
Second, based on MS-CAB, an AFFB is proposed that is different from linear fusion schemes such as addition and concatenation, which are completely context-independent. The block is non-linear and can better capture the contextual information from different network layers by fusing features that are inconsistent semantically and in terms of scale. By replacing the simple addition or concatenation operation with the AFFB, a network model with fewer parameters and higher detection accuracy can be obtained, and the detection effect of small objects is improved greatly.
The remainder of this paper is organized as follows: Section 2 introduces the related works and existing problems of the three topics of object detection, attention mechanisms, and feature fusion. Section 3 briefly introduces the benchmark model, YOLOv5s, and then elaborates on the principle and structure of the proposed MS-CAB and the AFFB. Section 4 presents the experiments and an analysis of the results. The paper ends with our conclusions and suggestions for future work.

3. Benchmark Model and Proposed Methods

In this section, we briefly introduce the benchmark model YOLOv5s, then elaborate on the principle and structure of the proposed MS-CAB, and finally present the AFFB based on MS-CAB.

3.1. The YOLOv5s Benchmark Model

The development of the YOLO series ushered in a change in object detection technology through the adoption of deep learning. At present, the YOLO series includes YOLOv1 [1], YOLOv2 [2], YOLOv3 [3], YOLOv4 [41], and YOLOv5 [7]. YOLOv5 is the latest iteration of the series and constitutes an improvement over YOLOv4: it is faster, more accurate, has fewer parameters, and can be more easily adapted to various devices embedded in vehicles. YOLOv5 comes in four sizes, namely, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x; the smaller models have fewer parameters and run faster, at the cost of lower accuracy. To better meet the real-time requirements of object detection in traffic scenes, we chose the YOLOv5s model as the benchmark model for improvement in this study.

3.2. Multi-Scale Channel Attention Block

Based on the idea of combining local and global features in the convolutional neural networks adopted in ParseNet [42] and multi-scale channel attention [43], we propose MS-CAB, with the main difference being that we use 1 × 1 convolution rather than kernels of different sizes to control the channel attention scale. Similar to spatial attention, channel attention also has a scale, and the variable that controls that scale is the size of the pooling. Figure 1 shows a diagram of the MS-CAB structure, which is divided into two scales, the local scale and the global scale, where context features are aggregated through both scales. The branch that uses global average pooling is the global scale, while the other is the local scale. This block gathers contextual information along the channel dimension of the feature map, and can simultaneously focus on large objects that are more distributed in the global range and small objects that are distributed more in the local range, which helps the model to detect and identify objects with extreme scale changes in traffic scenes. In the following, we introduce the details of the implementation of the proposed MS-CAB.
Figure 1. The MS-CAB structure. The global average pooling branch is the global channel attention, while the other is the local channel attention.
Suppose that the output of a certain layer in the middle of the network is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the channel number of the feature map, and $H$ and $W$ are the height and width of the feature map, respectively. Then, $X$ is used as the input of MS-CAB. The global and local channel attention can be obtained by changing the pooling size, and 1 × 1 convolution is used as the local channel context aggregator to extract the channel interaction at each spatial location. The local channel context $L(X) \in \mathbb{R}^{C \times H \times W}$ can be expressed as
$$L(X) = \mathrm{BN}(\mathrm{Conv}_2(\mathrm{Hs}(\mathrm{BN}(\mathrm{Conv}_1(X))))),$$
where the convolution kernel parameters of $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are $\frac{C}{r} \times C \times 1 \times 1$ and $C \times \frac{C}{r} \times 1 \times 1$, $r$ is the channel reduction ratio, $\mathrm{BN}$ stands for batch normalization [44], and $\mathrm{Hs}$ stands for the Hardswish activation function [45]. The local channel context $L(X)$ has the same shape as the input feature map $X$, and retains and highlights the richly detailed information of the low-level features. It focuses more on the small object information present in the local range.
The global channel context $G(X) \in \mathbb{R}^{C \times 1 \times 1}$ can be expressed as
$$G(X) = \mathrm{BN}(\mathrm{Conv}_2(\mathrm{Hs}(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{Hs}(g(X))))))),$$
$$g(X) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X[:, i, j]$$
where $g(X) \in \mathbb{R}^{C}$ stands for global average pooling. Here, $G(X)$ has the same number of channels as the input feature map $X$ and pays more attention to large object information that is distributed more globally.
Combining the local channel context $L(X)$ and the global channel context $G(X)$, the output $Y \in \mathbb{R}^{C \times H \times W}$ of the MS-CAB can be expressed as follows:
$$Y = X \otimes \mathrm{MS\text{-}CAB}(X) = X \otimes \sigma(L(X) \oplus G(X))$$
where $\mathrm{MS\text{-}CAB}(X) \in \mathbb{R}^{C \times H \times W}$ represents the output weight of the MS-CAB, $\sigma$ represents the sigmoid function, $\otimes$ represents element-wise multiplication, and $\oplus$ represents the addition of the broadcast mechanism.
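The data flow of the MS-CAB equations above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: batch normalization is omitted (treated as the identity), the local and global branches are given separate weight matrices, feature maps are nested lists indexed `[c][i][j]`, and the function names and weight layout are assumptions made for illustration only.

```python
import math

def hardswish(v):
    """Hardswish activation: v * ReLU6(v + 3) / 6."""
    return v * min(max(v + 3.0, 0.0), 6.0) / 6.0

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1x1(weight, channels):
    """A 1x1 convolution at one location is a matrix-vector product over channels."""
    return [sum(w * c for w, c in zip(row, channels)) for row in weight]

def bottleneck(w1, w2, channels):
    """Conv1 (C -> C/r), Hardswish, Conv2 (C/r -> C); batch norm omitted (assumption)."""
    hidden = [hardswish(h) for h in conv1x1(w1, channels)]
    return conv1x1(w2, hidden)

def ms_cab_weights(x, local_w, global_w):
    """Return sigmoid(L(X) + G(X)) for x given as nested lists indexed [c][i][j]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Global branch: average pooling g(X), Hardswish, then the bottleneck -> C values.
    pooled = [sum(map(sum, x[c])) / (H * W) for c in range(C)]
    g = bottleneck(global_w[0], global_w[1], [hardswish(p) for p in pooled])
    weights = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            # Local branch: the bottleneck applied to the channels of one pixel.
            l = bottleneck(local_w[0], local_w[1], [x[c][i][j] for c in range(C)])
            for c in range(C):
                # Broadcast addition: G(X) contributes one value per channel.
                weights[c][i][j] = sigmoid(l[c] + g[c])
    return weights

def ms_cab(x, local_w, global_w):
    """Y = X * MS-CAB(X): scale every element of x by its attention weight."""
    m = ms_cab_weights(x, local_w, global_w)
    return [[[xv * mv for xv, mv in zip(xr, mr)] for xr, mr in zip(xc, mc)]
            for xc, mc in zip(x, m)]
```

With C = 4 and r = 4, each branch takes a 1 × 4 matrix for the first convolution and a 4 × 1 matrix for the second. Because the weights pass through a sigmoid, every output element is the input element scaled by a factor strictly between 0 and 1.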
The proposed MS-CAB was embedded in the four Concat operation branches of the YOLOv5s model, and a new network model, MS-CAB_YOLOv5s, was obtained. The network structure diagram is shown in Figure 2. In the diagram, “Input” refers to the network input, and “Prediction” is the prediction result made by the network on the feature map on three scales. “Upsample” represents an upsampling operation, “Concat” denotes a concatenation operation, and “Conv” denotes a convolution operation. The composition of the “Focus” block is shown in Figure 3. It performs a slicing operation on the input red/green/blue (RGB) image, ultimately integrating the width and height information into the channel dimension. Its main function is to reduce floating point operations and improve the running speed of the model. The CBL block is composed of a convolution layer, batch normalization, and the Hardswish activation function, and its composition is shown in Figure 4. The YOLOv5s model contains two cross stage partial (CSP) structures [46], of which the CSP1 structure is used in the backbone of the network, while the CSP2 structure is used in the neck of the network. The composition of CSP1_X is shown in Figure 5. Here, CSP1_X indicates that it contains X residual units; for example, CSP1_1 contains one residual unit, and CSP1_3 contains three residual units. The composition of each residual unit is shown in Figure 6. The composition of CSP2_X is shown in Figure 7. Here, CSP2_X means that, in addition to the first CBL component, there are 2 × X CBL components in the middle. The size of the convolution kernel in the first CBL component is 1 × 1, while in the second CBL component it is 3 × 3. For example, in addition to the first CBL component in CSP2_1, there are 2 × 1 = 2 CBL components in the middle, and the convolution kernel sizes in the two CBL components are 1 × 1 and 3 × 3, respectively.
The SPP block uses the maximum pooling method to perform “Concat” operations on feature maps of different scales, and its composition is shown in Figure 8.
Figure 2. The MS-CAB_YOLOv5s network structure.
Figure 3. Composition of the “Focus” block.
Figure 4. Composition of the CBL block.
Figure 5. Composition of the CSP1_X block.
Figure 6. Composition of the residual unit block.
Figure 7. Composition of the CSP2_X block.
Figure 8. Composition of the SPP block.
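The slicing step of the “Focus” block described above can be sketched as a space-to-depth rearrangement. The sketch covers only the slicing and concatenation, not any convolution that follows it. The function name `focus_slice`, the nested-list layout `[c][i][j]`, and the order in which the four slices are concatenated are assumptions for illustration.

```python
def focus_slice(image):
    """Space-to-depth: [C][H][W] -> [4C][H/2][W/2] by taking every second pixel."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    if H % 2 or W % 2:
        raise ValueError("height and width must be even")
    out = []
    # Four (row, column) phase offsets; the concatenation order is an assumption.
    for di, dj in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for c in range(C):
            # Keep rows di, di+2, ... and columns dj, dj+2, ... of channel c.
            out.append([row[dj::2] for row in image[c][di::2]])
    return out
```

A 3 × 640 × 640 RGB input becomes a 12 × 320 × 320 tensor: no pixel is discarded, but the spatial resolution seen by the following convolution is halved in each direction.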

3.3. Attention Feature Fusion Block

In combination with the multi-scale channel attention block proposed above, we propose AFFB, which can better capture contextual information from different network layers by fusing semantic and scale-inconsistent features and thus achieve better object detection. Figure 9 is a structure diagram of the AFFB. Due to the presence of the multi-scale channel attention block, the output $Z \in \mathbb{R}^{C \times H \times W}$ of the AFFB can be expressed as
$$Z = \mathrm{MS\text{-}CAB}(X_1 \oplus X_2) \otimes X_1 + (1 - \mathrm{MS\text{-}CAB}(X_1 \oplus X_2)) \otimes X_2$$
where $X_1 \in \mathbb{R}^{C \times H \times W}$ and $X_2 \in \mathbb{R}^{C \times H \times W}$ are two input feature maps, with $X_1$ being a low-level semantic feature map and $X_2$ a high-level semantic feature map. The values of the fusion weights $\mathrm{MS\text{-}CAB}(X_1 \oplus X_2)$ and $1 - \mathrm{MS\text{-}CAB}(X_1 \oplus X_2)$ are both between 0 and 1, which corresponds to a weighted averaging operation between $X_1$ and $X_2$.
Figure 9. The AFFB structure.
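The weighted-averaging behaviour of the AFFB output can be shown with a small sketch on flattened feature maps. The attention function is passed in as a parameter. The `sigmoid_stand_in` helper is a hypothetical placeholder for the MS-CAB weights, not the block itself, and the two inputs are assumed to be combined by element-wise addition before the attention is computed.

```python
import math

def affb(x1, x2, attention):
    """Z = M * X1 + (1 - M) * X2 with M = attention(X1 + X2), element-wise.

    x1, x2: flattened feature maps (equal-length lists of floats).
    attention: callable returning one weight in (0, 1) per element; in the
    paper this role is played by the MS-CAB weights.
    """
    if len(x1) != len(x2):
        raise ValueError("feature maps must have the same shape")
    summed = [a + b for a, b in zip(x1, x2)]  # initial integration of both inputs
    m = attention(summed)
    return [w * a + (1.0 - w) * b for w, a, b in zip(m, x1, x2)]

def sigmoid_stand_in(values):
    """Hypothetical stand-in for MS-CAB: a plain element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]
```

Because each weight lies between 0 and 1, every fused value lies between the corresponding values of the two inputs; a constant weight of 0.5 reduces the block to a plain average.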
In YOLOv5s, linear feature fusion is performed through concatenation, which only yields a fixed linear aggregation of feature maps, and is not adaptable to the object to be detected. The AFFB has fewer parameters, is non-linear, and can capture the contextual information from different network layers better through the fusion of features that are inconsistent semantically and in terms of scale. The four “Concat” operations are then replaced in the YOLOv5s model with the proposed AFFB to obtain a new network model AFFB_YOLOv5s, as shown in Figure 10.
Figure 10. The AFFB_YOLOv5s network structure.

4. Experiments and Result Analysis

4.1. Datasets and Experimental Settings

4.1.1. Datasets

In this paper, the object detection task is oriented towards traffic scenes, and thus the experimental part mainly used the BDD100K dataset [47], while the PASCAL VOC dataset [48] was used as an auxiliary validation dataset.
The BDD100K dataset is the largest open autonomous driving dataset, and includes ten categories of traffic scene objects: car, bus, person, bike, truck, motor, train, rider, traffic sign, and traffic light. It covers a very rich diversity of geography, environments, and weather, which enables models to recognize a variety of complex traffic scenes and strengthens their generalization ability. The dataset has a total of 100,000 images with a resolution of 1280 × 720 pixels. The official usage guidelines recommend splitting the dataset into a training set, a validation set, and a test set at a 7:1:2 ratio. As the labels of the test set are not disclosed, we used the validation set to test the model and evaluate its detection performance. The final training set consisted of 70,000 images, and the set used for testing (the official validation set) consisted of 10,000 images. (The BDD100K dataset is available at https://bdd-data.berkeley.edu, accessed on 25 November 2020).
The PASCAL VOC dataset is a commonly used object detection dataset, and it includes two parts, VOC2007 and VOC2012, with a total of 20 categories: airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV monitor. In this paper, 22,136 images of the VOC2007 and VOC2012 training and validation sets were used for model training. The test set of VOC2007 has a total of 4952 images and was used to evaluate the detection performance of the model. (The PASCAL VOC dataset is available at http://host.robots.ox.ac.uk/pascal/VOC/, accessed on 30 November 2020).

4.1.2. Experimental Settings

(a) Network loss function
The loss function of the network designed in this paper is divided into three parts: bounding box regression loss $L_{box}$, confidence loss $L_{obj}$, and classification loss $L_{cls}$. The total loss of the network is the sum of the three functions. The bounding box regression loss uses the complete intersection over union (CIoU) loss [49], and both the confidence loss and classification loss use the binary cross-entropy (BCE) with logits loss (BCEWithLogitsLoss). The CIoU loss considers three important geometric factors of the bounding box regression loss: the overlap area between the prediction and the ground truth boxes; the center point distance of the prediction and the ground truth boxes; and the aspect ratio between the prediction and the ground truth boxes, which improves the speed and accuracy of bounding box regression. The bounding box regression loss $L_{box}$ can be expressed as follows:
$$L_{box} = 1 - CIoU = 1 - \left( IoU - \frac{\rho^2}{c^2} - \alpha v \right)$$
where intersection-over-union (IoU) is the ratio of the intersection area to the union area of the prediction box and the ground truth box, $\rho$ is the Euclidean distance between the center points of the prediction and the ground truth boxes, and $c$ is the diagonal length of the smallest enclosing box covering both the prediction box and the ground truth box. Besides, $\alpha$ is the trade-off parameter, which is defined as
$$\alpha = \frac{v}{(1 - IoU) + v}$$
here, $v$ is a parameter that measures the consistency of the aspect ratio between the ground truth box and the prediction box, and it is expressed as follows:
$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w^{p}}{h^{p}} \right)^2$$
where $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box, while $w^{p}$ and $h^{p}$ are the corresponding values of the prediction box.
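As a worked illustration of the CIoU terms defined above, the following sketch evaluates the loss for a single pair of boxes. The box format `(cx, cy, w, h)`, the function name, and the small `eps` added to avoid division by zero are assumptions; a training implementation would operate on batched tensors.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss, 1 - (IoU - rho^2 / c^2 - alpha * v), for boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Overlap term: intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # Distance term: squared center distance over squared enclosing-box diagonal.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps
    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

For two identical boxes, all three penalty terms vanish and the loss is approximately zero. For two disjoint boxes of the same shape, the IoU and aspect-ratio terms are zero, so the loss reduces to 1 + ρ²/c² and still provides a useful signal that pulls the centers together.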
The BCEWithLogitsLoss mainly measures the binary cross-entropy between the target value and the output value of the model. It can be expressed as
$$L_n = -w_n \left[ y_n \log \sigma(x_n) + (1 - y_n) \log (1 - \sigma(x_n)) \right]$$
where $w_n$ is the loss weight of each category, $y_n$ is the target value, $x_n$ is the output value of the model, and $\sigma$ is the sigmoid function.
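The per-element BCE-with-logits expression above can be evaluated directly, but a naive implementation overflows for large logits. The sketch below uses the standard numerically stable rearrangement of the same formula. The function names are illustrative, and the mean reduction is an assumption.

```python
import math

def bce_with_logits(x, y, w=1.0):
    """-w * [y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))], computed stably.

    Uses the identity loss = max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    return w * (max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))

def mean_bce_with_logits(logits, targets):
    """Average the per-element losses over a list of (logit, target) pairs."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)
```

A logit of 0 with either target gives log 2, the loss of an undecided prediction, and the rearranged form stays finite even for logits in the thousands.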
(b) Training parameter settings
In this study, we used the stochastic gradient descent algorithm [50] to optimize the loss function. The momentum was set to 0.937, the weight decay coefficient was set to 0.0005, and the initial learning rate was set to 0.01. We used warmup training [51], cosine annealing [52], gradient accumulation, exponential moving average, and other optimization strategies. In terms of data augmentation, in addition to the most advanced mosaic data augmentation method [41], common data augmentation methods, such as random hue, saturation, value transformation, image horizontal and vertical translation, image scaling, and image left and right flip, were also used. The batch size was set to 32, the epochs were set to 300, and the resolution size of the input image was set to 640 × 640. The channel reduction ratio r was set to 4. The k-means clustering algorithm was used to obtain new anchor boxes. Other parameter settings were consistent with the default settings of YOLOv5. The computer configuration used in the experiment is shown in Table 1.
Table 1. Computer configuration.
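The combination of warmup training and cosine annealing mentioned in the training settings can be sketched as a per-epoch learning-rate function. The initial rate of 0.01 and the 300 epochs come from the text; the warmup length of 3 epochs and the final rate of 20% of the initial rate are hypothetical values chosen for illustration, and the exact schedule used in the experiments may differ.

```python
import math

def learning_rate(epoch, total_epochs=300, base_lr=0.01,
                  warmup_epochs=3, final_ratio=0.2):
    """Linear warmup followed by cosine annealing.

    warmup_epochs and final_ratio are illustrative assumptions,
    not values reported in the paper.
    """
    if epoch < warmup_epochs:
        # Warmup: ramp linearly from 0 up to the base learning rate.
        return base_lr * epoch / warmup_epochs
    # Cosine annealing from base_lr down to base_lr * final_ratio.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (final_ratio + (1.0 - final_ratio) * cosine)
```

Under these assumptions, the rate rises to 0.01 at epoch 3 and then decays smoothly to 0.002 at epoch 300.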
(c) Testing parameter settings
The batch size was set to 1, the resolution size of the input image was set to 640 × 640, the confidence threshold for the filtering prediction box was set to 0.001, and the IoU threshold for non-maximum suppression was set to 0.6. Other parameter settings were consistent with the default YOLOv5 settings.
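The role of the two test-time thresholds can be illustrated with a minimal greedy non-maximum suppression sketch. It is class-agnostic and works on plain `(x1, y1, x2, y2)` tuples, which is a simplification for illustration; the function names are hypothetical, and only the two threshold values are taken from the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_threshold=0.001, iou_threshold=0.6):
    """Greedy NMS over a list of (box, score) pairs; returns kept pairs, best first."""
    # Step 1: drop predictions whose confidence does not exceed the threshold.
    candidates = sorted((d for d in detections if d[1] > conf_threshold),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Step 2: keep a box only if it does not overlap a better box too much.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept
```

The very low confidence threshold keeps nearly all predictions so that the precision-recall curve used for mAP can be traced over its full range, while the IoU threshold removes duplicate boxes on the same object.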

4.2. Quantitative Result Analysis

The three models, YOLOv5s, MS-CAB_YOLOv5s, and AFFB_YOLOv5s, were trained on the BDD100K dataset to test the effectiveness of the proposed MS-CAB and AFFB blocks. Five indicators commonly used in the field of object detection, namely, precision, recall, mAP, frames per second (FPS), and the number of parameters, were used to quantitatively evaluate the accuracy, speed, and size of the models [7]. To quantitatively study the impact of the proposed improvements on the detection of small objects, we examined small objects of the size defined by the COCO dataset [53], that is, those with an area smaller than 32 × 32 pixels. Moreover, to verify the generalization ability of the model on other datasets, we trained the networks on the public PASCAL VOC dataset with the same parameter settings as above and then tested them as an auxiliary validation.
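The small-object criterion and the precision and recall indicators can be written down in a few lines. In this sketch, boxes are `(x1, y1, x2, y2)` tuples in pixels, the area is taken as the bounding-box area, and the image resolution at which the area is measured is left to the caller; these choices and the function names are assumptions for illustration.

```python
SMALL_AREA = 32 * 32  # COCO small-object threshold, in pixels

def is_small(box):
    """True if the box area is strictly below 32 x 32 pixels."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1) < SMALL_AREA

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, restricting the ground-truth boxes to those for which `is_small` returns true, and counting matches against them, yields the small-object precision and recall.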
The accuracy evaluation results of the three models on the BDD100K validation set are shown in Table 2. It is evident that under the premise of ensuring the real-time requirements of a vehicle’s environment perception, compared with the original YOLOv5s model, the precision, recall, and mAP of the MS-CAB_YOLOv5s and AFFB_YOLOv5s models proposed in this paper were improved to varying degrees. Among them, the mAP of the AFFB_YOLOv5s model increased by 0.9 percentage points, which is a significant improvement given the complexity of the BDD100K traffic scene dataset. The 63 FPS achieved by both improved networks can fully meet the real-time requirements of vehicles’ environment perception systems. Furthermore, the parameters of the model were reduced to a certain extent. The size of the model is only 14.7 MB, which makes it quite suitable for embedded vehicle platforms.
Table 2. Model performance comparison on the BDD100K validation set.
The BDD100K dataset is a traffic scene dataset, and thus contains many cars and traffic signs at a distance with a pixel area less than 32 × 32 pixels. These objects are defined as small objects that need to be detected. Table 3 shows the comparison results of the three models for small object detection performance. Compared with the original YOLOv5s model, the MS-CAB_YOLOv5s and AFFB_YOLOv5s models proposed in this paper had a significantly improved precision of small object detection, while the recall decreased slightly, and the mAP improved by 1.6 and 3.5 percentage points, respectively. This shows that the MS-CAB and AFFB significantly improved the model’s detection effect on small objects.
Table 3. Comparison of models on small object detection performance.
To verify the generalization ability of the model, the three models were trained and tested on the PASCAL VOC dataset. The performance comparison for each model is shown in Table 4. Under the premise of ensuring real-time performance, the two models, MS-CAB_YOLOv5s and AFFB_YOLOv5s, had improved precision, recall, and mAP. This again verifies the effectiveness of the MS-CAB and AFFB to improve the performance of object detection. At the same time, it shows that our improved model can adapt to different datasets or scenes and has good generalization ability.
Table 4. Performance comparison of models on PASCAL VOC test set.

4.3. Comparative Analysis of Detection Results

Figure 11 shows a visual comparison of the detection results of the YOLOv5s model, the MS-CAB_YOLOv5s model, and the AFFB_YOLOv5s model. To see the differences between the three models more easily, the yellow rectangles in the detection result of column (a) in Figure 11 indicate the objects that were not detected by YOLOv5s. Similarly, the yellow rectangles in the detection result of column (b) indicate the objects that were not detected by MS-CAB_YOLOv5s. The AFFB_YOLOv5s model could detect small objects with small pixel areas, such as cars, people, and traffic signs, at long distances that were not detected by the YOLOv5s model. At the same time, the detection effect was also excellent under dark night conditions. Moreover, compared with the benchmark model YOLOv5s, the detection effect of the MS-CAB_YOLOv5s model was better. It could detect some objects that the YOLOv5s model did not detect, but its effect was not as good as that of AFFB_YOLOv5s. For example, in column (b) of Figure 11, the person on the left side of the figure on the second row and the traffic sign on the right side of the figure on the third row were not detected by the MS-CAB_YOLOv5s model, but they were all accurately detected by the AFFB_YOLOv5s model. Based on these detection results in Figure 11, both the MS-CAB_YOLOv5s model and the AFFB_YOLOv5s model could improve the effect of object detection in traffic scenes, and the AFFB_YOLOv5s model had the best detection effect, especially for small objects that are away from the vehicle, which is of great significance for improving the stability and efficiency of automatic driving systems and preventing traffic accidents.
Figure 11. Comparison of the detection results of YOLOv5s, MS-CAB_YOLOv5s, and AFFB_YOLOv5s.

5. Conclusions and Future Work

The high accuracy and fast real-time performance of object detection algorithms are very important for the safety and real-time control of autonomous vehicles. In this paper, we presented a small object detection method for traffic scenes based on attention feature fusion for autonomous driving systems as an improvement to the YOLOv5s architecture. To aggregate the effective information at the local and global scales, MS-CAB simultaneously focuses on small objects that are more distributed within a local range and large objects that are more distributed on the global range. Using AFFB to fuse contextual information from different network layers, we obtain a model with fewer parameters and higher accuracy. Under the condition of meeting the real-time requirements of vehicles’ environment perception systems, compared with the benchmark model YOLOv5s, the model proposed in this paper increased the mAP of all objects on the validation set of the traffic scene dataset BDD100K by 0.9 percentage points. Specifically, the mAP of small objects increased by 3.5 percentage points. Therefore, the model achieves a better balance between object detection accuracy and speed in traffic scenes, and can effectively improve the performance of vision-based object detection systems for autonomous vehicles.
Since our proposed method is essentially based on deep learning, there are some general limitations. First, the interpretability of deep learning is poor. It learns the implicit relationship between input and output features, but not the causal relationship. Second, the neural network has many parameters, and network training requires a large amount of time and relatively large computing power. Therefore, the deep learning method requires more powerful computer hardware. Finally, the accuracy of a model based on deep learning relies greatly on the collected data, and the accuracy of the dataset labels directly determines the accuracy of the model's detections. A traditional method based on manual feature extraction is a beneficial supplement to the deep learning method. In future research, we will try to combine the two methods to further improve object detection performance. We plan to deploy the model proposed in this paper to embedded vehicle devices to develop more convenient portable applications. Moreover, we will explore the extent to which the proposed blocks improve the performance of larger YOLOv5 models.

Author Contributions

Conceptualization, J.L. and Y.Y.; methodology, J.L.; software, J.L., Y.Y., and Z.W.; validation, L.L. and Y.Z.; investigation, L.L.; resources, Y.Z.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y., L.L., and Z.W.; visualization, Y.Y.; supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 51775082, 61976039) and the China Fundamental Research Funds for the Central Universities (Grant Nos. DUT19LAB36, DUT20GJ207), and Science and Technology Innovation Fund of Dalian (2018J12GX061).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  2. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  3. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision-ECCV 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
  5. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-Head R-CNN: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
  6. Qin, Z.; Li, Z.; Zhang, Z.; Bao, Y.; Yu, G.; Peng, Y.; Sun, J. ThunderNet: Towards real-time generic object detection. arXiv 2019, arXiv:1903.11752. [Google Scholar]
  7. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. YOLOv5. Available online: http://doi.org/10.5281/zenodo.4154370 (accessed on 16 November 2020).
  8. Fu, C.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  9. Singh, B.; Davis, L.S. An Analysis of Scale Invariance in Object Detection—SNIP. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3578–3587. [Google Scholar]
  10. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small Objects. arXiv 2020, arXiv:2006.07607. [Google Scholar]
  11. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  12. Wang, W.; Zhao, S.; Shen, J.; Hoi, S.C.H.; Borji, A. Salient Object Detection with Pyramid Attention and Salient Edges. In Proceedings of the 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1448–1457. [Google Scholar]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision-ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 346–361. [Google Scholar]
  14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  16. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  17. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  18. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  19. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  20. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  21. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  22. Fan, D.; Wang, W.; Cheng, M.; Shen, J. Shifting more attention to video salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8546–8556. [Google Scholar]
  23. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision-ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  26. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  27. Roy, A.G.; Navab, N.; Wachinger, C. Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks. Med Image Comput. Comput. Assist. Interv. 2018, 11070, 421–429. [Google Scholar]
  28. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  29. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  30. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Med Image Comput. Comput. Assist. Interv. 2015, 9351, 234–241. [Google Scholar]
  33. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Computer Vision-ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 354–370. [Google Scholar]
  34. Li, Z.; Zhou, F. FSSD: Feature Fusion Single Shot Multibox Detector. arXiv 2018, arXiv:1712.00960. [Google Scholar]
  35. Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep Feature Fusion for VHR Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784. [Google Scholar] [CrossRef]
  36. Lim, J.; Astrid, M. Small object detection using context and attention. arXiv 2019, arXiv:1912.06319. [Google Scholar]
  37. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  38. Ghiasi, G.; Lin, T.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7029–7038. [Google Scholar]
  39. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  40. Gao, F.; Wang, C.; Li, C. A combined object detection method with application to pedestrian detection. IEEE Access 2020, 8, 194457–194465. [Google Scholar] [CrossRef]
  41. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  42. Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking wider to see better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
  43. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  44. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 6–11 July 2015; pp. 448–456. [Google Scholar]
  45. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  46. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  47. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2633–2642. [Google Scholar]
  48. Everingham, M.; Zisserman, A.; Williams, C.; Gool, L.V.; Allan, M.; Bishop, C.M.; Chapelle, O.; Dalal, N.; Deselaers, T.; Dorkó, G.; et al. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  49. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  50. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  51. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
  52. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  53. Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision-ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
