Article

BFE-Net: Bidirectional Multi-Scale Feature Enhancement for Small Object Detection

College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(7), 3587; https://doi.org/10.3390/app12073587
Submission received: 20 March 2022 / Revised: 30 March 2022 / Accepted: 30 March 2022 / Published: 1 April 2022

Abstract

Small object detection is a challenging problem in computer vision because small objects have low resolution and carry limited feature information. Making full use of high-resolution features is therefore an important factor in improving small object detection. To improve the utilization of high-resolution features, this paper proposes the Bidirectional Multi-scale Feature Enhancement Network (BFE-Net) based on RetinaNet. First, we introduce a bidirectional feature pyramid structure to shorten the propagation path of high-resolution features. Second, we use residually connected dilated convolution blocks to fully extract the high-resolution features of the low-level layers. Finally, we supplement the high-resolution features lost during high-level feature propagation by using high-level features to guide the lower-level features. Experiments show that the proposed BFE-Net achieves stable performance gains in object detection. Specifically, it improves RetinaNet from 34.4 AP to 36.3 AP on the challenging MS COCO dataset and is particularly effective for small objects, with an improvement of 2.8% in APS.

1. Introduction

Object detection is one of the core problems in computer vision research. With the development of convolutional neural networks [1], many advanced detectors based on deep learning have appeared in recent years. At present, there are two types of mainstream object detection algorithms. One is two-stage detection based on candidate boxes, represented by R-CNN [2] and its variants such as Fast R-CNN [3], Faster R-CNN [4], and Mask R-CNN [5]. Two-stage detectors have developed rapidly and their accuracy keeps improving, but their architecture leads to high computational cost and slow detection speed, which cannot meet real-time requirements. The other type is one-stage detection based on regression, represented by SSD [6], YOLOv1 [7] and its variants [8,9,10], RetinaNet [11], and EfficientDet [12]. One-stage detectors directly predict target coordinates and categories after extracting image features with a CNN, which greatly improves detection efficiency, but their main deficiency is that detection accuracy still needs to be improved.
How to improve the accuracy of one-stage object detection has attracted the attention of many researchers. DSSD [13] adds a deconvolution module to SSD [6]. FSSD [14] reconstructs the pyramid feature maps and fuses features of different scales. YOLOv3 [9] addresses the multi-scale problem through a residual network. RefineDet [15] combines the ideas of two-stage and one-stage detectors, adopts a coarse-to-fine regression strategy, and introduces a feature fusion operation similar to FPN [16]. Drawing on the idea of heatmaps, CornerNet [17] predicts a target by predicting a pair of corner keypoints. EfficientDet [12] uses a weighted bidirectional feature pyramid network for feature fusion and scales the model with a compound scaling strategy. All of the above improve detection performance to varying degrees, but they also inevitably introduce problems such as high model complexity, difficult convergence, and slow detection speed.
One of the one-stage detectors, RetinaNet [11], introduces the focal loss to address sample imbalance and achieves accuracy comparable to two-stage frameworks while maintaining real-time performance; however, small targets have always been a difficult point in object detection. Because small targets cover only a small area, have low resolution, and lack diversity in location, detection methods generally cannot reach the accuracy they obtain on large targets. In particular, RetinaNet works well for large objects but is not effective for objects smaller than 32 × 32 pixels [18].
By analyzing the network structure of RetinaNet, this paper found that the main reason for its low detection accuracy on small objects is the limited utilization of high-resolution information, which manifests in two aspects. First, the standard RetinaNet does not make full use of the lower feature layer C2, which contains high-resolution features. The shallow layer C2 of the backbone network carries rich location information that is conducive to recognizing small targets. In the bottom-up path, in order to expand the receptive field of the backbone network, the feature maps keep shrinking, and the stride can easily become larger than the size of a small target, so the features of small targets are lost during convolution. If the shallow information cannot be fully utilized, small object features are lost. Second, the standard RetinaNet uses the FPN structure, upsampling high-level feature maps and adding them to shallow-level features to obtain new feature maps with stronger expressive ability. Feature maps fused with multi-scale features are more robust when detecting objects of different sizes; however, in the bottom-up backbone network, the path from the low-level features to the top-level features is long (for example, in ResNet-50 and ResNet-101 [19]), and some high-resolution information is lost during upsampling, which makes it harder to obtain accurate localization information.
In view of the above two reasons for the low utilization of low-level high-resolution information, this paper makes the following improvements to the standard RetinaNet:
First, to supplement and improve the utilization of high-resolution features, inspired by PANet [20], this paper introduces a bidirectional feature pyramid structure, which fully combines deep and shallow information for feature extraction, thereby improving the expression of target semantic information by the shallow feature network. Based on the standard RetinaNet, this paper adds a bottom-up enhancement path, which greatly shortens the propagation path of low-level information and improves the utilization of low-level information. The semantic and contextual information of small and medium-scale objects is enhanced by bottom-up branching. In addition, the bidirectional feature fusion also fully integrates multi-scale information, which is beneficial to the target detection effect.
Second, to make full use of the high-resolution information, this paper reuses the low-level feature C2 of the backbone network. At the same time, in order to solve the shortcomings of the limited receptive field of low-level features and weak semantic information, and inspired by DetNet [21], this paper uses dilated convolution to expand the receptive field and connect it laterally with P3 from the top-down path. In addition, to enhance the feature fusion between the features extracted from different receptive fields, inspired by AC-FPN [22] and DenseNet [23], this paper connects dilated convolutions with different dilation rates in a residual connection.
Third, to reduce the loss of high-level semantic features caused by the bidirectional feature pyramid structure, this paper generates the high-level feature C6 directly from the backbone network and supplements it with low-level high-resolution features. C6 does not need to pass through the top-down fusion path; it fuses information from lower layers to obtain rich localization information while minimizing the loss of semantic information.
Combining the above analysis and strategies, the detection method proposed in this paper has the following advantages:
(1) To shorten the propagation path of high-resolution information and improve the utilization of high-resolution features, this paper introduces a bidirectional feature pyramid structure.
(2) To take full advantage of the low-level features, this paper designs the Dilated Feature Enhancement structure to extract the high-resolution information of the low-level features through the dilated convolution of residual connections.
(3) To compensate for the high-resolution information lost during the long propagation process of the backbone network, this paper designs the High-level Guided Model, which is guided by high-level features to extract high-resolution information in lower layers.

2. Related Work

2.1. Deep Object Detectors

Two-stage Detectors: The two-stage object detection algorithm first generates object candidate regions and then refines the regions of interest with accurate but relatively inefficient classifiers and regressors. Fast R-CNN [3] proposed the ROI pooling layer, which removes the R-CNN requirement of cropping and warping image regions to a fixed size. Faster R-CNN [4] proposed the Region Proposal Network (RPN) and developed an end-to-end framework that significantly improves the detector’s efficiency; the RPN integrates proposal generation into a single convolutional network. Cascade R-CNN [24] introduced multi-level refinement into Faster R-CNN to achieve more accurate target location prediction. Mask R-CNN [5] replaced the ROI pooling layer with ROIAlign and added an FCN branch for semantic mask prediction alongside bounding-box recognition. The detection process of this type of algorithm is complex: although accuracy is high, detection speed is relatively slow.
One-stage Detectors: The main difference between one-stage and two-stage object detection algorithms is that the former has no candidate-region proposal stage; the training process is relatively simple, and the target category and bounding box are predicted in a single stage. YOLOv1 [7] directly performs coordinate regression and category classification by dividing the output feature map into a grid, omitting the explicit candidate-region extraction step, which greatly reduces detection time. Subsequent work [8,9,10] continued to improve on [7], further raising accuracy while preserving speed. SSD [6] and its variants [14,15] combine the fast detection of YOLO with the accurate localization of Faster R-CNN [4] and introduce multi-scale, multi-resolution detection, in which different layers of the network detect objects of different scales, greatly improving the detection of small targets. Although one-stage detectors are significantly faster than two-stage detectors based on candidate-region proposals, their accuracy had not been comparable. Lin et al. proposed RetinaNet [11] by introducing the focal loss, which reshapes the standard cross-entropy loss so that the detector pays more attention to hard-to-classify samples during training. The focal loss alleviates the problem of unbalanced samples and achieves accuracy comparable to two-stage detection frameworks.
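For reference, a minimal PyTorch sketch of the binary focal loss idea described above is given below. The α and γ values shown are the commonly used defaults (α = 0.25, γ = 2) and the function is illustrative rather than a restatement of any particular implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so that training
    focuses on hard, misclassified ones."""
    # Per-element binary cross-entropy, kept unreduced.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability assigned to the ground-truth class.
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    # Class-balancing weight alpha_t and modulating factor (1 - p_t)^gamma.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```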

2.2. Context Augmentation

Context information can facilitate the localization of region proposals and thereby improve detection and classification results. DeepLab-v2 [25] proposes atrous convolution to extract multi-scale context. CoupleNet [26] proposed a fusion strategy to fully integrate global and local information. Zeng et al. [27] proposed a gated bidirectional network to better integrate contextual features of objects. FPN [16] first constructed a multi-scale feature fusion structure and significantly improved small object detection. PANet [20] and EfficientDet [12] successively optimized the FPN structure and proposed bidirectional feature pyramid structures. PSPNet [28] uses pyramid pooling to obtain a hierarchical global context. NAS-FPN [29] employs neural architecture search to better learn fusion among cross-scale connections. AC-FPN [22] uses attention guidance to exploit dilated convolutions with different dilation rates, acquiring semantic information from different receptive fields at the highest feature layer. FPT [30] fully fuses contextual features by leveraging the attention mechanism to achieve non-local interactions of features across space and scale.

3. Method

This section introduces the overall architecture of our proposed model, BFE-Net (Figure 1). This paper replaces the feature pyramid structure of the standard RetinaNet with a bidirectional feature pyramid structure, which better achieves multi-scale feature fusion, and further incorporates a High-level Guided Model (HGM) and Dilated Feature Enhancement (DFE).

3.1. Overall Architecture

In the process of feature extraction, the shallow feature maps contain high-resolution location information, which can be used to improve the accuracy of bounding box regression; however, because they pass through fewer convolutions, their semantics are weaker and they contain more noise. Deep feature maps contain strong semantic information, but their resolution is low and their ability to express detail is poor.
Multi-scale fusion is an important way to improve the accuracy of small target detection. The feature pyramid network (FPN) [16] builds feature representations of the same image at multiple scales from the bottom to the top and exploits the information of the multi-scale feature maps, taking both strong semantic features and location features into account, which is conducive to detecting small objects. The standard RetinaNet [11] uses the FPN of [16] on top of its backbone network; at the same time, to obtain richer semantic features and better results on large targets, it additionally generates the layers C6 and C7 from the bottom-up backbone network (Figure 2b). To further shorten the information transmission path and make full use of high-resolution information, PANet adds a bottom-up path augmentation to the standard FPN (Figure 2a). Inspired by PANet, this work designs a bidirectional feature pyramid structure (Figure 2c) with levels N2 through N6; all pyramid levels in this structure have C = 256 channels.
This structure uses feature pyramid levels P2 to P6, where P3 to P5 are computed from the corresponding ResNet residual stages (C3 through C5) using top-down and lateral connections, as in RetinaNet [11]. P6 is obtained through the HGM, and P2 is obtained by using the DFE to enhance the C2 feature and then connecting it laterally with P3, which differs from [11]. N2 to N5 are obtained from the corresponding P2 to P5 via lateral connections and a bottom-up path, as in PANet [20]. N6 is obtained by a lateral connection with C6 using a stride convolution instead of downsampling (Figure 3b), which differs from [20].
The HGM obtains C6 by using C5 as a guide to extract lower-level feature information, which supplements the high-resolution information lost during the long bottom-up propagation of the standard backbone network. The DFE uses dilated convolution to increase the receptive field of the convolution kernel, extract richer contextual information in the lower layers, and obtain richer low-level localization features. We introduce them in detail below.
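For concreteness, the sketch below shows one way the P-levels and N-levels described above could be wired in PyTorch. The channel counts, the assumption that the externally supplied P2 and C6 (produced by the DFE and HGM, respectively) already have 256 channels, and the module names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalPyramidSketch(nn.Module):
    """Illustrative wiring of the top-down P-levels and bottom-up N-levels.
    P2 and C6 are assumed to be produced elsewhere (by DFE and HGM) and
    already reduced to `channels` channels."""

    def __init__(self, c_channels=(512, 1024, 2048), channels=256):
        super().__init__()
        # 1x1 lateral convolutions for C3..C5, as in a standard FPN.
        self.lateral = nn.ModuleList(nn.Conv2d(c, channels, 1) for c in c_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in c_channels)
        self.n2_conv = nn.Conv2d(channels, channels, 1)  # Eq. (2): N2 = Conv1x1(P2)
        # Stride-2 convolutions used along the bottom-up N-path.
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(4))

    def forward(self, c3, c4, c5, p2, c6):
        # Top-down path with nearest-neighbor upsampling (P5 -> P4 -> P3).
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2)
        p3, p4, p5 = [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]
        # Bottom-up enhancement path (N2 -> ... -> N6).
        n2 = self.n2_conv(p2)
        n3 = p3 + self.down[0](n2)
        n4 = p4 + self.down[1](n3)
        n5 = p5 + self.down[2](n4)
        n6 = c6 + self.down[3](n5)  # lateral connection with C6 via a stride conv
        return n2, n3, n4, n5, n6
```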

3.2. Dilated Feature Enhancement

The standard RetinaNet does not use the high-resolution pyramid level P2 that would be computed from the output of the corresponding ResNet residual stage C2. The shallow feature C2 of the backbone network contains rich location information at higher resolution, which is beneficial for localizing small targets. Effective strategies are therefore needed to make full use of the high-resolution features from C2.
In this paper, we present Dilated Feature Enhancement (Figure 4a), which combines the enhanced high-resolution feature C2 with P3 from the top-down path to obtain P2 as the input of the enhanced bottom-up structure. The structure consists of a Dilated block (Figure 4b) and a lateral connection through a 1 × 1 conv, as in [16]. The dilated convolution is used to increase the receptive field of the convolution kernel, extract richer context information, and obtain richer low-level localization features without significantly increasing the computational load. This operation can be formulated as:
P_2 = F_{\mathrm{concat}}\big(\mathrm{Conv}_{1\times 1}(C_2),\; D(C_2),\; \mathrm{Conv}_{3\times 3}(P_3)\big),  (1)
N_2 = \mathrm{Conv}_{1\times 1}(P_2),  (2)
where D(·) is the Dilated block operation, which is described in detail below and defined in Equation (4), N2 is the input of the bottom-up augmentation path, and F_concat denotes feature fusion by concatenation.
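The following PyTorch sketch illustrates one possible reading of Equations (1) and (2). The 256-channel sizes, the nearest-neighbor upsampling used to align P3 with C2, and the placeholder passed for D(·) are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFESketch(nn.Module):
    """Rough reading of Eqs. (1)-(2): the 1x1-projected C2 and the
    Dilated-block output D(C2) are concatenated with an upsampled,
    3x3-convolved P3 to form P2, then fused by a 1x1 conv to give N2."""

    def __init__(self, c2_channels=256, p_channels=256, dilated_block=None):
        super().__init__()
        self.proj = nn.Conv2d(c2_channels, p_channels, 1)          # Conv1x1(C2)
        self.p3_conv = nn.Conv2d(p_channels, p_channels, 3, padding=1)
        # D(.) from Eq. (4); any module mapping C2 to p_channels works here.
        self.dilated = dilated_block or nn.Conv2d(c2_channels, p_channels, 3, padding=1)
        self.fuse = nn.Conv2d(3 * p_channels, p_channels, 1)        # Eq. (2)

    def forward(self, c2, p3):
        # Upsample P3 to C2's spatial size so the three terms can be concatenated.
        p3_up = F.interpolate(self.p3_conv(p3), size=c2.shape[-2:], mode="nearest")
        p2 = torch.cat([self.proj(c2), self.dilated(c2), p3_up], dim=1)  # Eq. (1)
        return self.fuse(p2)                                             # N2
```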
Dilated Block: In most FPN-based network structures, feature maps corresponding to different receptive fields are simply merged by element-wise addition in the top-down path. Such addition does not allow the information captured by different receptive fields to interact effectively, so the extracted information is under-utilized.
As shown in Figure 4b, to merge multi-scale information from different receptive fields more elaborately, this paper employs a residual connection in the Dilated block: the output of each dilated layer is combined with the original input feature maps and then fed into the next dilated layer. At the same time, each dilated layer contributes a separate output feature as part of the output of the Dilated block. Finally, to maintain the coarse-grained information of the initial input, this paper concatenates the outputs of all dilated layers and feeds them into a 1 × 1 convolutional layer to fuse the coarse- and fine-grained features.
Besides this design, we considered two other feature extraction methods. As shown in Figure 5, Dilated block A (Figure 5a) uses the parallel method of Inception to extract features from the input feature map with dilated convolutions of different dilation rates, and the extracted features are fused by concatenation. Dilated block B (Figure 5b) uses the conventional approach of extracting features in series and finally obtains only one feature layer. We analyze the effects of these structures in the experiments in Section 4.4. The dilated convolution with residual connections can be described by the following recursive formulation:
F_{\mathrm{dilated}}(i) =
\begin{cases}
\mathrm{Conv}_{1\times 1}(C_2) \oplus \mathrm{Dconv}_{3\times 3}(C_2), & i = 0 \\
F_{\mathrm{dilated}}(i-1) \oplus \mathrm{Conv}_{1\times 1}(C_2), & i > 0
\end{cases}  (3)
where F_dilated(i) denotes the dilated convolution operation with the residual structure, the symbol ⊕ denotes feature fusion by addition, and Dconv3×3(·) denotes a dilated convolution with a 3 × 3 kernel and dilation rates of 3, 6, and 12, respectively. The three output feature maps containing multi-scale context information are then fused using the concatenate method:
D(C_2) = F_{\mathrm{concat}}\big(F_{\mathrm{dilated}}(1),\; F_{\mathrm{dilated}}(2),\; F_{\mathrm{dilated}}(3)\big).  (4)
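A minimal sketch of one reading of Equations (3) and (4) follows. Because the published formulation is compact, the exact placement of the dilated convolutions, the use of addition for the residual fusion, and the channel sizes here are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DilatedBlockSketch(nn.Module):
    """Rough reading of Eqs. (3)-(4): three 3x3 dilated convs (rates 3, 6, 12)
    whose inputs are residually combined with a 1x1-projected C2; all three
    outputs are concatenated and fused by a 1x1 conv."""

    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, mid_channels, 1)        # Conv1x1(C2)
        self.dilated = nn.ModuleList(
            nn.Conv2d(mid_channels, mid_channels, 3, padding=r, dilation=r)
            for r in (3, 6, 12))
        self.fuse = nn.Conv2d(3 * mid_channels, mid_channels, 1)   # Eq. (4)

    def forward(self, c2):
        proj = self.proj(c2)           # projected input, reused at every step
        outputs, x = [], proj
        for conv in self.dilated:
            x = conv(x) + proj         # Eq. (3): residual fusion with the input
            outputs.append(x)
        return self.fuse(torch.cat(outputs, dim=1))
```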

3.3. High-Level Guided Model

To extract richer abstract features, current general-purpose backbone networks such as ResNet [19] and VGG adopt deep network structures; however, as depth increases, continuous downsampling and feature extraction weaken the position information in the deep feature maps, which is not conducive to target detection and localization.
In the standard RetinaNet [11], in order to obtain more abstract features, P6 is obtained via a 3 × 3 stride-2 conv on C5, and P7 is computed by applying ReLU followed by a 3 × 3 stride-2 conv on P6. This operation obtains rich semantic information but loses a lot of location information. To compensate for the high-resolution information lost due to the deep layers of the standard RetinaNet, the High-level Guided Model is designed. Inspired by [30], this model uses the highest-level output feature of the backbone network as the Query and the lower-level features as both Key and Value, and it integrates the higher-resolution lower-level information into the highest-level features, using the visual attributes of low-level pixels to render high-level concepts and thereby fuse high-level and low-level features.
High-level Guided Model (HGM) can be categorized as a low-level information enhancement transformer, which utilizes high-level features with rich category information for the weighted selection of low-level information. By selecting precise resolution details, the structure makes up for the high-resolution information lost in the propagation process and enriches the highest-level information.
We denote the outputs of the convolutional stages at each scale as {C2, C3, C4, C5}, following the settings in [16]. Specifically, the highest-level feature C5 of the backbone network is defined as Q, and each of {C2, C3, C4} is used as both K and V. We take C5 as Q and C4 as K and V as an example to describe the feature fusion process, as shown in Figure 6.
First, K is passed through global average pooling (GAP) to obtain the weight W, and V is downsampled by a 3 × 3 convolution with stride S to obtain Vdown, so that the size of Vdown matches that of Q. Second, Q is reduced by a 1 × 1 convolutional layer to match the dimensions of K. Then W weights Q to obtain Qatt. Finally, Qatt and Vdown are added element-wise to obtain M4. We formulate this procedure as follows:
M_i = \mathrm{Conv}_{3\times 3}\big((\mathrm{Conv}_{1\times 1}(Q) \otimes \mathrm{GAP}(K)) \oplus \mathrm{Sconv}_{3\times 3}(V)\big), \quad i = 2, 3, 4,  (5)
where the symbol ⊗ denotes feature fusion by matrix multiplication. Sconv3×3 is a 3 × 3 stride-2 conv to reduce the size of V.
We perform the above with C2, C3 as K and V, and we obtain {M2, M3}. Then, we perform the concatenate operation on {M2, M3, M4} to obtain M.
M = F_{\mathrm{concat}}(M_2, M_3, M_4),  (6)
M and C5 are fused via the concatenate method, and then by 3 × 3 stride-2 conv for down sampling, and finally, C6 is obtained after dimensionality reduction by a 1 × 1 conv, as shown in Figure 3a.
C_6 = \mathrm{Conv}_{1\times 1}\big(\mathrm{Sconv}_{3\times 3}(F_{\mathrm{concat}}(M, C_5))\big).  (7)
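The sketch below gives one reading of Equation (5) for a single lower level (C4 as K and V, C5 as Q). The channel numbers, the stride used for Sconv (which would grow for C3 and C2), the interpretation of GAP(K) as a channel-wise weight on the reduced Q, and the output channel count are assumptions for illustration; combining M = concat(M2, M3, M4) with C5 as in Equations (6) and (7) would follow the same pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HGMLevelSketch(nn.Module):
    """Rough reading of Eq. (5) for one K/V level: GAP(K) weights the
    1x1-reduced Q, the weighted Q is added to a stride-convolved V, and
    a 3x3 conv produces M_i."""

    def __init__(self, q_channels=2048, kv_channels=1024, out_channels=256, stride=2):
        super().__init__()
        self.q_reduce = nn.Conv2d(q_channels, kv_channels, 1)            # Conv1x1(Q)
        self.v_down = nn.Conv2d(kv_channels, kv_channels, 3,
                                stride=stride, padding=1)                # Sconv3x3(V)
        self.out = nn.Conv2d(kv_channels, out_channels, 3, padding=1)    # outer Conv3x3

    def forward(self, q, kv):
        w = F.adaptive_avg_pool2d(kv, 1)        # GAP(K): one weight per channel
        q_att = self.q_reduce(q) * w            # weighted selection of Q
        v_down = self.v_down(kv)
        # Align spatial sizes if the assumed stride does not match exactly.
        if v_down.shape[-2:] != q_att.shape[-2:]:
            v_down = F.interpolate(v_down, size=q_att.shape[-2:], mode="nearest")
        return self.out(q_att + v_down)         # M_i
```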

4. Experiments

4.1. Dataset and Evaluation Metrics

This paper performs all experiments on the MS COCO detection dataset with 80 categories. We train our model on MS COCO 2017, which consists of 115 k training images and 5 k validation images (minival), and we also report results on a set of 20 k test images (test-dev). The COCO-style Average Precision (AP) is chosen as the evaluation metric; it averages AP across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. Objects with a ground-truth area smaller than 32 × 32 pixels are regarded as small objects, objects between 32 × 32 and 96 × 96 as medium objects, and objects larger than 96 × 96 as large objects; their proportions in the dataset are 41.43%, 34.32%, and 24.24%, respectively. The corresponding accuracies are denoted APS, APM, and APL.
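The short snippet below simply restates these evaluation conventions (the COCO area thresholds and the IoU sweep over which AP is averaged); it is a convenience illustration, not part of the official evaluation code.

```python
def coco_size_bucket(width, height):
    """Assign a ground-truth box to the size bucket used for APS/APM/APL
    (area thresholds of 32^2 and 96^2 pixels)."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# IoU thresholds over which COCO-style AP is averaged: 0.50, 0.55, ..., 0.95.
iou_thresholds = [0.5 + 0.05 * i for i in range(10)]
```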

4.2. Implementation Details

To demonstrate the effectiveness of the proposed BFE-Net, we conducted a series of experiments on the MS COCO dataset. For all experiments in this section, we trained our models with the SGD optimizer on a machine with an Intel i7-9700K CPU, 32 GB of RAM, and one NVIDIA GeForce GTX TITAN X GPU; the deep learning framework is PyTorch 1.7.1 and the CUDA version is 10.1. The momentum is set to 0.9 and the weight decay to 0.0001. The batch size is set to 32 to prevent overfitting and ensure convergence. We initialize the learning rate to 0.01 and decrease it to 0.001 and 0.0001 at the 8th and 11th epochs, respectively. The classical networks ResNet-50 and ResNet-101 were adopted as backbones for comparative experiments.
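A minimal PyTorch sketch of the optimizer and learning-rate schedule described above is given below; the function name is hypothetical and the detector model is assumed to be constructed elsewhere.

```python
import torch

def build_optimizer_and_scheduler(model):
    """SGD with momentum 0.9, weight decay 1e-4, and an initial learning
    rate of 0.01 that drops to 0.001 and 0.0001 at epochs 8 and 11
    (scheduler.step() is assumed to be called once per epoch)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[8, 11], gamma=0.1)
    return optimizer, scheduler
```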

4.3. Main Results

In this section, we evaluate BFE-Net on COCO test-dev and compare it with other state-of-the-art one-stage and two-stage detectors. The original settings of RetinaNet, such as the hyper-parameters for anchors and the focal loss, were kept to ensure a fair comparison. Unless otherwise noted, we used an image scale of 500 pixels for training and testing. All results are shown in Table 1.
Analyzing the experimental results, we find that when ResNet-101 is used as the backbone with an image short side of 500 pixels, the standard RetinaNet [11] achieves excellent results on large targets, with an APL of 49.1%; however, on small targets it falls behind the two-stage methods, with an APS of only 14.7%, which is 3.5% lower than Faster R-CNN w FPN. In contrast, the BFE-Net proposed in this paper performs well on small targets as well as large ones: APS reaches 17.9%, a 3.2% improvement over the standard RetinaNet, and its accuracy is competitive with several two-stage detectors. Compared with the standard RetinaNet, BFE-Net achieves higher AP for objects of all sizes, which shows the effectiveness of the proposed model, especially for small object detection.
The extra components added to the model, such as the low-level feature C2 and the bidirectional feature pyramid structure, introduce additional overhead, so the proposed model is not as fast as the standard RetinaNet [11]. To compare speed, this work used an image scale of 500 pixels on the MS COCO dataset and compared with other object detection models. As shown in Table 2, the standard RetinaNet [11] and EfficientDet-D0 [12] have similar speed and accuracy, while the proposed BFE-Net has better accuracy at a slightly lower speed; however, compared with Faster R-CNN w FPN [16], BFE-Net achieves competitive results in both APS and AP and is also much faster. Compared with the one-stage detectors SSD513 [6] and DSSD513 [13], BFE-Net has clear advantages in both speed and accuracy. Overall, BFE-Net is considerably more accurate than the standard RetinaNet while maintaining reasonable speed compared with the other models.

4.4. Ablation Study

In this section, we conduct extensive ablation experiments on COCO val2017 to analyze the effect of each component of BFE-Net.
To analyze the importance of each component, we gradually apply the bidirectional feature pyramid network, the High-level Guided Model, and Dilated Feature Enhancement to the model to verify their effectiveness. The improvements brought by combinations of components are also presented to show that these components complement each other. The baseline for all ablation studies is RetinaNet with a ResNet-50 backbone. All results are shown in Table 3.
From the experimental data in the table, it can be seen that the three structures proposed in this paper each improve the accuracy of the standard RetinaNet, and combining them effectively improves the detection of small targets. After adding the bidirectional feature pyramid structure, the accuracy increases by 1.1%, and the accuracy on small targets (APS) reaches 15.1%, an improvement of 1.2%. After further adding HGM and DFE, APS increases by 0.4% and 0.8%, respectively, which also shows that fully extracting shallow features greatly helps the recognition of small-scale targets. The accuracy for targets of other sizes also improves to varying degrees. The overall accuracy increases from 32.5% to 34.6%, and the accuracy on small targets achieves a very meaningful improvement, from 13.9% to 16.7%, a gain of 2.8%.
To verify the effectiveness of HGM, the following ablation experiments were performed. R-C6 denotes that C6 is obtained directly from C5 through a 3 × 3 convolutional layer with stride 2, as in RetinaNet [11]; H-C6 denotes that C6 is obtained by means of HGM. The results are shown in Table 4.
Analyzing the experimental data, we find that the small-target accuracy (APS) with H-C6 is 0.3% higher than with R-C6, and H-C6 also achieves the best results on medium and large targets. This shows that the high-resolution features guided by C5 can supplement the high-resolution information lost in the bottom-up propagation process, which is helpful for object detection at all scales.
To verify the effectiveness of connecting dilated convolutions with different dilation rates in a residual manner in DFE, the following ablation experiments were performed. Feature extraction is performed in three ways: P-Dilated (Figure 5) uses three dilated convolution layers connected in parallel; S-Dilated (Figure 5) uses three dilated convolution layers connected in series; R-Dilated (Figure 4b) uses the residual connection finally adopted in this paper. The results are shown in Table 5.
The results show that the R-Dilated design achieves the best effect, with an average accuracy (AP) of 34.6%, an improvement of 0.6%. In particular, the small-target accuracy (APS) increases from 15.1% to 16.7%, an improvement of 1.6 percentage points. In addition, P-Dilated performs better than S-Dilated, especially on small targets. We attribute this to the serially connected dilated convolutions: with large dilation rates, information is lost during convolution and the extracted features become incoherent.

4.5. Visualization of Results

To show the practical effect of the proposed model more intuitively, we compare qualitative detection results of BFE-Net and the standard RetinaNet on COCO val2017 in Figure 7. The first column shows the original images from the MS COCO dataset, the second column the detection results of RetinaNet, and the third column the detection results of the improved BFE-Net.
Analyzing the comparison, we find that in the first set of images the standard RetinaNet misses the distant horses, while BFE-Net detects them successfully. The same phenomenon appears in the second and third sets of results: BFE-Net detects more of the small birds than RetinaNet. In the last set of images, RetinaNet mistakes distant people for birds and misses the boat, while BFE-Net successfully detects both the people and the boat. These results show that the improved model further enhances representation ability and greatly reduces the missed and false detections of small targets.

5. Conclusions

This paper conducts an in-depth analysis of the single-stage detector RetinaNet and identifies a reason for its poor detection performance on small objects. An object detector with a Bidirectional Multi-scale Feature Enhancement Network (BFE-Net) is proposed to address this problem: a bidirectional feature pyramid structure is adopted, the high-resolution feature layer C2 is fully utilized, and the high-resolution features lost during the propagation of high-level features are supplemented, improving the utilization of high-resolution features. Extensive experiments on the challenging COCO dataset show stable detection improvements: AP is improved by 2.1% and APS by 2.8%, which effectively improves the detection of small targets. We believe this work can help future object detection research.

Author Contributions

Conceptualization, H.L. and Q.Z.; methodology, J.R.; validation, L.C. and Y.Y.; writing—original draft preparation, J.R.; writing—review and editing, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research on the Construction of Multi-scale Concept Lattice and Knowledge Discovery Method, grant number 61673396.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in MS COCO at https://doi.org/10.1007/978-3-319-10602-1_48 (accessed on 26 March 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  3. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the NIPS, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE Transactions on Pattern Analysis & Machine Intelligence, Venice, Italy, 22–29 October 2017. [Google Scholar]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  9. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  12. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  13. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  14. Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960. [Google Scholar]
  15. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 13–18 June 2018; pp. 4203–4212. [Google Scholar]
  16. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  17. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  18. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 13–18 June 2018. [Google Scholar]
  21. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Detnet: A backbone network for object detection. arXiv 2018, arXiv:1804.06215. [Google Scholar]
  22. Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided context feature pyramid network for object detection. arXiv 2020, arXiv:2005.11475. [Google Scholar]
  23. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869. [Google Scholar]
  24. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 13–18 June 2018. [Google Scholar]
  25. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Zhu, Y.; Zhao, C.; Wang, J.; Zhao, X.; Wu, Y.; Lu, H. Couplenet: Coupling global structure with local parts for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4126–4134. [Google Scholar]
  27. Zeng, X.; Ouyang, W.; Yan, J.; Li, H.; Xiao, T.; Wang, K.; Liu, Y.; Zhou, Y.; Yang, B.; Wang, Z.; et al. Crafting gbd-net for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2109–2123. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  29. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  30. Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature pyramid transformer. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 323–339. [Google Scholar]
  31. Tychsen-Smith, L.; Petersson, L. Denet: Scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 428–436. [Google Scholar]
  32. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7310–7311. [Google Scholar]
Figure 1. The overall architecture of our proposed model (BFE-Net).
Figure 2. Comparisons of different backbones: (a) PANet; (b) RetinaNet; (c) BFE-Net (our model).
Figure 3. Illustration of feature layer fusion: (a) C6; (b) N6.
Figure 4. Illustration of Dilated Feature Enhancement (DFE): (a) the abbreviated structure of DFE; (b) the structure of the Dilated block.
Figure 5. Illustration of other structures of the Dilated block: (a) series connection; (b) parallel connection.
Figure 6. Illustration of the High-level Guided Model (HGM).
Figure 7. Comparisons of small object detection results: (a) the original images; (b) RetinaNet test results; (c) BFE-Net test results.
Table 1. BFE-Net vs. other two-stage and one-stage detectors on COCO test-dev.

Method | Backbone | AP | AP50 | AP75 | APS | APM | APL
Two-stage methods
DeNet [31] | ResNet-101 | 33.8 | 53.4 | 36.1 | 12.3 | 36.1 | 50.8
CoupleNet [26] | ResNet-101 | 34.4 | 54.8 | 37.2 | 13.4 | 38.1 | 50.8
Faster R-CNN by G-RMI [32] | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0
Faster R-CNN +++ [19] | ResNet-101 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9
Faster R-CNN w FPN [16] | ResNet-101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
Mask R-CNN [5] | ResNet-101 | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2
Cascade R-CNN [24] | ResNet-101 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2
One-stage methods
YOLOv2 [8] | DarkNet-19 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5
SSD513 [6] | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8
YOLOv3 [9] | DarkNet-53 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 51.1
DSSD513 [13] | ResNet-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1
RetinaNet [11] | ResNet-50 | 32.5 | 50.9 | 34.8 | 13.9 | 35.8 | 46.7
EfficientDet-D0 [12] | EfficientNet | 34.6 | 53.0 | 37.1 | - | - | -
RetinaNet [11] | ResNet-101 | 34.4 | 53.1 | 36.8 | 14.7 | 38.5 | 49.1
RefineDet512 [15] | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4
Ours
BFE-Net | ResNet-50 | 34.6 | 52.9 | 37.2 | 16.7 | 37.3 | 47.8
BFE-Net | ResNet-101 | 36.3 | 54.4 | 38.5 | 17.9 | 39.6 | 50.1
Table 2. Comparison of speed on COCO test-dev.

Method | Backbone | AP | APS | FPS
Faster R-CNN w FPN [16] | ResNet-101 | 36.2 | 18.2 | 2.4
SSD513 [6] | ResNet-101 | 31.2 | 10.2 | 8
DSSD513 [13] | ResNet-101 | 33.2 | 13.0 | 6.4
RetinaNet [11] | ResNet-101 | 34.4 | 14.7 | 11.1
EfficientDet-D0 [12] | EfficientNet | 34.6 | - | 10.3
BFE-Net (ours) | ResNet-101 | 36.3 | 17.9 | 9.2
Table 3. Effect of each component on COCO val2017. BidiFPN: bidirectional feature pyramid network; HGM: High-level Guided Model; DFE: Dilated Feature Enhancement.

RetinaNet | BidiFPN | HGM | DFE | AP | AP50 | AP75 | APS | APM | APL
✓ |  |  |  | 32.5 | 50.9 | 34.8 | 13.9 | 35.8 | 46.7
✓ | ✓ |  |  | 33.4 | 51.6 | 35.9 | 14.6 | 36.3 | 47.0
✓ | ✓ | ✓ |  | 34.0 | 52.3 | 36.5 | 15.1 | 36.8 | 47.5
✓ | ✓ |  | ✓ | 34.1 | 52.1 | 36.7 | 15.9 | 36.9 | 47.4
✓ | ✓ | ✓ | ✓ | 34.6 | 52.9 | 37.2 | 16.7 | 37.3 | 47.8
Table 4. Ablation studies of the High-level Guided Model on COCO val2017. R + B + D: RetinaNet + BidiFPN + DFE.

R + B + D | R-C6 | H-C6 | AP | AP50 | AP75 | APS | APM | APL
✓ |  |  | 34.1 | 52.1 | 36.7 | 15.9 | 36.9 | 47.4
✓ | ✓ |  | 34.4 | 52.9 | 37.1 | 16.4 | 37.2 | 47.6
✓ |  | ✓ | 34.6 | 52.9 | 37.2 | 16.7 | 37.3 | 47.8
Table 5. Ablation studies of Dilated Feature Enhancement on COCO val2017. R + B + H: RetinaNet + BidiFPN + HGM.

R + B + H | P-Dilated | S-Dilated | R-Dilated | AP | AP50 | AP75 | APS | APM | APL
✓ |  |  |  | 34.0 | 52.3 | 36.5 | 15.1 | 36.8 | 47.5
✓ | ✓ |  |  | 34.5 | 52.7 | 37.0 | 16.4 | 37.0 | 47.8
✓ |  | ✓ |  | 34.3 | 52.6 | 36.8 | 16.0 | 37.2 | 47.8
✓ |  |  | ✓ | 34.6 | 52.9 | 37.2 | 16.7 | 37.3 | 47.8
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
