Reinforced Neighbour Feature Fusion Object Detection with Deep Learning

Abstract: Neural networks have enabled state-of-the-art approaches to achieve remarkable results on computer vision tasks such as object detection. Previous works have tried to improve performance through various object detection necks but still fail to extract features efficiently. To address the insufficient features of objects, this work builds on some of the most advanced and representative network models based on the Faster R-CNN architecture, such as Libra R-CNN, Grid R-CNN, guided anchoring, and GRoIE. On the MS COCO dataset, we observed the precision of Neighbour Feature Pyramid Network (NFPN) fusion, ResNet Region of Interest Feature Extraction (ResRoIE), and the Recursive Feature Pyramid (RFP) architecture at different object scales when these components were used in place of the corresponding original components in various networks. After replacing the neck and RoIE parts of these models with our Reinforced Neighbour Feature Fusion (RNFF) model, the average precision (AP) increases by 3.2 percentage points relative to the baseline network.


Introduction
Object detection is an essential task in deep learning; it answers the question "what objects are located where". Traditional object detection algorithms mainly rely on hand-designed features to extract geometric information such as edges, colors, and textures and then classify them with support vector machines. This approach has obvious shortcomings. For example, the detection error is large in complex scenes with significant changes in background or object scale, or with occlusion. With the advancement of deep learning, detection algorithms based on convolutional neural networks have gradually been proposed, and detection accuracy has greatly improved. Object detection has a potential impact on the development of the fundamentals of deep learning techniques, and it may help to reduce the amount of labeled data required in many deep learning tasks, such as recognition and instance segmentation [1]. It has many applications in self-driving vehicles, medical image analysis, business analytics, and face identification. Object detection in traffic scenes remains challenging and is key to supervising traffic order and maintaining road safety. Existing deep learning-based object detection algorithms are mainly divided into one-stage and two-stage detection. A one-stage detector does not require a region proposal network: the features extracted by the deep convolutional network are used directly to classify objects and regress their position coordinates. Examples include SSD [2], YOLOv1 [3], YOLOv2 [4], YOLOv3 [5], YOLOv4 [6], RetinaNet [7], CornerNet [8], CornerNet-Lite [9], CenterNet [10], FCOS [11], and ExtremeNet [12].
Since a one-stage detection network does not use a separate network to generate candidate regions, the model is small, so detection is faster than with a two-stage network, but the accuracy is lower. In contrast to one-stage detection, the Region-based Convolutional Neural Network (R-CNN) [13] introduced region proposals. It uses prior boxes to filter the areas where objects could exist and uses selective search to merge these areas into the final candidate regions; position regression and classification are then performed in the detector. Some R-CNN [13] based frameworks, such as Libra R-CNN [14], Generic Region of Interest Extractor (GRoIE) [15], CBNet [16], ThunderNet [17], and CSPNet [18], fuse features from different levels to obtain single-level features that include both semantic and location information. Some networks, such as Cascade R-CNN [19], improve the average precision (AP) by extracting features multiple times, and guided anchoring [20] modifies the anchor generation process to improve the AP. The Feature Pyramid Network (FPN) [21] uses a top-down pathway, detects objects on each feature level independently, and was introduced into Faster R-CNN [22]. Since FPN, many FPN-based feature fusion methods have appeared, such as PANet [23], ThunderNet [17], Balanced FPN in Libra R-CNN [14], NAS-FPN [24], and BiFPN in EfficientDet [25]. Nonetheless, current algorithms do not completely solve the multi-scale problem, and position and semantic information are still lost. In the top-down process, the gradients of background information in small-scale features generate large errors, exacerbating the scale imbalance in the feature fusion stage in the neck of the network. For the neck part, we report NFPN experiments conducted on LISA [26]; Table 1 shows the improvement in AP and the advantages for objects of different scales. Then, we report experiments conducted to test the Recursive Feature Pyramid (RFP) and ResRoIE methods using the Faster R-CNN architecture on the MS COCO dataset [27], including comparisons with Faster R-CNN in Table 2 and with several other methods using the Faster R-CNN architecture in Table 3.

Related Work
The existing deep learning-based object detection algorithms are divided into one-stage detection and two-stage detection.

One-Stage Detection
Contemporary research concentrates on developing object detectors from several aspects: scale awareness, spatial awareness, and task awareness. YOLOv1 [3] takes the image to be detected as the network input and performs classification and regression on the features in the output layer to obtain the predicted box and category of each object. YOLOv2 [4] optimizes the speed and accuracy of YOLOv1 and extends it to detect 9000 categories at the same time, so it is also called YOLO9000. YOLOv3 [5] draws on the ideas of ResNet and extracts features with the Darknet-53 backbone, which achieves faster speed and better performance than ResNet [28]. Compared to YOLOv2, it also uses an FPN-style feature pyramid to improve multi-scale object detection. YOLOv4 [6], based on the CSP method, scales both up and down and applies to small and large networks while maintaining speed and precision. The most common model scaling technique is to change the depth (the number of convolutional layers in a CNN) and the width (the number of convolutional filters in a convolutional layer) of the backbone and then train CNNs suitable for different devices [29].
Given that FPN makes the network structure complex, adds memory burden, and slows down detectors, YOLOF [30] offers a simple but highly efficient alternative that addresses the optimization problem without FPN. The work in [31] addresses nudity detection at various scales and against various backgrounds. CornerNet [8] concludes that the use of anchor boxes, particularly in one-stage detectors, has drawbacks, such as slowing down training and introducing additional hyperparameters. CenterNet [10] further improves CornerNet by adding a keypoint at the center of the proposed region, so that each object is detected as a triplet of keypoints. FCOS [11] borrows the idea of semantic segmentation and abandons the standard anchor boxes and object proposals, making it unnecessary to tune the associated hyperparameters. ExtremeNet [12] turns object detection into a purely appearance-based keypoint prediction problem, thus avoiding region classification and specific feature learning. First, extreme points reflect the object information better than a bounding box does. Second, the authors show that a more detailed octagonal segmentation estimate can be obtained with a simple trick. Finally, extreme points can be combined with Deep Extreme Cut [32] to convert them into segmentation masks. In [33], an algorithm based on a two-stage detection network is proposed to classify and detect people, vehicles, pets, and other objects, including distant and surrounding ones.

Two-Stage Detection
Fast R-CNN abandons R-CNN's method of extracting features separately for each proposed region and introduces the RoI pooling algorithm to select features from the feature map of the entire image. The purpose is to avoid the time-consuming repeated computation of features for each candidate region in R-CNN. Faster R-CNN proposes a Region Proposal Network (RPN) to generate candidate regions, which greatly improves the efficiency of candidate region generation. Feature extraction methods in the backbone mainly include ResNet [28], ResNeXt [34], Res2Net [35], and HRNet [36], while feature fusion in the neck integrates the output features from each level of the backbone. The feature extraction method of SSD [2] is shown in Figure 1a. SSD is a typical method that uses multi-scale features without fusion: it detects on features of different resolutions to cope with the exponential drop in resolution as the CNN becomes deeper. FPN (Figure 2) introduces a top-down fusion method in the feature fusion part, which significantly benefits large objects, whose detection relies on deep features. However, location information is lost due to the reduced resolution. Conversely, at levels where position information is sufficiently preserved, semantic information is not fully extracted. PANet, shown in Figure 1b, adds a bottom-up path for further feature fusion after FPN, so that the position information of the shallow layers is propagated to the deep layers. However, PANet has problems similar to those of FPN. The error caused by FPN up-sampling continues to propagate through the bottom-up process with PANet's down-sampling and may even be amplified. Several nearby features may be selected for the same object, so non-maximum suppression is applied after detection to remove the less significant ones [37].
ZigZagNet [38] improves on PANet so that, in the top-down and bottom-up fusion process, information is also exchanged between the feature levels, enhancing the multi-scale context information in both directions. ThunderNet [17] is shown in Figure 1e for comparison with the proposed EFM. Anchor-based and anchor-free branches can be connected with symmetric structures. Compared with a single branch, symmetry is used to combine the knowledge extracted by the two components. Moreover, the parallel anchor-based and anchor-free branches run symmetrically to select the most suitable features and anchor boxes [39].
Figure 3 shows the overall flow of our model. Our goal is to reduce the feature imbalance caused by changes in object scale, and we design three methods to this end. First, we create a Neighbour Feature Pyramid Network (NFPN) architecture to verify the imbalance among object scales at different feature levels. Then, we propose a Recursive Feature Pyramid (RFP) fusion method, which reduces the imbalance while integrating feature information from different layers. Finally, we propose ResNet Region of Interest Feature Extraction (ResRoIE) to reduce the mutual influence on the gradients caused by objects of different scales.

Neighbour Feature Pyramid Network (NFPN)
For the neck part of the network, we design the NFPN architecture (Figure 4) to verify, for objects of different sizes, the deterioration in AP for small objects and the improvement for large objects caused by multiple up-sampling operations between layers. NFPN fuses the features of adjacent levels. The lower-level features are up-sampled by interpolation and then convolved to refine the interpolated features. The upper-level features are pooled and then convolved with a stride of 2. A convolution with a stride of 1 is then applied, the features of the current level are added for fusion, and the fused features are added to the initial input features to compensate for the loss of small-object information caused by up-sampling. In short, the features of each level are fully used and integrated into the neighbouring level to reduce feature loss and interference. In the algorithm below, P_i denotes the output feature from the i-th level of the FPN structure (Figure 4), and O_i is the output from the i-th level of NFPN. The path from P_{i+1} to O_i stands for up-sampling, by either nearest-neighbour or bilinear interpolation. The path from P_{i−1} to O_i represents pooling, which may be average, maximum, or minimum pooling. The first part of O_3 is the up-sampled P_4 plus P_3, passed through a convolution; the second part is the feature P_3 of this level. The first part of O_7 is the pooled P_6 plus P_7, passed through a convolution; the second part is the feature P_7 of this level. O_4, O_5, and O_6 also consist of two parts: the first is the feature P_i of this level, and the second is P_i plus the convolved up-sampling of P_{i+1} and the convolved pooling of P_{i−1}. The specific algorithm for the fusion of adjacent layers is as follows.
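One possible reading of the description above can be written as equations. The operator names Up (up-sampling by interpolation), Pool (pooling to the next lower resolution), and Conv (a convolution whose kernel size is left open) are notation introduced here for illustration only, and the grouping of terms for the middle levels is an assumption, since the text admits more than one reading.

```latex
\begin{align}
O_3 &= \mathrm{Conv}\bigl(\mathrm{Up}(P_4) + P_3\bigr) + P_3, \\
O_i &= \Bigl[P_i + \mathrm{Conv}\bigl(\mathrm{Up}(P_{i+1})\bigr)
        + \mathrm{Conv}\bigl(\mathrm{Pool}(P_{i-1})\bigr)\Bigr] + P_i,
        \qquad i = 4, 5, 6, \\
O_7 &= \mathrm{Conv}\bigl(\mathrm{Pool}(P_6) + P_7\bigr) + P_7.
\end{align}
```

In each case the trailing P_i term is the shortcut that adds the fused result back to the initial input feature of the same level.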

ResNet Region of Interest Feature Extraction (ResRoIE)
In the RoIE (RoI Extractor) part of the network (Figure 5), because only the FPN output features are extracted, an imbalance among multi-scale objects remains. The additional network layers introduced by adjacent feature fusion affect the gradients of small-scale objects and cause vanishing gradients. Compared with Faster RCNN RPN50, the accuracy for large-scale objects in the BiFPN × 2 row is lower because the fusion network is too deep, which leads to a degree of gradient vanishing and reduces the accuracy for objects at all scales. Nevertheless, its accuracy for large-scale objects is still higher than that of Faster R-CNN, which verifies the feature enhancement effect of the top-down process on large-scale objects and its feature weakening effect on small-scale objects. In the RoIE stage, each region proposal is mapped to the corresponding feature level according to its area, the level is convolved to obtain 7 × 7 features, and classification and regression are performed. The features obtained in this way contain only the information produced by one fusion method. To confirm this conjecture quickly and efficiently, we conducted experiments on the Laboratory for Intelligent and Safe Automobiles (LISA) Traffic Sign Dataset [26], in which the object resolution varies from 6 × 6 to 167 × 168. After the first ResNet [28] layer in the network backbone, the feature resolution has been reduced to 1/4. Large objects will retain location information, whereas small objects, such as 6 × 6 objects, may have only weak semantic information because their remaining resolution is only 1.5 × 1.5 if convolution is not considered.
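The arithmetic behind the 1.5 × 1.5 figure can be checked with a short sketch. The stride of 4 comes from the text; the deeper strides 8, 16, and 32 are the usual ResNet stage strides and are an assumption here, as is the helper name `remaining_side`.

```python
def remaining_side(object_side_px, stride):
    """Side length, in feature-map cells, left for an object after downsampling."""
    return object_side_px / stride

# Stride 4 is stated in the text; 8, 16 and 32 are assumed typical ResNet strides.
STRIDES = (4, 8, 16, 32)

# Smallest and largest object sides reported for LISA (6 and 167 pixels).
for side in (6, 167):
    sizes = [remaining_side(side, s) for s in STRIDES]
    print(side, sizes)
```

Under these assumptions a 6-pixel object covers less than one cell at every level deeper than the first, while a 167-pixel object still spans several cells at stride 32.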
In addition, the resolution will be only 4 × 4 if 3 × 3 convolutions are used twice, and only 1 × 1 pixel will not be influenced by background pixels; that is, 15/16 of the pixels will be influenced by invalid pixels, which can cause incorrect gradients for deeper-level features when back-propagation is applied in top-down feature fusion. Because of the top-down fusion method, the fused features at shallow levels contain many invalid pixels, which the RoIE layer will also extract. Therefore, we also extract the FPN features in our neck structure so that the gradient can bypass the top-down process, reducing the influence of invalid background pixels. For example, features fused using the bottom-up method may benefit large objects, but accuracy on small objects is lost. Conversely, if only the top-down procedure is used to fuse information, large-scale objects cannot benefit from two-way fusion. As shown in Figure 5, following the shortcut idea in ResNet, the sum of FPN-out and NFPN-out is used as the output of the neck part to address the imbalance among multi-scale objects. That is, after the adjacent features are merged in the neck stage, the FPN features are output at the same time.
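The shortcut described above, in which the neck output is the sum of FPN-out and NFPN-out, can be sketched as a per-level element-wise addition. This is a minimal illustration, not the authors' implementation: feature maps are plain nested lists with a single channel, and the function names `add_maps` and `shortcut_neck_output` are hypothetical.

```python
def add_maps(a, b):
    """Element-wise sum of two equally sized 2-D feature maps (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must share the same resolution")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def shortcut_neck_output(fpn_out, nfpn_out):
    """Per-level sum of FPN and NFPN outputs, in the spirit of a ResNet shortcut."""
    if fpn_out.keys() != nfpn_out.keys():
        raise ValueError("both necks must provide the same pyramid levels")
    return {lvl: add_maps(fpn_out[lvl], nfpn_out[lvl]) for lvl in fpn_out}

# Toy single-channel features for two pyramid levels (values are made up).
fpn_out = {3: [[1.0, 2.0], [3.0, 4.0]], 4: [[5.0]]}
nfpn_out = {3: [[0.5, 0.5], [0.5, 0.5]], 4: [[1.0]]}
neck_out = shortcut_neck_output(fpn_out, nfpn_out)
```

Because the FPN term enters the sum directly, its gradient does not have to pass through the additional fusion layers, which is the bypass effect the paragraph describes.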

Recursive Feature Pyramid (RFP)
Because the fused features after the top-down operation are generated from the currently existing pixels rather than the original pixels, the influence of background pixels persists when 3 × 3 convolutions are used, owing to the enlarged receptive field. To address this issue, we refer to the idea applied in DetectoRS [39] (Figure 6), namely, "thinking twice" for detection. It is built on FPN and adds feedback connections from the FPN to the bottom-up backbone layers. For the fused features, the backbone is used to extract features again, as shown in Figure 7, and the extracted features are then fused with the features before extraction. Specifically, after combining the features from each level, we use a convolution block corresponding to the same depth in the backbone to extract features again, reducing the impact of background pixels after up-sampling and 3 × 3 convolutions. Then, the extracted features are summed with the output features from NFPN fusion. RFP serves as the feature pyramid network: it takes the level 3-7 features C_3, C_4, C_5, C_6, C_7 from the backbone network and recursively applies top-down and bottom-up bidirectional feature fusion. In this way, we build a more powerful and robust feature extraction module, RNFF, which improves object detection performance by recursively feeding the first extracted features back into the network to extract features again. However, unlike DetectoRS, we do not need to build a new backbone and use the original image; instead, after the neighbour feature fusion step, the same backbone is used again for extraction, which reduces the number of model parameters.
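The feedback step described above can be illustrated with a small sketch: the fused features of each level are passed once more through a backbone block of matching depth, and the result is summed with the NFPN output. Everything here is a simplification under stated assumptions: features are flat lists of floats, `refine_with_backbone` is a hypothetical name, and `halve` is only a stand-in for a real convolution block.

```python
def refine_with_backbone(nfpn_out, backbone_block):
    """Re-extract each level with a backbone block and sum with the NFPN output."""
    refined = {}
    for level, feat in nfpn_out.items():
        extracted = backbone_block(level, feat)  # second pass through the backbone
        if len(extracted) != len(feat):
            raise ValueError("backbone block must preserve the feature size")
        refined[level] = [a + b for a, b in zip(feat, extracted)]
    return refined

def halve(level, feat):
    """Hypothetical stand-in for the backbone convolution block at this depth."""
    return [0.5 * v for v in feat]

# Toy NFPN outputs for two levels (values are made up).
nfpn_out = {3: [2.0, 4.0], 4: [8.0]}
rfp_out = refine_with_backbone(nfpn_out, halve)
```

The sketch reuses one callable for all levels to mirror the point that the same backbone is used again rather than a second one.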

Experiments and Results
In this part, we report three groups of comparative experiments. The first group supports the effectiveness of addressing feature imbalance, using the NFPN architecture as the basis for comparison. The second group verifies the improvement achieved with our method. In the third group, the application of our Reinforced Neighbour Feature Fusion (RNFF) method in combination with other recent networks is investigated by replacing the original neck and RoIE parts of the networks. We use an auxiliary multi-scale feature enhancement module to assist in extracting multi-scale shallow features and merging them with the features extracted by the backbone, which considerably improves the representation of small objects [40].

Dataset
We conducted experiments on the LISA Traffic Sign Dataset for the NFPN architecture and on the MS COCO dataset [27] for the RNFF architecture. The LISA Traffic Sign Dataset contains 6k images divided into 47 traffic sign categories, and the MS COCO dataset contains more than 11k images of 80 classes.

Implementation Details
Our default experimental configuration was as follows unless otherwise specified. The experimental platform had 8 GPUs (TITAN V). The total number of epochs was 24, and we selected the best mAP among all epochs. The learning rate was 0.02, the weight decay was 0.0001, and the batch size on each GPU was 2. We used the Faster R-CNN architecture with ResNet-50 as the backbone. We used the COCO evaluation metrics in the MMDetection framework to evaluate and compare RNFF and the other methods. In the two-stage network experiments, taking Faster R-CNN and its variants as examples, replacing the corresponding modules in the model improved both the average precision and the precision at each scale, whether ResNet-50 or the ResNet-101 series was used as the backbone. In the experiments with ResNet-50, the mAP increase was highest for GRoIE, from 37.5% to 40.1%: the mAP increased by 2.6 percentage points, and the precision for small, medium, and large objects increased by 1.4, 1.2, and 4.4 percentage points, respectively.
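For reference, the stated settings can be collected in a small sketch that also derives the total batch size and the reported gain. The dictionary keys and the helper `gain_in_points` are illustrative names chosen here; they are not MMDetection configuration keys, and the total batch size is derived rather than stated in the text.

```python
# Values stated in the text.
NUM_GPUS = 8
SAMPLES_PER_GPU = 2
train_settings = {
    "max_epochs": 24,
    "learning_rate": 0.02,
    "weight_decay": 0.0001,
    "backbone": "ResNet-50",
    "detector": "Faster R-CNN",
}

# Derived quantity (not stated explicitly in the text).
total_batch_size = NUM_GPUS * SAMPLES_PER_GPU

def gain_in_points(before, after):
    """Difference between two percentages, in percentage points."""
    return round(after - before, 1)

# mAP reported for GRoIE before and after the module replacement.
groie_gain = gain_in_points(37.5, 40.1)
```

Reporting the gain in percentage points avoids confusing it with a relative improvement, which would be a different number.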

Module-Wise Ablation Analysis
On the MS COCO dataset, we observed the behaviour of NFPN fusion, ResRoIE, and the RFP architecture at different object scales and the improvements in precision when these components were used in place of the corresponding original components in various networks. On the LISA Traffic Sign Dataset, we used FPN, BFP, and BiFPN to observe the influence of these fusion methods on precision and used two stacked BiFPN modules to observe the influence of top-down fusion on precision.
For the NFPN architecture, we performed ablation experiments on two datasets. The LISA Traffic Sign Dataset experiments were performed to observe the impact of top-down and bottom-up fusion on objects at different scales and the improvement enabled by our NFPN architecture. We used the five feature levels generated by FPN for feature fusion. The fusion process is divided into three steps. First, the neighbouring levels of the reference level are resized to the same resolution as the reference level through pooling or interpolation. Then, a 3 × 3 convolution kernel is used to convolve the resized features, which are added to the reference-level features. Finally, a 1 × 1 convolution kernel is applied. We used summation (+) to fuse two parts with the same resolution. The experimental data show that the NFPN structure improves mAP on the LISA dataset, but the effect on objects of different scales varies greatly. For large objects, although NFPN gains 5.2% over the baseline network, the improvement is even larger when other fusion methods are used in the baseline network. For medium objects, NFPN gains 1.1 percentage points; apart from RNFF, which gains 0.2 percentage points, the other methods lose significant accuracy. For small objects, the Faster R-CNN network using FPN performs best, NFPN loses the least, and the accuracy of the remaining networks drops significantly. We attribute this phenomenon to the top-down (FPN) process: background pixels cause an initial interference with small-scale objects, but through the lateral convolution, the original backbone features, which are not up-sampled, are convolved and added, thereby reducing this interference.
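The three fusion steps described above can be sketched for a single reference level. This is a minimal single-channel illustration in pure Python, not the authors' implementation: the factor-2 resizing, the shared 3 × 3 kernel `k3`, the scalar `w1` standing in for a 1 × 1 convolution, and all function names are assumptions made for the example.

```python
def upsample_nearest_2x(feat):
    """Nearest-neighbour up-sampling by a factor of 2 in both directions."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def avg_pool_2x2(feat):
    """Average pooling with a 2 x 2 window and stride 2."""
    h, w = len(feat) // 2, len(feat[0]) // 2
    return [[(feat[2 * i][2 * j] + feat[2 * i][2 * j + 1]
              + feat[2 * i + 1][2 * j] + feat[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]

def conv3x3(feat, kernel):
    """3 x 3 cross-correlation with zero padding and stride 1."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if y in range(h) and x in range(w):
                        acc += kernel[di + 1][dj + 1] * feat[y][x]
            out[i][j] = acc
    return out

def add(a, b):
    """Element-wise sum of two feature maps with the same resolution."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse_level(finer, reference, coarser, k3, w1):
    """Three-step neighbour fusion for one reference level."""
    fused = reference
    if finer is not None:    # steps 1 and 2: pool, convolve, add
        fused = add(fused, conv3x3(avg_pool_2x2(finer), k3))
    if coarser is not None:  # steps 1 and 2: interpolate, convolve, add
        fused = add(fused, conv3x3(upsample_nearest_2x(coarser), k3))
    # step 3: a 1 x 1 convolution, which for one channel is a scalar weight
    return [[w1 * v for v in row] for row in fused]

# Toy example with an identity 3 x 3 kernel (values are made up).
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
finer = [[1.0] * 4 for _ in range(4)]
reference = [[2.0, 2.0], [2.0, 2.0]]
coarser = [[3.0]]
fused = fuse_level(finer, reference, coarser, identity, 1.0)
```

Passing `None` for `finer` or `coarser` covers the boundary levels, which have only one neighbour.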
After FPN, a further feature fusion is performed, namely the bottom-up process, which transmits the position information of the shallow layers to the deep layers. Because a large object has a larger resolution, its influence is much greater than that of the surrounding background pixels. The enhancement process from semantic information (top-down) to location information (bottom-up) is verified by NFPN × 2 in Table 1.
In the experiment using ResRoIE + NFPN, mAP increases by 2.0% for medium objects, while the accuracy for small objects increases by 1.2%. The common point of all experiments in Table 3 is that, as the feature fusion part of the network grows, the accuracy for small and medium objects decreases. For large objects, the accuracy of NFPN × 2 + ResRoIE is 4.7 percentage points higher than that of NFPN + ResRoIE. Like BiFPN × 2 in Table 1, it adds network layers, and in both cases accuracy decreases. This shows that ResRoIE can reduce the vanishing gradients caused by enlarging the feature fusion part of the network. In the experiment with the proposed RFP + ResRoIE method, using RFP increases the result by 0.2%. The accuracy for objects at each scale also improves, which supports the hypothesis that there is noise in the feature fusion part and that the RFP algorithm can reduce it. Subsequently, an adaptive weight standardization strategy was used to reduce the mutual influence between features of different scales, and the accuracy for small, medium, and large objects improved slightly. When SENet is used to add attention to the feature fusion of each layer, the accuracy drops by 0.1%, indicating that when the features of different layers are fused, the attention paid to object features is equally paid to noise. We compare the combination of NFPN, ResRoIE, and RFP with some mainstream two-stage networks, and the experiments show that the algorithm proposed in this paper increases the precision of object detection. For Balanced FPN, the neck module of Libra R-CNN is imported into Faster R-CNN, and the rest remains unchanged.
In addition, weight normalization (ConvAWS) is added to the network, and we draw on the weighted feature fusion idea of BiFPN in EfficientDet: an attention mechanism adds learnable weights when features of different layers are fused. Experiments show that the proposed combination of ResRoIE and RFP achieves more reliable performance and improves significantly on some current two-stage methods. For small and medium objects, the results improve in turn from PANet to Balanced FPN, GRoIE, and RFP + ResRoIE. This indicates that, on a larger dataset with a relatively balanced distribution of categories and scales, more feature fusion yields more accurate detection of small and medium targets, as presented in Figure 8. It is worth noting that, on a small-scale dataset, feature fusion may lead to information loss and noise interference because of the imbalanced distribution of scales and categories, thereby degrading the gradients. For large objects, the accuracy of GRoIE is lower than that of Faster R-CNN. The reason is that features of all scales are used in its detection: the shallow features of small-scale objects are easily disturbed by up-sampling, and large objects then use these disturbed features. This decreases the precision for large objects, as shown in Figure 9.
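The learnable-weight fusion mentioned at the start of the paragraph above can be illustrated with the fast normalized fusion used by BiFPN in EfficientDet, in which each input is scaled by a non-negative weight divided by the sum of all weights. The text does not state which exact weighting form was used in the experiments, so this is a sketch under that assumption; features are flat lists of floats, and the function name is chosen here for illustration.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted sum of same-sized feature vectors with non-negative, normalized weights."""
    if len(features) != len(weights):
        raise ValueError("one weight is needed per input feature")
    clipped = [max(w, 0.0) for w in weights]  # ReLU keeps the weights non-negative
    denom = eps + sum(clipped)                # eps avoids division by zero
    length = len(features[0])
    return [sum(w * f[k] for w, f in zip(clipped, features)) / denom
            for k in range(length)]

# Two toy feature vectors fused with equal weights (values are made up).
fused_equal = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])

# A negative weight is clipped to zero, so only the first input contributes.
fused_clipped = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, -5.0])
```

In a trained network the weights would be learned parameters; here they are fixed numbers so that the normalization itself is easy to follow.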

Conclusions
To address the insufficient features of objects, this work builds on some of the most advanced and representative network models based on Faster R-CNN, such as Libra R-CNN, Grid R-CNN, guided anchoring, and GRoIE, and observes the precision of NFPN fusion, ResRoIE, and the RFP architecture at different object scales. ResRoIE was added to the NFPN network on the LISA dataset, which verified that NFPN suffers from vanishing gradients and that ResRoIE can alleviate this problem. Then, experiments were performed on other datasets of different sizes. The results on the small and medium datasets agree with the LISA conclusion that ResRoIE can alleviate vanishing gradients, thereby improving the accuracy for objects at various scales. The two-stage network experiments took Faster R-CNN and its variants as references after replacing the corresponding modules in the model. In the experiments with ResNet-50 as the backbone, the mAP increase was highest for GRoIE, from 37.5% to 40.1%. On a large-scale dataset such as MS COCO, the NFPN + ResRoIE proposed in this article improves the detection accuracy of current state-of-the-art two-stage networks. Some well-known recent object detection methods were selected for comparison; in the experiments based on those methods, we kept ResNet-50 as a fixed backbone to observe the performance improvement brought by RNFF. The experiments show that the algorithm proposed in this paper improves the accuracy of object detection.