CP-SSD: Context Information Scene Perception Object Detection Based on SSD

Abstract: The Single Shot MultiBox Detector (SSD) has achieved good results in object detection, but it suffers from an insufficient understanding of context information and the loss of features in deep layers. To alleviate these problems, we propose a single-shot object detection network, Context Perception-SSD (CP-SSD). CP-SSD promotes the network's understanding of context information through a context information scene perception module, which captures context information for objects of different scales. For deep-layer feature maps, a semantic activation module adjusts, through self-supervised learning, the interdependence between contextual feature information and channels, enhancing useful semantic information. CP-SSD was validated on the benchmark dataset PASCAL VOC 2007. The experimental results show that the mean Average Precision (mAP) of CP-SSD reaches 77.8%, which is 0.6% higher than that of SSD, and that detection is significantly improved on images in which the object is difficult to distinguish from the background.


Introduction
Object detection is one of the main tasks of image processing. Its main purpose is to accurately locate and classify objects in images. It has been widely used in many areas, such as face recognition, road detection, and driverless cars. Traditional object detection methods, such as the Histogram of Oriented Gradients (HOG) [1] and the Scale Invariant Feature Transform (SIFT) [2], are based on hand-crafted features (e.g., RGB color, texture, Gabor filters, and gradients). Hand-crafted features lack sufficient discriminative representation, generalize poorly, and are easily affected by low-contrast images. As a result, it is difficult and time-consuming to perform object detection on large and complex datasets.
Deep convolutional neural network methods promote the understanding of dynamic objects. However, they still face the challenges of a lack of rich semantic features and an insufficient understanding of context information. In the Region-Convolutional Neural Network (R-CNN) series, the selection of region proposals and the repeated convolution of feature maps greatly increase the time and complexity of object detection. R-CNN [3] is several times more accurate than traditional algorithms based on HOG and SIFT. However, obtaining region proposals by selective search, as R-CNN does, loses a lot of contextual information. The Single Shot MultiBox Detector (SSD) [4] algorithm uses an end-to-end network structure that removes the region-proposal step of R-CNN. It inputs the entire image directly into the network and predicts objects of different sizes using feature maps of different scales in the Convolutional Neural Network (CNN) [5]. In the backbone network, Visual Geometry Group 16 (VGG16) [6], low-level detection feature maps are generated, and several higher-level detection feature maps are then constructed on top of them to detect objects in a hierarchical manner.

Related Work
In recent years, CNNs have achieved great success in computer vision tasks such as image classification [8][9][10][11][12], segmentation [13,14] and object detection [3,4,[15][16][17][18][19][20][21]. Among them, object detection is a basic task that has been extensively studied. The research community has proposed two frameworks for object detection: the two-stage framework and the single-stage framework. In the two-stage framework, such as the R-CNN series, region proposals with different scales and aspect ratios are predicted from feature maps extracted by a CNN. Classification and regression are then carried out on features extracted from these region proposals. R-CNN [3] introduced the deep learning mechanism into object detection for the first time. It generates region proposals that define the set of candidate detections available to the detector. Each region proposal is input into a large CNN to extract features, which are then classified by category-specific linear Support Vector Machines (SVMs). Finally, the object location is refined by regression. SPP-Net [17] adds a spatial pyramid pooling layer to the R-CNN network structure, which removes R-CNN's fixed-size requirement on the input region proposals and thereby helps to improve detection accuracy. Fast R-CNN [18] introduces a region of interest (RoI) pooling layer based on SPP-Net, which reduces the number of convolution layers and greatly improves detection speed. Faster R-CNN introduces a Region Proposal Network (RPN) into the region-proposal step of Fast R-CNN.
Region proposals are generated on the last convolutional feature map by the RPN module and input into the RoI pooling layer of Fast R-CNN, which optimizes the selection of region proposals, reduces repeated feature extraction, and improves both the accuracy of region extraction and the network training speed. Mask R-CNN [20] improves the RoI pooling layer of Faster R-CNN [19] into RoIAlign, adopting bilinear interpolation to reduce the position error of the bounding-box regression. It also adds a mask-generation task, which improves detection accuracy to some extent.
In single-stage frameworks, such as the You Only Look Once (YOLO) [21] series and SSD, the object classifiers and regressors are applied in a dense manner, without object-based pruning: they classify and regress a set of pre-computed anchors. The YOLO algorithm uses an end-to-end training mode with a simplified network structure. Although its accuracy is slightly worse than that of the R-CNN series, it is much faster. It reduces the error rate of background detections and makes better use of the global information of the image. SSD also greatly improves speed. It uses a backbone network (for example, VGG16) to generate a low-level detection feature map, and on this basis constructs several layers of object detection feature maps that learn semantic information in a layered manner, with lower layers detecting smaller objects and higher layers detecting larger objects. This eliminates region proposals and the subsequent pixel-resampling stages.

CP-SSD
CP-SSD (Context Perception SSD) is a single-shot object detection network based on SSD, which consists of three main parts: the SSD model; the context information scene perception module, which captures local context information of different sizes; and the semantic activation module, which enriches semantic information in a self-supervised manner. Please refer to Figure 1 for the structure of CP-SSD.
In SSD, VGG16 is used as the backbone network to generate the low-level detection feature map U. The feature map U is then continuously downsampled through a series of convolutional layers with a stride of 2 (i.e., fc7 to conv9_2), and anchors of different sizes and aspect ratios are applied in a hierarchical manner, so as to detect objects from small to large.
In the context information scene perception module, we used multiple dilated rate convolution layers in parallel and each dilated convolution layer has different dilated rates. The larger the dilated rate is, the larger the receptive field of the convolution kernel is. The context information sensing module performs feature extraction on the feature map U through convolution kernels of different receptive fields so that the model can perceive changes in the context information between different scales and different sub-regions. In this way, the loss of feature information is reduced, and the image is understood more comprehensively.
In the deeper detection layer, a higher level of detection feature map is enhanced using a semantic activation module. In order to detect objects of different sizes, the feature map is downsampled in the fc7 to conv9_2 layer, which reduces the resolution of feature maps and increases the receptive field of the model. However, semantic information and location information are lost in each downsampling. Therefore, the semantic activation module is used on fc7 to conv9_2 to learn the relationship between the channel and the object by self-supervised learning, so as to adjust and enrich the semantic information.


Dilated Convolution and Receptive Field
The dilated convolution [22] increases the receptive field of the convolution kernel without introducing extra parameters. The 1-D dilated convolution is defined as follows:

y[i] = ∑_{k=1}^{K} x[i + r·k] · w[k] (1)

Here, x[i] denotes the input signal, y[i] denotes the output signal, r denotes the dilated rate, w[k] denotes the k-th parameter of the convolution kernel, and K is the size of the convolution kernel. In the standard convolution, r = 1. The 2-D dilated convolution is constructed by inserting zeros between the weights of the convolution kernel. For a convolution kernel of size k × k, the size of the resulting dilated convolution kernel is k_d × k_d, where k_d = k + (k − 1) × (r − 1). Therefore, the larger the dilated rate r, the larger the receptive field of the convolution kernel. For example, for a convolution kernel with k = 3 and r = 4, the corresponding receptive field size is 9. Figure 2 shows the dilated convolution kernel for different dilated rates; the dark portions denote the effective weights, and the white portions denote the inserted zeros.
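The two relationships above can be checked with a short sketch in plain Python; the function names are ours, chosen for illustration:

```python
def dilated_kernel_size(k: int, r: int) -> int:
    """Effective size k_d of a k x k kernel dilated with rate r:
    k_d = k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

def dilated_conv1d(x, w, r):
    """1-D dilated convolution per Equation (1): y[i] = sum_k x[i + r*k] * w[k],
    computed only where the dilated kernel fits inside the signal."""
    K = len(w)
    span = r * (K - 1)  # distance covered by the dilated kernel
    return [sum(x[i + r * k] * w[k] for k in range(K))
            for i in range(len(x) - span)]

# Standard convolution (r = 1) leaves the kernel size unchanged.
assert dilated_kernel_size(3, 1) == 3
# The example in the text: k = 3, r = 4 gives a 9 x 9 receptive field.
assert dilated_kernel_size(3, 4) == 9
# With rate 2, a 2-tap kernel sums elements two positions apart.
assert dilated_conv1d([1, 2, 3, 4, 5], [1, 1], 2) == [4, 6, 8]
```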


Context Information Scene Perception Module
In object detection, the objects to be detected usually have different scales, so the feature map must contain feature information from receptive fields at different scales. In deep learning, the size of the receptive field can roughly be regarded as the degree to which the model utilizes context information. At higher levels, however, previously important semantic information often cannot be combined by the network. Inspired by PSPNet [22], we designed a context information scene perception module that achieves this goal through parallel dilated convolutions with different dilated rates. The same feature map is input to these convolutional layers, and different dilated rates d give the convolution kernels different receptive fields, so that feature information of different sizes is sampled. The structure of the context information scene perception module is shown in Figure 3. First, a 1 × 1 convolution reduces the number of channels of the feature map U ∈ R^(W×H×512), so as to obtain a feature map U′. Then, dilated convolutions with four different dilated rates, (d1, d2, d3, d4) = (1, 2, 4, 6), are applied in parallel to sample features from U′, and the resulting feature maps are concatenated together.
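A minimal PyTorch sketch of this idea follows. The channel counts (512 in, 128 per branch) and class name are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ContextScenePerception(nn.Module):
    """Parallel dilated convolutions over the same feature map, with outputs
    concatenated along the channel axis (a sketch, not the exact CP-SSD layer)."""
    def __init__(self, in_ch=512, mid_ch=128, rates=(1, 2, 4, 6)):
        super().__init__()
        # 1x1 convolution to reduce the channel count of U.
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # One 3x3 branch per dilated rate; padding=d keeps the spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=d, dilation=d)
            for d in rates
        ])

    def forward(self, u):
        u_red = self.reduce(u)
        # Each branch sees a different receptive field; concatenate the results.
        return torch.cat([b(u_red) for b in self.branches], dim=1)

# Example: a conv4_3-sized 38x38 feature map with 512 channels.
x = torch.randn(1, 512, 38, 38)
out = ContextScenePerception()(x)
print(out.shape)  # torch.Size([1, 512, 38, 38])
```

Because padding equals the dilated rate for a 3 × 3 kernel, every branch preserves the spatial resolution, so the four outputs can be concatenated directly.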


Semantic Activation Block
The semantic activation module is used to adjust the interdependence between contextual feature information and channels by self-supervised learning, and to selectively enhance useful semantic information according to the self-attention mechanism and suppress harmful feature information.


Figure 4. Semantic activation block.
The semantic activation module is shown in Figure 4. It consists of three steps: spatial pooling f_gap(·), channel-wise attention learning f_fcl(·, θ), and channel-weight adaptation f_fuse(·, ·). Spatial pooling: For a given input X ∈ R^(H×W×C), X is globally pooled by f_gap(·) to generate V ∈ R^C; the i-th element of V is obtained as follows:

v_i = (1/(H × W)) ∑_{m=1}^{H} ∑_{n=1}^{W} x_i(m, n) (2)

Channel-wise attention learning: In order to make full use of the information summarized in V, the f_fcl(·, θ) operation is used to capture the direct correlation between channels. To do this, a gating mechanism with a sigmoid activation function is used:

s = f_fcl(V, θ) = σ(θ2 · φ(θ1 · V)) (3)

Here, φ denotes the ReLU activation function, σ denotes the sigmoid activation function, θ1 ∈ R^(C′×C) and θ2 ∈ R^(C×C′). In order to reduce the complexity of the model, we use two fully connected layers to form a bottleneck: the dimension is first reduced to C′ and then restored to C.
In our experiments, we set C′ = C/2 in all modules. Channel-weight adaptation: The final output selects the relevant semantic features using f_fuse(·, ·), so that related semantic information is assigned a larger weight and unrelated semantic information a smaller weight, generating the final feature map X̃. The c-th channel of X̃ is defined as:

x̃_c = f_fuse(x_c, s_c) = s_c · x_c (4)

Here, X̃ = [x̃_1, x̃_2, . . . , x̃_C] and x̃_c ∈ R^(H×W).
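The three steps above can be sketched as a squeeze-and-excitation style PyTorch module. The channel count (1024, as in an fc7-sized map) and class name are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class SemanticActivation(nn.Module):
    """Channel reweighting in three steps: spatial pooling, channel-wise
    attention learning, channel-weight adaptation (a sketch, with C' = C/2)."""
    def __init__(self, channels=1024):
        super().__init__()
        mid = channels // 2                 # C' = C / 2, as in the text
        self.fc1 = nn.Linear(channels, mid) # theta_1: reduce to C'
        self.fc2 = nn.Linear(mid, channels) # theta_2: restore to C

    def forward(self, x):
        # f_gap: global average over H x W gives v in R^C per sample.
        v = x.mean(dim=(2, 3))
        # f_fcl: bottleneck with ReLU, gated by a sigmoid -> weights in (0, 1).
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))
        # f_fuse: scale each channel x_c by its weight s_c.
        return x * s.unsqueeze(-1).unsqueeze(-1)

feat = torch.randn(2, 1024, 19, 19)  # e.g., an fc7-sized detection map
out = SemanticActivation()(feat)
print(out.shape)  # torch.Size([2, 1024, 19, 19])
```

Since the sigmoid keeps every weight s_c in (0, 1), each output channel is a damped copy of its input channel; the module can only suppress, never amplify, feature responses.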

Analysis and Discussion of Experimental Results
We implemented the proposed CP-SSD model with the PyTorch [23] deep learning framework. The server used to train the model was configured with an Intel(R) Xeon(R) E5-2620 v3 2.40 GHz CPU and a Tesla K80 GPU, running 64-bit Ubuntu.

Datasets and Data Augmentation
PASCAL VOC [24] is a benchmark dataset for visual object classification and detection, which includes 20 categories. The VOC2007 test split is widely used by the research community for validating the performance of object detection models. In our training process, all samples from the train and val splits of VOC2007 and VOC2012 were used as the training set. The training set contains 16,551 images with 40,058 objects, and the testing set contains 4952 images with 12,032 objects. In this dataset, smaller objects account for a large proportion of the objects.
In order to make the model more robust to various input object sizes and shapes, each training image is randomly sampled in one of the following ways: (1) the original image is used without any further processing; (2) a patch is sampled so that the minimum overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9; (3) a portion of the original image is cropped randomly.
After the above sampling step, each sampled area is resized to a fixed size (300 × 300) and flipped horizontally with a probability of 0.5.
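The final resize-and-flip step can be sketched in PyTorch as follows; the function name and the optional crop argument (standing in for whichever sampling option was chosen) are ours:

```python
import torch
import torch.nn.functional as F

def sample_and_resize(img, crop=None, size=300, flip_p=0.5):
    """Optionally crop a (C, H, W) image tensor, resize it to size x size,
    and flip it horizontally with probability flip_p (a sketch of the
    preprocessing described above, not the paper's exact pipeline)."""
    if crop is not None:
        x0, y0, x1, y1 = crop
        img = img[:, y0:y1, x0:x1]
    # Bilinear resize to the fixed SSD input resolution (expects a batch dim).
    img = F.interpolate(img.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False).squeeze(0)
    if torch.rand(1).item() < flip_p:
        img = torch.flip(img, dims=[2])  # flip along the width axis
    return img

img = torch.rand(3, 480, 640)
out = sample_and_resize(img, crop=(100, 50, 400, 350))
print(out.shape)  # torch.Size([3, 300, 300])
```

In a real pipeline the ground-truth boxes would need the same crop, resize, and flip transforms applied to their coordinates.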

Experimental Parameter Settings
In order to compare the effectiveness of the CP-SSD network model with SSD, we used the same training settings as SSD. We first trained with lr = 10^-3 for 80k iterations, then with lr = 10^-4 for 20k iterations, and finally with lr = 10^-5 for another 20k iterations. The momentum was fixed at 0.9, the weight decay was set to 0.0005 and the batch size to 32, and the backbone of the model was initialized with pre-trained VGG16 weights.
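The step schedule above can be written as a small helper (plain Python; the function name is ours):

```python
def learning_rate(iteration: int) -> float:
    """Step learning-rate schedule from the training settings:
    1e-3 for the first 80k iterations, then 1e-4 for 20k,
    then 1e-5 for the final 20k (120k iterations in total)."""
    if iteration < 80_000:
        return 1e-3
    if iteration < 100_000:
        return 1e-4
    return 1e-5

assert learning_rate(0) == 1e-3
assert learning_rate(80_000) == 1e-4
assert learning_rate(100_000) == 1e-5
```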

SSD with Context Information Scene Perception Module
In Table 1, we compare the detection performance of SSD with and without the context information scene perception (CISP) module. For general object detection, the overall performance of the model reached 77.6% mAP after applying the CISP module to SSD, an improvement of 0.4% over the original SSD. The improvement is especially noticeable for samples in which the background resembles the objects: the original SSD fails to detect some of these objects because it cannot understand the context information. By using the CISP module to perceive and fuse local context information at different scales, the model can understand some complex scenes and separate the objects from the background.

SSD with Semantic Activation Block
In Table 2, we show the detection performance of SSD with and without the semantic activation block (SAB). For high-level, low-resolution feature maps, self-supervised adjustment of the channel weights to enhance useful feature information helps the model better distinguish objects from the background. From the table, we can see that the semantic activation module improves the performance of the model by 0.4%, which indicates its effectiveness. Although adding the semantic activation module increases the number of parameters and the amount of computation compared with the original SSD, the resulting overhead in running time is negligible.

Comparison of Methods
In Table 3, we compare the R-CNN, YOLO, and SSD methods on the VOC 2007 test dataset. Among the R-CNN-based algorithms, R-CNN [3] was the first to use a CNN for object detection. It has significant shortcomings in the selection of region proposals: the algorithm selects too many proposals, which requires a lot of memory, and the normalization of the network input loses much context information and many features, resulting in a localization accuracy of only 50.2%. To address the feature loss that R-CNN incurs during image normalization, Fast R-CNN [18] inputs the whole image into the network and extracts fixed-length feature vectors from the feature map through a region of interest (RoI) pooling layer, producing classification and coordinate information; this increased the accuracy to 70.0%. However, Fast R-CNN still does not solve the problems caused by the roughly 2000 region proposals generated by selective search. Therefore, the Faster R-CNN [19] algorithm proposes the RPN module, which utilizes 9 kinds of anchors with different scales and aspect ratios and completes object detection entirely within the CNN; its mAP reached 76.4%. The YOLO [21] algorithm uses an end-to-end network that removes the separate region-proposal step, combining region proposal with the object detection network. Due to its simple network structure, its detection speed is much higher than that of the R-CNN-based algorithms, but it imposes strong restrictions on the position and size of objects; its mAP is only 57.9%, and it performs especially poorly on small objects. SSD [4] uses separate low-level feature maps to improve the detection of small objects, but it suffers from insufficient semantic information.
The additional detection layers use downsampling to increase the receptive field, but the reduced resolution of the downsampled feature maps causes a large loss of feature information, and SSD's mAP on the testing dataset is only 77.2%. The mAP of CP-SSD on the testing dataset reached 77.8%, which is 0.6% higher than that of the original SSD.
In CP-SSD, we use the CISP module to fuse context information and prior information between different scales and different sub-regions of the feature map U. In this module, convolutions with different dilated rates are applied in parallel to capture objects of different sizes, which allows the model to understand local context information more comprehensively and alleviates SSD's lack of understanding of semantic scenes and context information. On the higher-level feature maps, we proposed the semantic activation module to enhance the semantic information. In this module, global average pooling removes the spatial information, and the relationship between channels and objects is learned in a self-supervised way, promoting useful feature information, restraining irrelevant feature information, and adjusting and enriching the semantic information. We also compared against a variant of SSD that uses ResNet101 instead of VGG16 as the backbone network. The network structure of ResNet101 is deeper than that of VGG16, and its feature extraction ability is stronger. Nevertheless, our proposed method with VGG16 (77.8%) performs better than the SSD model with ResNet101 (77.1%), which highlights the effectiveness of CP-SSD.

Detection Examples
In Figure 5, we visualize some of the images, comparing the localization results of CP-SSD with those of the original SSD. As shown in the upper two rows of Figure 5, SSD cannot locate the people on horseback or the boat, while CP-SSD can. CP-SSD uses the semantic activation module to capture more prior information before downsampling, so it can understand the image more accurately. In the lower two rows, the boats and buildings are similar in shape and color, and the color of the bird is similar to that of its surroundings. SSD cannot accurately detect the positions of the boats and the bird because it lacks an understanding of the scene. CP-SSD understands the contextual prior information more fully, so it can better distinguish the background from the detected objects and determine the locations of the boat and the bird through context information.

Conclusions
In this paper, we proposed a single-shot object detection method, CP-SSD, to alleviate the problem of insufficient understanding of contextual scene information in SSD. We introduced a context information scene perception module that captures contextual information at different scales through parallel dilated convolutions with different dilated rates, so as to improve the model's ability to understand the scene. Meanwhile, the semantic activation module was used to enrich the semantic information of the deep detection feature maps. We validated CP-SSD on the PASCAL VOC 2007 benchmark dataset. The experimental results showed that, compared with SSD, YOLO, Faster R-CNN and other methods, our proposed CP-SSD method performed better on the test set, with an mAP 0.6% higher than that of SSD. In future research, we will work on balancing global feature extraction and improving the accuracy of small object detection.
Author Contributions: Y.J. contributed towards the algorithms and the analysis. As the supervisor of Y.J., she proofread the paper several times and provided guidance throughout the whole preparation of the manuscript. T.P. and N.T. contributed towards the algorithms, the analysis, and the simulations and wrote the paper and critically revised the paper. All authors read and approved the final manuscript.