An Approach on Image Processing of Deep Learning Based on Improved SSD

Abstract: Compared with ordinary images, each remote sensing image contains many kinds of objects with large scale changes, providing more details. As typical objects in remote sensing images, ships play an essential role in the field of remote sensing, and ship detection has accordingly become an important task. With the rapid development of deep learning, remote sensing image detection methods based on convolutional neural networks (CNNs) have occupied a key position. In remote sensing images, objects are closely arranged, and small-scale objects account for a large proportion. In addition, the convolution layers in a CNN lack ample context information, leading to low detection accuracy in remote sensing image detection. To improve detection accuracy while keeping real-time detection speed, this paper proposes an efficient object detection algorithm for ship detection in remote sensing images based on improved SSD. Firstly, we add a feature fusion module to the shallow feature layers to refine the feature extraction ability for small objects. Then, we add a Squeeze-and-Excitation (SE) module to each feature layer, introducing an attention mechanism into the network. The experimental results on the Synthetic Aperture Radar ship detection dataset (SSDD) show that the mAP reaches 94.41% and the average detection speed is 31 FPS. Compared with SSD and other representative object detection algorithms, the improved algorithm achieves better detection accuracy and can realize real-time detection.


Introduction
Remote sensing images are records of all kinds of objects on the ground captured by artificial satellites. Over the past 50 years, space information technology, represented by satellite remote sensing technology, has provided an efficient means to obtain data over large areas. As a crucial research direction in the field of remote sensing, remote sensing image object detection technology is an intelligent data analysis method for the automatic identification and localization of remote sensing objects, which receives wide attention in civil and military fields [1]. With the rapid development of high-resolution technology, the quality of remote sensing images obtained by remote sensing satellites keeps improving, which is of great significance to resource exploration, urban traffic management, military object recognition, environmental monitoring and so on. As typical objects in remote sensing images, ships are important detection targets; ship detection plays a significant role in port traffic planning, marine safety management, maritime disaster relief and national defense security.
Due to their prominent capability of feature representation, deep learning algorithms have begun to replace traditional machine learning algorithms. At present, deep learning has been applied to many fields, including object detection [2], driverless cars [3], machine translation [4], emotion recognition [5] and speech recognition [6]. Particularly, in the field of object detection, a variety of deep learning-based object detection algorithms aiming to solve practical problems have emerged, applied to fall detection [7], medical image segmentation [8], defect detection [9], face recognition [10], remote sensing object detection [11] and so on. Remote sensing technology has developed rapidly in recent years, tremendously increasing the quality and quantity of remote sensing images captured by remote sensing devices. As an important application in the field of object detection, the massive growth of data promotes the rapid deployment of deep learning-based remote sensing object detection algorithms.
Ship detection in remote sensing images is an extraordinarily challenging task. Firstly, remote sensing images have high resolution and contain many object categories whose scales vary greatly; the huge consumption of time and memory imposes strict requirements on both algorithm and hardware. Secondly, a convolutional neural network (CNN) is a multilayer structure consisting of several convolutional layers, and as network depth increases, the feature information of small objects becomes less abundant. Remote sensing images commonly contain a large number of small objects, leading to poor detection performance for them. Lastly, the extensive complex backgrounds in remote sensing images cause false positives. As shown in Figure 1 [12], the ships in remote sensing images are small and are sometimes surrounded by complex backgrounds. Considering the above problems, improving the accuracy of ship detection is a major challenge. This paper proposes an efficient ship detection algorithm for SAR remote sensing images based on improved SSD. The main innovations of this algorithm include the following two points: (1) We add a feature pyramid network to SSD, introducing context information to Conv4_3 and Conv7, which are large-scale feature maps responsible for the detection of small objects. We use bilinear interpolation as the upsampling method and element-wise sum as the fusion method, designing a feature fusion module to improve detection accuracy. (2) In order to further improve the feature extraction ability, strengthening significant channel-wise features and suppressing insignificant ones, an SE module is added, enabling the model to perform dynamic channel-wise feature recalibration.
The rest of this paper is organized as follows. Section 2 gives a brief review of some typical object detection algorithms as well as research status for remote sensing ship detection. Section 3 details our improved method for ship detection. Section 4 introduces the experimental results and analyzes them in detail. Section 5 describes the conclusion of this paper and shows the highlights of this paper.

Related Work
The task of remote sensing object detection is to locate the researched objects (e.g., aircraft, vehicles, storage tanks, houses, ships, playgrounds, flyovers, etc.) in aerial or satellite images. As an application of data analysis technology, remote sensing object detection depends to a great extent on all kinds of remote sensing data. Remote sensing data comes in large volumes, which makes efficient data processing and analysis extremely hard for previous machine learning models such as support vector machines (SVM), decision trees, naive Bayesian classification, logistic regression, clustering algorithms and so on. The past few years witnessed the development and decline of traditional object detection algorithms based on classical machine learning models. Traditional object detection algorithms mainly include the Viola-Jones (V-J) object detection algorithm [13], Support Vector Machines (SVM) using Histogram of Oriented Gradients (HOG) features [14], the Deformable Parts Model (DPM) algorithm [15] and so on. Traditional remote sensing object detection algorithms are based on the sliding window strategy to perform region selection, which requires manual feature extraction and selection, including Haar-like features and HOG features. As the extracted features need to be designed manually, and the selection of region proposals has no pertinence, the time complexity and space complexity of traditional object detection are very high. With the massive growth of remote sensing images in quantity and quality, huge computational costs leave traditional object detection algorithms unable to meet the requirements of practical application.
With the development of computer technology, especially parallel computing technology, deep learning-based object detection algorithms are playing a dominant role in the field of remote sensing object detection. Deep learning models are effective tools for large-scale computation, which can accurately capture the nonlinear relationships in data by means of multilayer learning, and have gradually become the preferred method for remote sensing image recognition and analysis. Deep learning-based object detection algorithms have a strong ability of feature expression for processing remote sensing images and can be divided into one-stage and two-stage categories according to whether region proposals are generated. There are two kinds of typical one-stage algorithms: You Only Look Once (YOLO) [16][17][18][19] and the Single Shot MultiBox Detector (SSD) [20][21][22][23][24]. Two-stage object detection algorithms mainly include R-CNN [25], SPP-Net [26], Fast R-CNN [27], Faster R-CNN [28], HyperNet [29], R-FCN [30], MS-CNN [31] and Mask R-CNN [32]. As a series of region proposals is generated, two-stage object detection algorithms perform detection more precisely but have a slower detection speed.
In order to improve the practical application effect of analyzing and processing remote sensing images, many researchers have focused on the shortcomings of current deep learning-based object detection algorithms, making structural refinements according to the characteristics of remote sensing technology. X. Nie et al. improved Mask R-CNN, proposing a ship detection method that adds channel-wise and spatial attention mechanisms as well as a bottom-up architecture to improve detection accuracy [33]. Sun X et al. proposed a ship detection model based on YOLO using rotated object detection technology. After refining the rotation matrix, this method restructures the loss function as well as the rotated IoU calculation formula. Then, through lightweight processing, the model partly reduces the redundant parameters introduced by the augmented dimensions of the output feature maps [34]. J. Qu et al. proposed DFSSD to increase detection accuracy for small objects in remote sensing images [35]. Different from the original SSD, this algorithm discards the random clipping procedures of the data preprocessing layers and adds a feature pyramid network (FPN) to enhance the information of the low-level feature maps. Besides, the regular convolution in the third-level feature map is replaced with dilated convolution to extend the receptive field. Yin R et al. proposed AF-SSD [36]. This model utilizes MobileNet as the network backbone and designs a light encoding-decoding module to enhance the information of low-level features. Meanwhile, a cascade architecture including spatial and channel attention modules is added to increase detection accuracy for objects with low contrast and few textures.
In the field of ship detection, many ship datasets come from Synthetic Aperture Radar (SAR) imagery, which has been regarded as a significant data collection method for monitoring maritime activities. Zhang T et al. proposed a ship detection model based on a grid convolutional neural network (G-CNN) [37]. This model divides the SAR image into several grid cells and detects ship objects on feature maps at three different scales. Wei S et al. proposed a ship detection method based on a high-resolution ship detection network (HR-SDNet) for high-resolution SAR imagery [38]. This method connects high-to-low resolution subnetworks in parallel to maintain high resolution and utilizes Soft-NMS to improve the detection performance for dense ships. Chen C et al. proposed a ship detection network combined with an attention mechanism and introduced a loss function incorporating the generalized intersection over union (GIoU) loss to reduce the scale sensitivity of the network [39].
Given the large proportion of small objects and the complex backgrounds in remote sensing images containing ships, it is necessary to improve the accuracy of remote sensing object detection. The above research optimizes algorithms from the perspective of improving accuracy in detecting small objects. Moreover, while improving accuracy, it should be guaranteed that the loss of speed and the increase in memory are not too large, considering applications on embedded devices. There remains plenty of scope for improving the accuracy of remote sensing object detection models. Based on the SSD algorithm, which has an excellent balance of accuracy and speed, and drawing on FPN [40], attention mechanisms in computer vision [41][42][43][44] and other relevant literature, we propose an efficient object detection algorithm for remote sensing images.

SSD
As a one-stage algorithm, the Single Shot MultiBox Detector (SSD) directly performs prediction using a convolutional neural network (CNN). CNNs are composed of many convolution layers. Related research has proved that as network depth increases, the information extracted from convolution layers becomes more and more abstract. Utilizing the characteristic that convolution layers of different depths extract different information features, SSD has achieved great results in object detection.

Multiscale Prediction
SSD is composed of VGG and extra feature layers. As shown in Figure 2, there are six feature layers that perform prediction, of which Conv4_3 is from the Visual Geometry Group Network (VGG), and the remaining five feature layers are added on the base of VGG. During prediction, each feature map produces default boxes with different scales. Figure 3b,c show feature maps of different scales. Each feature map is divided into several grids; (b) and (c) are divided into 64 and 16 grids, respectively. While predicting, each grid generates several default boxes with different scales. As the proportion a default box accounts for in (b) is much smaller than in (c), the default boxes in (b) can easily cover small-scale objects. As network depth increases, the scale of the feature maps gradually becomes smaller. As a result, shallow feature maps can predict small objects precisely, and deep feature maps can predict large objects precisely, which is called multiscale prediction in SSD. The default boxes further adjust their scales according to the loss function to generate the output prediction boxes.
Figure 3. The default boxes in the feature map [20].
The scale of priors increases linearly with the depth of network layers, varying from 0.2 to 0.9. The default box scale of each feature map is calculated according to Equation (1); the parameters are shown in Table 1:

s_k = s_min + ((s_max − s_min)/(m − 1))(k − 1), k ∈ [1, m] (1)

The aspect ratio of the default boxes is set to 1, 2, 3, 1/2, 1/3. The height and width of each default box can be calculated according to Equation (2):

w_k^a = s_k √a_r, h_k^a = s_k/√a_r (2)
where s_k means the default box scale of the k-th feature map, a_r means the aspect ratio of the default boxes, w_k^a means the default box width of the k-th feature map when the aspect ratio is a_r, and h_k^a means the corresponding default box height. Particularly, when a_r = 1, an extra square default box is added, whose scale is s'_k = √(s_k s_{k+1}). The output value consists of category confidences and bounding box position (height, width and center point coordinates); across the six feature maps, 8732 bounding boxes are eventually generated.
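As a sanity check on Equations (1) and (2), the scales and the total box count for the standard SSD300 configuration (feature map sizes 38, 19, 10, 5, 3, 1 with 4, 6, 6, 6, 4, 4 boxes per cell, as in the SSD paper [20]) can be computed with a short NumPy sketch; the helper names are illustrative, not from the original implementation:

```python
import numpy as np

def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    """Equation (1): scales grow linearly from s_min to s_max over m feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def box_width_height(s_k, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """Equation (2): w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)."""
    return [(s_k * np.sqrt(a), s_k / np.sqrt(a)) for a in aspect_ratios]

# Total default boxes for SSD300: feature map sizes and boxes per grid cell
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_cell = [4, 6, 6, 6, 4, 4]
total = sum(f * f * b for f, b in zip(feature_map_sizes, boxes_per_cell))
print(total)  # 8732
```

The first scale evaluates to 0.2 and the last to 0.9, matching the stated range, and the per-cell box counts sum to the 8732 boxes mentioned above.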

Loss Function
The loss function of the SSD algorithm is composed of confidence loss and localization loss, and is defined as follows:

L(x, c, l, g) = (1/N)(L_conf(x, c) + α L_loc(x, l, g))

where N refers to the number of positive samples among all default boxes, x_ij^p ∈ {0, 1}, x_ij^p = 1 means that the i-th default box matches the j-th ground truth box of category p, c refers to the prediction of category confidence, l refers to the prediction of bounding box position, g means the location parameters of the ground truth, and d means the coordinates of the default box.
The position error function is defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos}^N Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1(l_i^m − ĝ_j^m)

The confidence error function, a softmax loss over the class confidences, is denoted as follows:

L_conf(x, c) = −Σ_{i∈Pos}^N x_ij^p log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p)/Σ_p exp(c_i^p)
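For illustration, the loss above can be sketched in NumPy. This is a minimal, unbatched version under simplifying assumptions (hard negative mining and offset encoding omitted), not the training code used in the paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def localization_loss(pred_offsets, gt_offsets):
    """Sum of Smooth L1 over the (cx, cy, w, h) offsets of matched positive boxes."""
    return smooth_l1(pred_offsets - gt_offsets).sum()

def confidence_loss(logits, labels):
    """Softmax cross-entropy over class scores; labels are class indices (0 = background)."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).sum()

def ssd_loss(pred_offsets, gt_offsets, logits, labels, num_pos, alpha=1.0):
    """L = (1/N) * (L_conf + alpha * L_loc), with N the number of positive samples."""
    return (confidence_loss(logits, labels)
            + alpha * localization_loss(pred_offsets, gt_offsets)) / num_pos
```

The `alpha` weight balancing the two terms follows the SSD paper's default of 1.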

Improved SSD
Based on SSD, we make several improvements, including introducing an attention mechanism and a feature fusion module. The overall improved structure is shown in Figure 4. Firstly, we add a feature pyramid network to SSD, introducing context information to Conv4_3 and Conv7, which are large-scale feature maps responsible for the detection of small objects. Then, in order to improve the feature extraction ability, strengthening significant channel-wise features and suppressing insignificant ones, an SE module is added.

Feature Fusion Module
In SSD, there are six different feature maps that generate default boxes and perform prediction, among which the low-level feature maps are better at locating small targets and the high-level feature maps are better at locating large targets. As the network deepens, the feature maps become abundant in semantic characteristics but deficient in spatial characteristics.
Based on the characteristics of multiscale training in SSD, it is crucial to connect the information of feature maps at different levels. The low-level feature maps lack semantic information, so the feature information of high-level feature maps can be introduced into them by means of upsampling. Thus, to enhance the semantic feature extraction capability, some context information should be introduced into the low-level feature maps. As shown in Figure 5, FPN is composed of three parts: a bottom-up part, a top-down part and a connection part. The bottom-up part is the forward propagation of the CNN, during which semantic information is gradually enhanced. The top-down part is the upsampling process; the size of the upper feature map is doubled by bilinear interpolation. In order to solve the problem that the channels of connected feature maps differ, a 1 × 1 convolution ensures that the channels of the different feature maps are consistent before the connection with the upper feature map. Then, the two feature maps with the same size and channels are connected by element-wise sum. Based on the research of FPN and FSSD, we design a feature fusion module to enhance the ability of small object feature extraction. In this module, each feature map is connected together with the upper feature map. The size of the upper feature map is doubled by upsampling, for which the common methods include bilinear interpolation and deconvolution. There are two common connection methods, concatenation and element-wise sum. According to the experimental results in FSSD [24], we choose bilinear interpolation as the upsampling method and element-wise sum as the fusion method.
Following the formulation of FSSD [24], the fusion process can be written as:

X_f = Φ_f(T_1(X_1), T_2(X_2), …, T_n(X_n)), {P_i} = Φ_p(X_f)

where X_i means the different source feature maps, T_i means the transformation function applied to each source feature map before fusion, Φ_f is the feature fusion function, and Φ_p is the function used to generate pyramid features. We add this feature fusion module to Conv4_3 and Conv7. As the information of Conv8_2 will be introduced into Conv7, Conv8_2 is also pretreated. During pretreatment, the channels of each source layer are changed by a 1 × 1 convolution. Then, bilinear interpolation is used to resize the different feature maps to the same size. Particularly, in order to reduce the amount of calculation, we reduce the channels of Conv7 from 1024 to 512. In the process of feature fusion, different feature maps have different characteristics; as a result, the element-wise sum of feature maps causes an aliasing effect, reducing detection accuracy. As shown in Figure 6, we propose an aliasing effect reduction module to remove the extra information.
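The fusion pipeline described above (a 1 × 1 convolution to align channels, 2× bilinear upsampling, then element-wise sum) can be illustrated with a minimal NumPy sketch. The array shapes and the align-corners style of interpolation are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def bilinear_upsample2x(x):
    """Double the H and W of x (C, H, W) by bilinear interpolation (align-corners style)."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]; wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse(low, high, w_high):
    """Project the deeper map to low's channel count, upsample 2x, then element-wise sum."""
    return low + bilinear_upsample2x(conv1x1(high, w_high))
```

For example, fusing a deeper 8-channel 4 × 4 map into a shallow 4-channel 8 × 8 map yields a 4-channel 8 × 8 output, matching the shallow layer's shape.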

SE Module
In the field of computer vision, an attention mechanism is a weight allocation mechanism: weights that are originally allocated evenly are redistributed according to the importance of each feature, with important features given larger weights and unimportant features given smaller ones. As a kind of attention mechanism, the Squeeze-and-Excitation Network (SENet) judges the importance of each channel by considering the relationships between channels. The architecture of the SE module is shown in Figure 7. F_tr is a transfer function mapping the input X to the feature maps U (U ∈ R^{H×W×C}). The SE module is added to recalibrate the features. A squeeze operation is first performed on the features U, generating a channel descriptor by squeezing the spatial dimensions (H × W) to 1 × 1. The descriptor can reweight the distribution of channel-wise features. Then an excitation operation is performed, producing a weight vector of dimension 1 × 1 × C. The weights in this vector are applied to the feature maps U to recalibrate the features. Specifically, as shown in Figure 7, the SE module is composed of three parts: a squeeze part, an excitation part and a reweight part.
The squeeze part is a global average pooling, squeezing the size of each feature map to 1 × 1. The squeeze part can be described as follows:

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)

where (i, j) are the abscissa and ordinate of the feature map (H × W), and z_c is a one-dimensional array. The excitation part regenerates the weight of each channel according to the parameters W. The excitation part can be described as follows:

s = F_ex(z, W) = σ(W_2 δ(W_1 z))

where W_1 and W_2 are the weight matrices of the two fully connected layers, respectively, δ refers to the ReLU function, and σ refers to the sigmoid function. s is the weight coefficient of the different channels, whose dimension is C × 1 × 1.
In the reweight operation, s is regarded as the importance of each feature map channel, and the weight of each original channel changes according to the following equation:

x̃_c = F_scale(u_c, s_c) = s_c · u_c

where F_scale(u_c, s_c) refers to channel-wise multiplication between the scalar s_c and the feature map u_c. We added SE modules to all six feature maps, enhancing the weights of contributing channels and suppressing invalid features to improve detection accuracy.
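The three SE parts above (squeeze, excitation, reweight) map to a few lines of NumPy; the weight shapes and reduction ratio here are illustrative assumptions, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(u, w1, w2):
    """Squeeze-and-Excitation on u with shape (C, H, W).
    w1: (C/r, C) and w2: (C, C/r) are the two fully connected layers (reduction ratio r)."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = u.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving per-channel weights s
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))
    # Reweight: channel-wise multiplication of u by s
    return u * s[:, None, None]
```

The output keeps the input's shape; only the per-channel magnitudes are rescaled by the learned weights s.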

Experiment
We carry out comparative experiments on the SSDD dataset to test the effectiveness of our model.

Dataset
We use the SAR ship detection dataset (SSDD) as our experimental dataset. SSDD is the first dataset for ship detection in SAR images. This dataset was obtained by downloading public SAR images from the Internet and clipping the object areas to about 500 × 500 pixels. The data, with resolutions of 1 m-15 m, were captured by RadarSat-2, TerraSAR-X and Sentinel-1 sensors. The background environments include sea areas and coastal areas. Table 2 shows the statistics of the number of ships per image in SSDD: NoS means the number of ships, NoI means the number of images; there are 1160 images and 2456 ships, with 2.12 ships per image on average.

Evaluation Index
There are some evaluation indexes while experimenting. The intersection over union (IoU) measures the overlap degree of two regions and is calculated by Equation (15).

IoU = area of overlap / area of union (15)

In object detection, there is a threshold to judge whether a prediction is correct. When the IoU between a detection box and the ground truth is greater than the threshold (α), the detection box is a true positive (TP); when it is less than the threshold, it is a false positive (FP). A false negative (FN) means the model predicts that there is no object although the image actually contains one. In this way, a confusion matrix is constructed as in Table 3. Then, precision and recall can be defined as follows:

Precision = TP/(TP + FP), Recall = TP/(TP + FN)

It is impossible for precision and recall to reach their highest values at the same time. By changing the threshold, taking recall as the horizontal axis and precision as the vertical axis, the P-R (precision-recall) curve can be obtained. According to the P-R curve, AP (Average Precision) and mAP (mean Average Precision) can be calculated, which are more convincing metrics for evaluating the model. Their calculations are as follows:

AP = ∫₀¹ P(R) dR, mAP = (1/N) Σ_{i=1}^{N} AP_i

where N means the number of all classes. In our experiment, as the only category is ship, mAP is equal to AP.

Experimental Results
We set the training batch size to 16, the total number of training epochs to 200 and the initial learning rate to 0.001. At the 100-th and 150-th epochs, the learning rate decays to 0.0001 and 0.00001, respectively. Our experimental environment is shown in Table 4. Some detection results are shown in Figure 8. Figure 8a represents detection results in a complex background, and Figure 8b represents detection results under a multiobject condition. Figure 9 is the P-R curve of detection and shows that the improved model's mAP reaches 94.41%.
In order to concretely display the results comparison of the improved model and the original SSD, some comparisons of detection boxes are shown in Figure 10. In Figure 10, the left side is our model's result, and the right side is SSD's result. As we can see, the prediction box score of our model is higher than that of SSD. Our improved model has an excellent performance under condition of multiobject and complex background.
We also compare the detection results of our model with other state-of-the-art object detection algorithms. The parameters for comparison include mAP, memory and detection speed. The comparison of experimental results is shown in Table 5 and Figure 11. As Faster R-CNN belongs to the two-stage algorithms, it has the highest detection accuracy among these models and the lowest detection speed, which makes it hard to apply in practice. Our model has the second highest detection accuracy, which is 2.34% and 0.41% higher than SSD and YOLOv4, respectively. As we add the feature fusion module and the SE module, the model size and the number of floating point operations (FLOPs) increase. As a result, our model's memory is 174 MB and its detection speed is 31 FPS, which still meets the requirements of real-time detection.

Conclusions
Aiming at the problem that the accuracy of ship detection still has room for further improvement, we proposed an efficient ship detection algorithm for SAR remote sensing images based on improved SSD. Firstly, we add a feature pyramid network to SSD, introducing context information to Conv4_3 and Conv7, which are large-scale feature maps responsible for the detection of small objects. Then, in order to further improve the feature extraction ability, strengthening significant channel-wise features and suppressing insignificant ones, an SE module is added, enabling the model to perform dynamic channel-wise feature recalibration. The experimental results on the SSDD dataset show that our improved model achieves excellent accuracy compared to other state-of-the-art object detection algorithms. Meanwhile, the detection speed of our model is 31 FPS, which meets the requirement of real-time detection.
Author Contributions: G.L. contributed to the conception of the study. L.J. performed the experiment and the analysis with constructive discussions; L.J. and G.L. performed the data analyses and wrote the manuscript. All authors have read and agreed to the published version of the manuscript.