H-YOLO: A Single-Shot Ship Detection Approach Based on Region of Interest Preselected Network

Ship detection from high-resolution optical satellite images is still an important task that deserves optimal solutions. This paper introduces a novel high-resolution image network-based approach based on the preselection of a region of interest (RoI). This preselection network first identifies and extracts a region of interest from input images. In order to efficiently match ship candidates, the principle of our approach is to distinguish suspected areas from the images based on hue, saturation, value (HSV) differences between ships and the background. The whole approach is evaluated in an experiment with a large ship dataset, consisting of Google Earth images and the HRSC2016 dataset. The experiment shows that the H-YOLO network, which uses the same weights trained on a set of remote sensing images, has a 19.01% higher recognition rate and a 16.19% higher accuracy than applying the you only look once (YOLO) network alone. After image preprocessing, the value of the intersection over union (IoU) is also greatly improved.


Introduction
Over the past few years, the object detection domain has rapidly improved, opening many valuable opportunities to detect ships in maritime environments. Remote ship detection is of value for many applications, such as harbor surveillance, traffic monitoring, fishery management, and pollution monitoring, to mention a few examples. The primary image materials range from optical remote sensing images [1][2][3][4][5][6] to synthetic aperture radar (SAR) images [7][8][9][10][11][12][13][14]. In [2], a novel approach for ship detection from spaceborne optical images (SDSOI) is adopted to remove most false alarms. In addition to eliminating interferences, attention has also been given to multi-target detection [6]. With early low-resolution remote sensing images, previous studies mostly considered ships as point targets and then applied a series of methods such as constant false alarm rate (CFAR) detection [15], the generalized likelihood ratio test (GLRT), template matching [16], and other methods [17][18][19][20], but overall ships were mostly approximated. With the recent development of high-resolution remote sensing images, additional target details and background information are now available. In particular, deep learning methods [21][22][23][24][25][26][27][28][29][30][31] bring new opportunities for target detection in high-resolution remote sensing images and are likely to produce much more robust ship target identification.

1.
We introduce a novel vessel remote sensing image classification network, the so-called HSV-YOLO network, consisting of two essential components: an HSV-operation module and a one-stage detection module. To the best of our knowledge, it is the first time that the difference of HSV color space is used as a filter to extract the useful RoIs to reduce detection calculation time.

2.
We designed an HSV-module, which consists of four crucial cores: a background removal operation, a noise removal operation, a box-finding operation, and a noise deletion operation. After these four steps, one can obtain valuable RoIs instead of noisy RoIs within a reasonable computing time.

3.
We designed a pipeline to deal with the outcomes of the HSV-operation module, containing three common situations when processing the images.
The rest of the paper is organized as follows. Section 2 describes the components of our proposed network. The network setup, experimental results, and the analysis of the results are provided in Section 3. Finally, Section 4 concludes the paper and draws a few perspectives for further work.

HSV-YOLOv3
Object detection algorithms can be divided into one-stage and two-stage algorithms. One-stage algorithms are computationally more efficient than two-stage algorithms, while two-stage algorithms can achieve higher accuracy. Two-stage algorithms first get feature maps through a CNN, then send them into the region proposal network to select appropriate RoIs. The selected RoIs are next resized to the same size and classified by the detection module.
Compared with the two-stage target detection algorithms represented by Faster R-CNN, one-stage algorithms directly provide category and location information through the backbone network instead of using a region proposal network (RPN). Therefore, the speed of recognition is much improved. At the same time, the accuracy remains acceptable compared to two-stage algorithms, which satisfies the real-time requirements of mounted and unmanned devices. Taking you only look once version 3 (YOLOv3) as an example, it abandons the use of the RPN, extracts features through a backbone network, and then directly performs regional regression and target classification. In this way, the total detection time is much shorter than that of a two-stage network.
Figure 1 shows the whole architecture of the HSV-YOLO model, which includes an HSV operation module, a YOLO network, and a pipeline. The main role of the HSV operation module is to extract the regions of interest from the input images. The HSV operation module consists of four steps: background removal, noise removal, target selection, and noise contrast deletion. The YOLO network in Figure 1 is a one-stage object detection network that can be used to detect ships from the images delivered by the pipeline. The red, green, and blue lines in Figure 1 represent the processing paths of the three output cases, and the case switching algorithm is shown in Algorithm 1.
The whole workflow of the proposed method illustrated in Figure 1 is described as follows. First, the input images are sent towards the HSV operation module in order to generate the regions of interest and obtain S, the total number of regions of interest. The workflow then switches towards case 1, which is denoted by the red line in Figure 1. Secondly, the RoIs generated by the HSV operation module are sent towards the detection network to obtain the number of unidentified RoIs N.
Thirdly, we set the parameter of the unrecognized rate k and compare the value of N divided by S with k. If N divided by S is larger than k, the workflow will switch towards case 2. Otherwise, it will switch towards case 3. The respective detection workflows of case 1, 2, and 3 are further described in Section 2.4.

Algorithm 1 Case Switching Procedure in H-YOLO
Initialize: Set total number of identified regions of interest S > 0, upper limit T, number of unidentified regions of interest N, and switching value k;
Input: Testing image Itest;
…
if N/S < k then
6: Coordinatei, Labeli = Case1(Itest)
7: else
8: Coordinatei, Labeli = Case2(Itest)
9: end
10: end

For these three types of ships, the apparent characteristics can also be seen through the edge detection diagram. For bulk carriers, rectangular boxes can be seen arranged regularly in the middle of the ship. The rectangular boxes on a container ship's body differ in size from each other, and their positions are less regular than on a bulk carrier. The number of rectangular boxes extracted from a tanker's remote sensing image is lower than for bulk carriers and container ships. As mentioned above, these features make it possible for each neuron to fit each feature better when training the network. The edge detection diagram based on the difference of each vessel feature is shown in Figure 2.

HSV Processing
There are two common color spaces to describe an object. The first one is the RGB color space, which is widely used as a standard display system. Despite the convenience for a computer to display such images, RGB color spaces have a series of drawbacks. Images captured in naturally occurring conditions or environments are prone to be affected by natural lighting intensity. For instance, it is hard for an RGB color space to describe continuous colors in this kind of situation. However, the hue, saturation, value (HSV) color space is more suitable to describe such configurations. HSV provides a color space based on the color's intuitive characteristics, also known as the hexcone model, as shown in Figure 3. In fact, the HSV color space is often more appropriate to describe the color distribution of most remote sensing images.
Remote Sens. 2020, 12, 4192
The vessel's hue and value channels are quite different from the background values of a remote sensing picture in the ocean. Figure 4 shows an example of the different HSV channel images of some ships in the ocean. It shows that the targets are relatively prominent in the value channel image, while the noisy information of the original image is removed in the hue and saturation channels.
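To make the channel separation concrete, the following minimal sketch converts individual RGB pixels to HSV with Python's standard `colorsys` module. The 8-bit scaling (H in 0 to 179, S and V in 0 to 255) is an assumption chosen because the background values quoted later in the paper are compatible with it; the paper does not state which convention it uses. The two example pixel colors are hypothetical and are not taken from the dataset.

```python
import colorsys


def rgb_to_hsv8(r, g, b):
    """Convert an 8-bit RGB pixel to HSV.

    Assumed output scaling (not stated in the paper):
    H in [0, 179], S in [0, 255], V in [0, 255].
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 180) % 180, round(s * 255), round(v * 255)


# Hypothetical pixels: a dark blue sea pixel and a bright grey hull pixel.
sea_pixel = rgb_to_hsv8(20, 60, 90)
hull_pixel = rgb_to_hsv8(200, 200, 190)

# The value channel alone already separates the two pixels.
print("sea  HSV:", sea_pixel)
print("hull HSV:", hull_pixel)
```

In this toy case the hull pixel has a much larger value component than the sea pixel, which is the kind of difference the HSV operation module relies on.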

Modeling Approach
According to the real-time requirement of our application scenario, we selected the YOLO framework to be optimized. The objective is to combine the characteristics of the detection target with an improved YOLO algorithm to obtain a new H-YOLO algorithm. The HSV operation module extracts candidate regions from the remote sensing image, based on HSV differences, before they are sent to the detection step. Firstly, the algorithm suppresses the background area using HSV characteristics, as shown in Figure 5. Secondly, the noise removal module eliminates the interferences and applies thresholds to enhance the picture's contrast. In the last step, the frame is recognized to extract the processing object. Using the objects extracted in the last step significantly reduces the input image scale, which also improves the convolution fitting operation. After the prediction is obtained, the label is transmitted back to identify the extraction area.

Adaptive RoI Extraction
The extraction of RoIs based on HSV differences requires the HSV value of the background so that the background can be suppressed. As shown in Figure 6, each ship's background in the remote sensing image is related to its surrounding sea depth, location, and weather conditions. The average HSV of the background in Figure 6a is H: 101, S: 136, V: 80, while the average HSV of the background in Figure 6b is H: 79, S: 84, V: 109. Although the background is different in each figure, it can still be observed that almost all HSV values in the background around the ship are similar. In order to achieve adaptive region of interest (RoI) extraction based on HSV differences, a certain number of pixels Np are randomly sampled from the image, and the average value of these pixels is used to estimate the background HSV values. An interval extending above and below this estimate is then defined so that it contains the HSV value of the background. A pixel is set to black if its HSV value lies in the interval; otherwise, it is set to white, as shown in the middle column of Figure 7.
Figure 7 shows the background suppression performance using different backgrounds affected by the weather conditions and sea depth. It can be observed that different backgrounds have been removed from the results of background suppression in the middle column of Figure 7. The left column in Figure 7 shows the original pictures, and the middle column shows the images obtained after removing the backgrounds. It is easier to locate the targets in the images without backgrounds than in the images that contain a background. When the image contains a large number of ships, harbor information, or other interference caused by wind or waves, it becomes difficult to remove the image background, so the number of possible regions of interest is relatively high. Case 1 and case 3 described in Section 2.4 were developed for this situation. When the total number of RoIs and the unrecognition ratio N/S increase, this triggers the process of sending original images into the detection network.
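The sampling and thresholding steps described above can be sketched as follows. This is only an illustration under stated assumptions: the image is a plain list of rows of (H, S, V) tuples, the per-channel interval half-width `margin` is a hypothetical parameter (the paper does not give its width), and hue is treated as a linear rather than circular quantity for simplicity.

```python
import random


def estimate_background(hsv_image, n_p, rng):
    """Average the HSV values of n_p randomly sampled pixels (Np in the text)."""
    height, width = len(hsv_image), len(hsv_image[0])
    samples = [
        hsv_image[rng.randrange(height)][rng.randrange(width)]
        for _ in range(n_p)
    ]
    return tuple(sum(p[c] for p in samples) / n_p for c in range(3))


def suppress_background(hsv_image, background, margin):
    """Return a binary mask: 0 (black) inside the background interval, 255 (white) outside.

    `margin` is an assumed per-channel half-width of the interval.
    """
    return [
        [
            0 if all(abs(p[c] - background[c]) <= margin[c] for c in range(3)) else 255
            for p in row
        ]
        for row in hsv_image
    ]


# Toy 4 x 6 image: uniform sea (values from Figure 6a) with one bright pixel.
sea = (101, 136, 80)
image = [[sea for _ in range(6)] for _ in range(4)]
image[1][2] = (20, 30, 220)  # hypothetical ship pixel

background = estimate_background(image, n_p=10, rng=random.Random(0))
mask = suppress_background(image, background, margin=(40, 60, 60))
```

Because the sampled pixels may occasionally include ship pixels, the estimate is only approximate; a wider margin or a median instead of a mean would be possible design choices, but neither is specified in the paper.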

Reference Correction
The above method can suppress and eliminate salt-and-pepper noise points. However, it is still difficult for the noise removal module to deal with relatively large noise areas, such as the wakes left behind by sailing boats (as shown in box (2) of Figure 8d). In addition, when there are strong winds and large waves in the sea area where the hull is located, the amplitude of HSV fluctuations in the sea area around the hull will be large, thus causing the output image of noise removal to be unacceptable.
To solve this issue, additional statistical analysis of residual noise and ship boxes was performed. Figure 9 is created according to Tables A1 and A2 in Appendix A, and it shows the comparison of the distribution between noise and ship boxes. The two box characteristics can then be analyzed from the joint distribution map. It appears that the noise boxes are mainly distributed between 0 and 70 pixels. The joint distribution map of the noise boxes lies along the 45° line, thus indicating that a residual noise frame is more likely to appear in the form of a square frame. However, the dimension of the ships is mainly distributed above 70 pixels, and the orientation of the targets causes the triangle shape. From the above analysis, the residual noise boxes' characteristics are summarized as a square frame with dimension distributed between 0 and 70 pixels; the characteristics of the ships' boxes are summarized as a rectangular frame with dimension distributed above 70 pixels.
After analyzing the characteristics of residual noise frames, such frames are removed. It can be observed from Figure 10 that the noise frames generated by the wind and waves are accurately removed, leaving only the boxes of the vessels. For example, the noise frames in Figure 10a,b are totally removed. The total numbers of removed noise boxes in these two figures are 1 box and 54 boxes, respectively. As shown in Figure 10c, 79 noise frames were removed, and only one noise frame is left. For this kind of noise frame containing objects different from ships, case 1 and case 3 in Section 2.4 were developed to solve this issue.
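A possible way to turn these two box characteristics into a filter is sketched below. The 70-pixel bound comes from the analysis above; combining it with a near-square test, and the tolerance value used for that test, are assumptions made for illustration and are not the paper's stated rule. Boxes are assumed to be (x, y, w, h) tuples.

```python
def is_residual_noise(width, height, max_side=70, square_tolerance=0.3):
    """Heuristic noise test: small (both sides <= max_side) and roughly square.

    `square_tolerance` is a hypothetical parameter: a box counts as square
    when min(side) / max(side) >= 1 - square_tolerance.
    """
    if width <= 0 or height <= 0:
        raise ValueError("box sides must be positive")
    small = max(width, height) <= max_side
    aspect = min(width, height) / max(width, height)
    return small and aspect >= 1.0 - square_tolerance


def remove_noise_boxes(boxes, max_side=70, square_tolerance=0.3):
    """Keep only the boxes that do not look like residual noise."""
    return [
        box for box in boxes
        if not is_residual_noise(box[2], box[3], max_side, square_tolerance)
    ]


# Hypothetical boxes: one small square (noise-like) and two elongated ones.
candidates = [(10, 10, 20, 22), (50, 40, 180, 60), (5, 5, 30, 90)]
kept = remove_noise_boxes(candidates)
```

A stricter variant could drop every box whose sides are both below 70 pixels regardless of shape; which rule the authors applied cannot be determined from the text.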

Anti-Distortion Operation
Because vessels appear at different orientations in remote sensing images, as shown in Figure 11, the side lengths of the boxes selected by the HSV-difference RoI extraction differ from each other. When the vessel's angle is far from 45°, the aspect ratio is far from 1.

Figure 11. Different angles of the hull lead to different aspect ratios of the frame: (a) with aspect ratio greater than one, (b) where aspect ratio is smaller than one and (c) where aspect ratio is close to one.
If the selected box is sent into the YOLOv3 network without further processing, it must be resized to a square input image of 416 × 416 pixels. This operation squeezes the features and makes it harder for the network to recognize and classify them. The orientation of ships determines the width and height of the region of interest extracted by the HSV operation module. When the image information in the region of interest is squeezed, the pixel information containing ships differs from the pixel information of the training data. Thus, the squeezing of the region of interest has an impact on ship detection. Intuitively, this can be explained as shown in Figure 12; picture features with an aspect ratio close to 1 are less likely to be squeezed.
A solution can be obtained from the above analysis: performing an anti-extrusion treatment on the detection frame before sending it into detection, so that the characteristics are better retained. Firstly, the coordinates of the upper left corner (x, y) and the height and width values (h, w) are obtained from the RoI extraction based on HSV difference.
Then the h and w values are compared, and the larger value is taken as the final side length s. Finally, the new upper left corner coordinates are calculated according to Equation (1).
After calculating the coordinates of the upper left corner of the anti-compression frame, the necessary information of the extracted area, namely (a, b) and the side length s, can be obtained. The process is shown in Figure 13.
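Equation (1) is not reproduced in this text, so the following sketch only illustrates one plausible reading of the three steps above. It assumes that the square of side s = max(h, w) is centered on the original box, that pixel coordinates are integers with the origin at the upper left corner of the image, and that negative coordinates are clamped to zero; all three are assumptions rather than the authors' stated formula.

```python
def anti_distortion_box(x, y, w, h):
    """Expand an (x, y, w, h) box to a square of side s = max(w, h).

    Assumption: the square is centered on the original box, so the new
    upper left corner is shifted by half of the added width or height.
    Coordinates are clamped at 0 so the box does not start outside the image.
    """
    s = max(w, h)
    a = max(0, x - (s - w) // 2)
    b = max(0, y - (s - h) // 2)
    return a, b, s


# A tall, narrow box (hypothetical values) becomes a 120 x 120 square.
a, b, s = anti_distortion_box(100, 50, 40, 120)
```

Cropping the square region (a, b, s) and resizing it to 416 × 416 then preserves the ship's aspect ratio, at the cost of including some extra background. Clamping against the right and bottom image borders is omitted here for brevity.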

Ablation Experiment
For vessel classification, the YOLOv3 algorithm is used as the core for building the proposed network. The main idea is first to classify the vessels using the YOLO-tiny algorithm alone, then to use YOLO-tiny as the core of the HSV-based method to detect the ships, and finally to analyze the results. The main principles are as follows:
1. The training and testing sets are collected from Google Earth, with 560 samples covering the categories of tankers, bulk carriers, and container ships. This small set is used to train and test the YOLO-tiny and HSV-based YOLO-tiny algorithms.
2. A small training set (500 training samples) is used to train the network on the YOLO-tiny framework to obtain a weight file. YOLOv3-tiny and the improved HSV-based YOLOv3-tiny use the same weight file for testing.
3. To evaluate the performance of the proposed method on a lightweight dataset that provides only limited samples, we use the HRSC2016 dataset [36]. It is a public high-resolution ship dataset that covers bounding-box labeling and three-level classes: ship, ship category, and ship type. The HRSC2016 dataset contains images from two scenarios, ships at sea and ships close inshore. The dataset is derived from Google Earth images and associated annotations. The properties of the HRSC2016 dataset are shown in Table 1.

The YOLOv3 of the RoI network based on the HSV difference and the standalone YOLOv3 network use the same trained weights described above, and all remaining variables are identical except for the RoI extraction network. The same pictures are used as the input of the two networks to obtain the test results.

For the number Np used in the background removal step, values from 30 to 120 were tested, and we finally set Np = 100, which provided the best performance. When Np is smaller than 100, the proportion of pixels captured as ships is higher, which indicates that the calculated average value cannot represent the average value of the background. When Np is higher than 100, the results of the background removal are similar. For the noise removal step, we set the size of the Gaussian blur kernel to 9 × 9.
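To make the role of Np more concrete, the sketch below estimates an average background color from Np sampled pixels and marks every pixel whose HSV values differ strongly from that average. It is a simplified illustration and not the implementation used in the experiments: reading Np as the number of sampled pixels, the random sampling, the per-channel `threshold`, and the function name `background_mask` are all assumptions, hue is averaged linearly although it is a circular quantity, and the 9 × 9 Gaussian blur of the noise removal step is not included.

```python
import colorsys
import random


def background_mask(pixels, n_p=100, threshold=0.25, seed=0):
    """Return a 2-D list of booleans, True where a pixel is a ship candidate.

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    n_p: number of pixels sampled to estimate the background (assumed
        meaning of Np).
    threshold: hypothetical limit on the per-channel HSV difference.
    """
    # Convert every pixel to HSV with all channels scaled to 0..1.
    hsv = [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for (r, g, b) in row]
        for row in pixels
    ]

    # Sample n_p pixels and average them as the background estimate.
    flat = [p for row in hsv for p in row]
    rng = random.Random(seed)
    sample = rng.sample(flat, min(n_p, len(flat)))
    mean = tuple(sum(p[i] for p in sample) / len(sample) for i in range(3))

    # A pixel is kept as a candidate if any channel deviates strongly.
    return [
        [max(abs(p[i] - mean[i]) for i in range(3)) > threshold for p in row]
        for row in hsv
    ]


# Toy image: uniform blue sea with a 2 x 2 gray "ship".
sea, ship = (10, 60, 120), (200, 200, 200)
image = [[sea for _ in range(20)] for _ in range(20)]
for row in (5, 6):
    for col in (5, 6):
        image[row][col] = ship

mask = background_mask(image, n_p=100)
candidates = sum(cell for row in mask for cell in row)
```

With too few samples, a handful of ship or noise pixels can pull the average away from the true background, which is consistent with the behavior reported for small Np.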

Data Analysis
In the small-sample experiments, the missed detection rate of YOLOv3 is 35.54% and that of HSV-YOLOv3 is 16.53%, so the missed detection rate of the improved method drops by 19.01 percentage points. The accuracy rate of YOLOv3 is 70.25% and that of HSV-YOLOv3 is 86.44%, an increase of 16.19 percentage points. Figures 15 and 16 show an evaluation of our method compared to YOLO, using the same weights, on tiny objects, multiple objects, and large objects. The left column is the ground truth, the middle column is the test result of the original YOLO network, and the last column is the test result of H-YOLO. The proposed method addresses the cases in which a ship is not identified or its location cannot be obtained.
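The reported improvements are differences between two percentages, that is, percentage points. The short Python check below recomputes them from the four rates given above; the relative reduction of the missed detection rate is an additional derived figure that is not stated in the original text, and the function name is hypothetical.

```python
def percentage_point_change(before, after):
    """Difference between two rates given in percent, in percentage points."""
    return round(after - before, 2)


# Missed detection rate: YOLOv3 vs. HSV-YOLOv3 (values from the text).
missed_change = percentage_point_change(35.54, 16.53)   # -19.01 points

# Accuracy rate: YOLOv3 vs. HSV-YOLOv3 (values from the text).
accuracy_change = percentage_point_change(70.25, 86.44)  # +16.19 points

# Derived, not reported in the text: relative reduction of missed detections.
relative_missed_reduction = round(-missed_change / 35.54 * 100, 1)  # 53.5 %
```

In relative terms, roughly half of the detections missed by YOLOv3 are recovered on this test set.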
For the timing analysis, the network is split according to the RoI extraction steps based on the HSV difference: background removal based on the HSV difference, noise removal, target frame selection, and the deletion step. For the background removal step, Figure 17 shows that all 400 pieces of test data take under 0.01 s; the time required for the target frame selection step is similar to that of the background removal step and is also under 0.01 s. The average time of both steps fluctuates around 0.003 s. The longest step on average is the noise removal step, with a median value of around 0.023 s. The determinant of its duration is the upper limit value T in switching condition 1: the larger the value of T, the longer the time needed.

All the pictures in the training sample set were tested and recorded to obtain the box plot shown in Figure 18. The figure compares the time required for each step with the time required for the YOLOv3 structure to identify a picture; the former is approximately 20% of the latter.

Figure 18. The RoI extraction steps based on the HSV difference are compared with the YOLOv3 framework recognition time. Axis: Time (s); categories: Background_Remove, Noise_Remove, Find_Box, Compare_Del, YOLO_Time.

Conclusions
The research presented in this paper shows that region of interest extraction and preprocessing based on the HSV difference can improve ship detection accuracy at a relatively short computation time. The experimental data show that this method achieved a good recognition rate and better performance than its core algorithm. The method is particularly suitable for detection tasks with a simple background or a continuous color space in a local area. The HSV difference operation is computationally efficient and offers high precision when pre-extracting a target. The experiments also show that the proposed method can generate high-quality RoIs at the cost of little computing time. The proposed method also contains a pipeline to deal with noise information and to prevent the method from falling into meaningless calculations. This mechanism ensures the efficiency of the proposed method. The image processing method used in the HSV operation module also affects the processing time, and we will compare different processing methods in future work. The YOLOv3 algorithm used in the network can be replaced by other one-stage algorithms such as SSD, so a performance comparison with different framework algorithms will also be carried out in further work.