Small-Sized Ship Detection Nearshore Based on Lightweight Active Learning Model with a Small Number of Labeled Data for SAR Imagery

Abstract: Marine ship detection by synthetic aperture radar (SAR) is an important remote sensing technology. The rapid development of big data and artificial intelligence technology has facilitated the wide use of deep learning methods in SAR imagery for ship detection. Although deep learning can achieve a much better detection performance than traditional methods, it is difficult to achieve satisfying performance for small-sized ships nearshore due to the weak scattering caused by their material and simple structure. Another difficulty is that a huge amount of data needs to be manually labeled to obtain a reliable CNN model. Manually labeling each datum not only takes too much time but also requires a high degree of professional knowledge. In addition, land and islands with high backscattering often cause high false alarm rates for ship detection in the nearshore area. In this study, a novel method based on candidate target detection, boundary box optimization, and a convolutional neural network (CNN) embedded with an active learning strategy is proposed to improve the accuracy and efficiency of ship detection in nearshore areas. The candidate target detection results are obtained by global threshold segmentation. Then, the strategy of boundary box optimization is defined and applied to reduce the noise and false alarms caused by island and land targets as well as by sidelobe interference. Finally, a lightweight CNN embedded with an active learning scheme is used to classify the ships using only a small labeled training set. Experimental results show that the proposed method achieves 97.78% accuracy and a 0.96 F1-score for small-sized ship detection with Sentinel-1 images in complex nearshore areas.


Introduction
Ship detection in SAR images plays a vital role in marine transportation and dynamic surveillance applications. Therefore, monitoring marine activity quickly and efficiently by the use of the remote sensing technique, which can be used to observe the Earth at a large scale, is important. Compared with optical remote sensing, SAR as an active remote sensing technique is an adequate approach for ship detection, as it is not only sensitive to water and hard targets but also works during daytime and nighttime, and in all weather conditions [1,2]. Fortunately, many SAR satellites, such as RADARSAT-1/2, TerraSAR-X, Sentinel-1A/B, ALOS-PALSAR, COSMO-SkyMed, and Gaofen-3, have been successfully launched in recent years, and are now providing many images in different modes and polarizations for maritime applications and ship detection.
In previous studies, the constant false alarm rate (CFAR) method, a classical target detection approach, has commonly been used for ship detection [3][4][5]. However, CFAR requires a sliding window when processing SAR images. Moreover, the setting of the protection and background window sizes not only affects the detection result but also takes [...] satisfying performance due to the fact that most of the small-sized ships are nonmetallic fishing boats, and that generating strong dihedral angle scattering is hard due to their simple structure, material, and target wobble [21]. Another difficulty is that a large amount of labeled data is required to obtain a reliable model. Manually labeling each sample not only takes too much time but also requires a high degree of professional knowledge. Ocean surface waves, surface wind, upwelling, surface currents, eddies, and sea state can modulate and influence the ocean surface; thus, SAR images of ocean areas are relatively complex [22,23]. Therefore, manually labeling all samples of ships in different conditions is difficult. Although labeling more data can improve the quality of deep neural networks, the number of samples that can be labeled manually is limited. A new training strategy should therefore be adopted to obtain a stable ship detection model with a small number of labeled data. Although CNNs are data-driven, the quality of the data is as important as the quantity. If the dataset contains ambiguous examples that are difficult to label accurately, the effectiveness of the model will be reduced. Active learning selects the data that the model considers most valuable for labeling, updates the model, and repeats the process until the results are sufficiently good. Thus, inspired by active learning [24][25][26], a model is proposed that asks humans to annotate the data that it considers uncertain.
Models trained by active learning strategy are not only faster to train but also can converge to a good final model by using fewer data. An uncertainty-based approach [27], a diversity-based approach [28], and expected model change [29] are three major ways to select the next batch to be labeled [24,30]. Various methods for applying active learning to deep networks have been proposed recently; however, almost all of them are either designed specifically for their target tasks or operationally inadequate for large networks.
In this study, an improved two-stage ship detection method based on an active learning scheme with a small number of labeled samples is proposed. First, an exponential inverse cumulative distribution function [20] is employed to estimate the segmentation threshold and obtain candidate detection results. Then, the candidate detection results are optimized by the rule of boundary box distance. Finally, the slices of the candidate detection results are input into the lightweight CNN with embedded active learning scheme to accurately recognize the ships while labeling only a small number of training data.
The main contributions of this study are as follows:
1. The boundary box distance is proposed to further optimize the candidate targets, which makes the boundaries of the candidate targets more reasonable;
2. In the training stage, the proposed method can achieve better performance with a small number of labeled data;
3. In the ship detection stage, the proposed method is suitable for detecting small-sized ships nearshore.

Methods
A strategy that combines deep learning with active learning is proposed, as shown in Figure 1, to reduce the volume of labeled training samples and the labeling cost for ship detection. Figure 1 shows a typical example of a deep active learning model architecture, and Algorithm 1 shows the training strategy. A huge number of unlabeled data $U_N$ is obtained, where the subscript $N$ indicates the number of data samples. $K$ samples are randomly selected from the unlabeled pool and annotated manually. Then, an initially labeled dataset $L_K^0$ is constructed. We define the remaining unlabeled dataset pool as $U_{N-K}^0$. The index 0 refers to the initial stage. As soon as the labeled dataset $L_K^i$ is obtained, the loss function evaluates all the data in the unlabeled pool $U_{N-K}^i$ to obtain the data loss. The top-$K$ data with the highest prediction loss are labeled and then added to the labeled training set. After $L_K^i$ is updated with the samples with the $K$ highest losses, it becomes $L_{2K}^i$, and the unlabeled pool is reduced and denoted as $U_{N-K}^i$ at the same time. This cycle is repeated until the label budget is exhausted [24][25][26].
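The selection cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch of the pool bookkeeping, not the authors' implementation: the function names, the `score_fn` callback (standing in for the loss evaluated on unlabeled data), and the toy sizes are assumptions introduced here, and the model training step is left as a comment.

```python
import numpy as np

def select_top_k(losses, unlabeled_idx, k):
    """Return the k unlabeled indices with the highest (predicted) loss."""
    order = np.argsort(losses)[::-1][:k]
    return [unlabeled_idx[i] for i in order]

def active_learning_cycles(n_samples, k, budget, score_fn, seed=0):
    """Hypothetical bookkeeping of the labeled set and the unlabeled pool.

    score_fn(indices) must return one loss value per unlabeled index.
    """
    rng = np.random.default_rng(seed)
    # Initial labeled set: K randomly chosen samples, annotated manually.
    labeled = rng.choice(n_samples, size=k, replace=False).tolist()
    labeled_set = set(labeled)
    unlabeled = [i for i in range(n_samples) if i not in labeled_set]
    history = [(len(labeled), len(unlabeled))]
    while len(labeled) + k <= budget and len(unlabeled) >= k:
        # (Training of the target model on `labeled` would happen here.)
        losses = np.asarray(score_fn(unlabeled), dtype=float)
        picked = select_top_k(losses, unlabeled, k)
        labeled.extend(picked)          # these K samples are sent for labeling
        picked_set = set(picked)
        unlabeled = [i for i in unlabeled if i not in picked_set]
        history.append((len(labeled), len(unlabeled)))
    return labeled, unlabeled, history

if __name__ == "__main__":
    # Toy scoring function (assumption): the loss grows with the sample index.
    toy_score = lambda idx: np.asarray(idx, dtype=float)
    _, _, sizes = active_learning_cycles(20, 5, 15, toy_score)
    print(sizes)
```

Each cycle moves exactly K samples from the unlabeled pool to the labeled set, so the loop stops either when the label budget is reached or when the pool is exhausted.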
Algorithm 1. Training strategy.
1. Train the lightweight CNN with embedded active learning scheme, and optimize it by stochastic gradient descent. The loss is calculated from the target loss and the loss predicted by the loss prediction module.
2. Then, obtain the uncertainty from the data samples with the K highest losses.
3. Update the labeled dataset $L_{2K}^i$ and the unlabeled dataset $U_{N-K}^i$, respectively.
These steps are repeated in every active learning cycle.
The loss prediction module is the core of active learning for this task, as it learns to imitate the loss defined in the model. This section describes how we design and improve the M-LeNet and ResNet models to make them suitable for active learning. The ResNet network architecture with its residual learning framework has been proven to reduce the training error, converge quickly, and avoid overfitting; hence, ResNet18 was selected as the baseline CNN backbone target architecture [31]. The four convolution blocks of ResNet18 are selected for the loss prediction module. The size of the first convolution kernel is changed from 5 to 3 to obtain detailed feature information in the ResNet18 network architecture. Figure 2 shows that the improved ResNet18 contains the baseline target backbone (blue dashed rectangle box) and the loss prediction module (red dashed rectangle box). The mid-level feature map blocks of the improved ResNet18 target backbone model are used as the input of the loss prediction module. Then, each feature map of the loss prediction module is passed through a global average pooling layer, a fully connected layer, and a rectified linear unit layer. Finally, the total loss prediction is obtained by concatenating target loss and prediction loss. The loss prediction module is much smaller than the ResNet18 target backbone and is learned jointly with it.
The performance of ship detection in land-contained sea areas was previously discussed in [19]. First, the candidate targets containing ships and false alarms were obtained by the CFAR method. Second, a dataset of 2286 ships and 2276 false alarms was constructed. Third, a CNN model was trained with the constructed dataset, and the final model was used to predict the ships [19]. Different from the candidate target method in [19], a method based on the exponential inverse cumulative distribution function is used here to obtain candidate targets, which was proven to be faster and reasonable under different scenes [20]. The low-complexity and lightweight M-LeNet was previously proven to be effective for ship detection in the nearshore area [20]. Thus, the M-LeNet model in [20] is improved in the present study to serve as the baseline backbone target module and loss prediction module in active learning, as shown in Figure 3. The two convolution blocks of M-LeNet are selected for the loss prediction module. The network contains two convolutional layers and has fewer parameters than the classical object detectors. Thus, the improved M-LeNet has a baseline backbone target module (blue dashed rectangle box) and a loss prediction module (red dashed rectangle box) consisting of blocks from the mid-level feature maps, as shown in Figure 3. Then, each feature map of the loss prediction module is passed through a global average pooling layer, a fully connected layer, and a rectified linear unit. Finally, the total loss prediction is obtained and jointly learned by concatenating target loss and prediction loss.
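The data flow of the loss prediction module (global average pooling, fully connected layer, and rectified linear unit per feature map, followed by concatenation) can be illustrated with a small NumPy forward pass. This is a sketch under assumptions rather than the authors' network: the channel counts, the hidden width, and the final linear layer that maps the concatenated features to one scalar per sample are hypothetical choices made here for illustration.

```python
import numpy as np

def loss_prediction_forward(feature_maps, fc_weights, fc_biases,
                            out_weight, out_bias):
    """Forward pass of a hypothetical loss prediction head.

    feature_maps: list of arrays shaped (batch, channels, height, width),
                  one per mid-level block of the target backbone.
    fc_weights[i]: array shaped (channels_i, hidden); fc_biases[i]: (hidden,).
    out_weight: (hidden * n_blocks, 1); out_bias: (1,)  [assumed final layer].
    """
    branch_outputs = []
    for fmap, w, b in zip(feature_maps, fc_weights, fc_biases):
        pooled = fmap.mean(axis=(2, 3))            # global average pooling
        hidden = np.maximum(pooled @ w + b, 0.0)   # fully connected + ReLU
        branch_outputs.append(hidden)
    merged = np.concatenate(branch_outputs, axis=1)  # concatenate the branches
    return (merged @ out_weight + out_bias).ravel()  # one predicted loss per sample

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two mid-level feature maps with hypothetical channel counts 6 and 16.
    fmaps = [rng.random((4, 6, 14, 14)), rng.random((4, 16, 5, 5))]
    weights = [rng.normal(size=(6, 8)), rng.normal(size=(16, 8))]
    biases = [np.zeros(8), np.zeros(8)]
    pred = loss_prediction_forward(fmaps, weights, biases,
                                   rng.normal(size=(16, 1)), np.zeros(1))
    print(pred.shape)
```

Because each branch only pools and projects its feature map, the head adds few parameters compared with the backbone, which is in line with the statement that the loss prediction module is much smaller than the target model.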
Remote Sens. 2021, 13, x FOR PEER REVIEW 5 of 22
Given a training data point $x$, a backbone target module $f_{target}$, and a loss prediction module $f_{loss}$, the target prediction is obtained by $\hat{y} = f_{target}(x)$ and the predicted loss by $\hat{l} = f_{loss}(h)$, where $h$ denotes the mid-level feature map blocks of the improved ResNet18 or M-LeNet target backbone model. With the annotated data $y_t$ corresponding to the input data $x$, the target loss used to learn the target model is calculated by $l = L_{target}(y_t, \hat{y})$. As the loss $l$ is the ground-truth target of $h$ for the loss prediction module, the loss of the prediction module is computed by $L_{loss}(l, \hat{l})$. Then, the final total loss function is defined as Equation (1), which is jointly learned by the target backbone model and the loss prediction module [24]:
$$L_{total} = L_{target}(y_t, \hat{y}) + \lambda \cdot L_{loss}(l, \hat{l}) \quad (1)$$
where $\lambda$ is set to 1 in the experiment.
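As a numerical illustration of how the target loss and the loss of the prediction module are combined with the weight λ, the following sketch may help. It rests on two assumptions that the text does not state: the target loss is taken to be a per-sample cross-entropy, and the loss of the prediction module is replaced by a simple mean squared error between the true and predicted losses. The form used in [24] may differ, so this should not be read as the authors' loss.

```python
import numpy as np

def cross_entropy_per_sample(probs, labels):
    """Per-sample cross-entropy (assumed form of the target loss l)."""
    eps = 1e-12  # avoids log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def total_loss(probs, labels, predicted_loss, lam=1.0):
    """Combine target loss and loss-prediction loss with weight lam."""
    target = cross_entropy_per_sample(probs, labels)       # l
    # Assumed stand-in for L_loss(l, l_hat): mean squared error.
    loss_pred = np.mean((target - predicted_loss) ** 2)
    return target.mean() + lam * loss_pred

if __name__ == "__main__":
    probs = np.array([[0.9, 0.1], [0.3, 0.7]])   # toy class probabilities
    labels = np.array([0, 1])                    # toy ground truth (0: false alarm, 1: ship)
    print(total_loss(probs, labels, predicted_loss=np.array([0.2, 0.3])))
```

With λ = 1, as in the experiment, both terms contribute equally; a perfect loss prediction makes the second term vanish and leaves only the mean target loss.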

Dataset
The dataset is constructed from Level-1 Sentinel-1 Ground Range Detected product data acquired over the East China Sea [20]. Ship detection performs better in VH polarization than in VV polarization, as the speckle noise and false alarms of VV polarization affect vessel-detection results more easily than those of cross-polarization [20,32]. Hence, the VH polarization image is used for ship detection. The training dataset comes from VH polarization and contains slices of 2099 false alarms and 1566 ships of different scales, as listed in Table 1. The false alarms are mainly caused by bridges, lighthouses, buildings, small islands, reefs, and rocks, as well as ghosts caused by azimuth ambiguity, as shown in Figure 4a. The ships are mainly large, vary in size, and have a strong scattering intensity, as shown in Figure 4b. Figure 5 shows some Google Earth ground truth and the corresponding false alarm candidates.
Using the dataset constructed by [20], we train the lightweight CNN with embedded active learning scheme. Then, another two images, located in the Qiongzhou Strait and the East China Sea, are used for candidate detection through data preprocessing and to test the efficiency of the CNN with embedded active learning scheme with a few annotated training samples. The details of the SAR images, including the acquisition time, swath width, and imaging mode, are listed in Table 2.

Training Details
The experiments are conducted on a workstation running the Ubuntu 14.04 operating system, equipped with a TITAN Xp GPU with 12 GB of memory and a Xeon W-2100 CPU with 32 GB of RAM. For each active learning method, we repeat the same experiment multiple times with different labeled sample dataset settings until the unlabeled dataset is exhausted. In each active learning cycle, we use stochastic gradient descent to optimize the baseline backbone and the loss prediction module. The hyperparameters, namely the initial learning rate, epochs, batch size, momentum, and weight decay, were set to 0.01, 50, 32, 0.9, and 0.0005, respectively. After 30 epochs, the initial learning rate is divided by 10. The number of cycles depends on the number of unlabeled samples, but the total number of epochs is 1000 when iterating over all unlabeled samples with the active learning training strategy. For the supervised learning strategy with the M-LeNet and ResNet18 methods, the hyperparameters are set to be the same as those of the active learning method, except that the number of epochs is set to 1000. This setting is used to compare the efficiency of the active learning and supervised learning strategies under the same hyperparameters. After every 200 epochs, the learning rate is divided by 10. The support vector machine (SVM) and random forest (RF) were set with the default parameters of scikit-learn in Python. The input data in the experiments were normalized to the range from 0 to 1 to remove the effects of unit and scale differences between features. In the CNN-based methods, such as the improved M-LeNet and ResNet18 with active learning strategy, the size of the input data is 32 × 32 × 1, whereas in the SVM and RF, the data are stretched into a one-dimensional vector with a size of 624 × 1.
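The learning rate schedule within one active learning cycle and the input normalization can be written down compactly. The sketch below is illustrative only: it assumes zero-based epoch counting (so that the reduction applies from the 31st epoch onward) and a per-slice min-max normalization, neither of which is specified in the text, and the helper names are hypothetical.

```python
import numpy as np

def step_lr(epoch, base_lr=0.01, drop_epoch=30, factor=10.0):
    """Learning rate for a zero-based epoch index: divided by `factor`
    once `drop_epoch` epochs have been completed."""
    return base_lr / factor if epoch >= drop_epoch else base_lr

def min_max_normalize(slice_):
    """Scale one image slice to the range [0, 1] (assumed per-slice scaling)."""
    data = np.asarray(slice_, dtype=np.float64)
    lo, hi = data.min(), data.max()
    if hi == lo:                      # constant slice: avoid division by zero
        return np.zeros_like(data)
    return (data - lo) / (hi - lo)

if __name__ == "__main__":
    # Learning rates over the 50 epochs of one cycle.
    rates = [step_lr(e) for e in range(50)]
    print(rates[0], rates[-1])
    # A random stand-in for a 32 x 32 single-channel slice.
    patch = np.random.default_rng(0).random((32, 32)) * 255.0
    normalized = min_max_normalize(patch)
    print(normalized.min(), normalized.max())
```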

Evaluation Indexes
The evaluation indicators of accuracy, precision, recall, and F1-score are introduced to evaluate the performance of the different models, as shown in Equations (2)-(5). The F1-score is the harmonic mean of precision and recall, which is widely used in the field of remote sensing classification and target extraction, and it is more informative than precision or accuracy alone.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$
$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (4)$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (5)$$
where true positive (TP) means that a positive sample (the ship) is accurately predicted; true negative (TN) means that a negative sample (the false alarm) is accurately identified; false positive (FP) means that the true category is not a ship, but the predicted category is a ship; false negative (FN) means that the true category is a ship, but the predicted category is not a ship.
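The four indicators follow directly from the confusion counts defined above. A minimal sketch is given below; the function name and the counts in the usage example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

if __name__ == "__main__":
    # Hypothetical counts: 90 ships found, 80 false alarms rejected,
    # 10 false alarms accepted as ships, 20 ships missed.
    print(classification_metrics(tp=90, tn=80, fp=10, fn=20))
```

The guards against zero denominators matter in practice for small test sets, where a model may predict no ships at all.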

Candidate Detection
Two sub-images of 2855 × 2144 and 7833 × 5884 pixels are clipped from images No. 1 and No. 2 to verify the accuracy of our method. Figures 6 and 7 show the VH intensity images of the nearshore area of the Qiongzhou Strait and of the East China Sea area, respectively. The background of the Qiongzhou Strait area is relatively simple, with land, islands, and ships. However, the background of the East China Sea area is complex, with land, radio-frequency interference (RFI) [33], ships, islands, and reefs [19,20], as well as the noise effects in VH polarization [34]. CFAR, Otsu, spectral residual, and corner detection are often used to obtain candidate detection results. However, these methods are ineffective when the difference between the object and the background varies greatly. The method in [20] is used in the current work to obtain candidate targets, which reduces the additional calculations and ensures that sufficient candidate targets are obtained. However, there were some invalid candidate targets due to strong scattering or ship-like structures. Figures 8 and 9 show the candidate detection results of the two sub-images, including the ships and the false alarms caused by land targets, islands, and reefs. A total of 322 candidate targets containing ships and false alarms were obtained by the preprocessing candidate detection, as shown in Figures 8 and 9. A total of 18 true ships in the Qiongzhou Strait area and 79 true ships in the East China Sea area were identified by Google Earth and SAR image interpretation.
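To give an idea of how a global threshold can be derived from an exponential inverse cumulative distribution function, the sketch below assumes that the background intensity follows an exponential distribution with mean μ, so that the threshold for a chosen false alarm probability P_fa is T = −μ ln(P_fa). The exact parameterization used in [20] is not given in this text; the function names and the value of P_fa are assumptions made for illustration.

```python
import numpy as np

def exponential_threshold(intensity, pfa=1e-3):
    """Global threshold from the inverse CDF of an exponential distribution.

    For mean mu, F(x) = 1 - exp(-x / mu), hence the value exceeded with
    probability pfa is T = -mu * ln(pfa). `pfa` is a hypothetical setting.
    """
    mean = float(np.mean(intensity))
    return -mean * float(np.log(pfa))

def candidate_mask(intensity, pfa=1e-3):
    """Binary map of pixels brighter than the global threshold."""
    return np.asarray(intensity) > exponential_threshold(intensity, pfa)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sea = rng.exponential(scale=1.0, size=(200, 200))  # simulated sea clutter
    sea[100:103, 100:104] = 60.0                       # simulated bright target
    mask = candidate_mask(sea, pfa=1e-4)
    print(int(mask.sum()))
```

A single global threshold avoids the sliding-window cost of CFAR, at the price of admitting bright land and island pixels, which is why the later optimization and classification stages are needed.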

Boundary Box Optimization
Candidate targets on the binary map can be discontinuous because of speckle noise and sidelobe interference. Therefore, the bounding boxes of candidate targets may be inaccurate, as shown in Figure 10. Figure 10a,b show the bounding boxes of candidate targets without optimization. Further processing steps are applied to improve the quality of the candidate target bounding boxes.
The first quality improvement addresses candidate targets caused by sidelobe interference [35,36]. For a strongly scattering target, weak scattering features appear around the strong scatterers because of sidelobe interference. The sidelobe of a strong scattering target can be quite high and in many cases can be mistaken for a ship [35]. Hence, when an eight-connected or four-connected method is used to identify the target region of interest and obtain the bounding boxes of candidate targets, multiple bounding boxes will be generated for the same strongly scattering target. If the amplitude of one target is significantly higher than that of another, then the bounding box of the weak scattering target close to the strong scattering target should be suppressed, because it is attributed to the high sidelobe of the strong scattering target. The bounding boxes should be optimized to reduce the number of bounding boxes caused by sidelobe interference for the same candidate target and to obtain accurate bounding boxes. Figure 11 shows the eight situations of bounding boxes. The coordinates of the red bounding box are $X1_{top\_left}$, $Y1_{top\_left}$, $X1_{bottom\_right}$, and $Y1_{bottom\_right}$, and the other, irrelevant bounding boxes are drawn in green and blue with coordinates $X2_{top\_left}$, $Y2_{top\_left}$, $X2_{bottom\_right}$, and $Y2_{bottom\_right}$. The distance rule, defined in Table 3, is introduced to reduce the irrelevant bounding boxes. If the distance is less than 12, the bounding boxes are merged and updated by Equations (6)-(9). Figure 10c,d show the optimization result of the candidate target bounding boxes. After the optimization, the positions of the candidate target bounding boxes are more accurate and reasonable.
The second quality improvement addresses candidate targets caused by noise. The area and side length of the bounding box of each candidate target are used to remove invalid candidate targets. Candidate targets with an area of less than 10 pixels or a side length greater than 180 pixels are considered noise or false alarms and are subsequently removed.
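The two optimization steps can be sketched as follows. Since Table 3 and Equations (6)-(9) are not reproduced in this text, the sketch substitutes plausible stand-ins that should be treated as assumptions: the distance between two boxes is taken as the larger of their horizontal and vertical gaps, merged boxes are replaced by their enclosing box, and the area criterion is applied to the bounding box area. Only the thresholds 12, 10, and 180 come from the text.

```python
from typing import List, Tuple

# (x_top_left, y_top_left, x_bottom_right, y_bottom_right)
Box = Tuple[int, int, int, int]

def box_gap(a: Box, b: Box) -> int:
    """Assumed distance rule: larger of the horizontal and vertical gaps
    (0 when the boxes overlap along that axis)."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)

def enclosing_box(a: Box, b: Box) -> Box:
    """Assumed merge rule: smallest box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_close_boxes(boxes: List[Box], max_gap: int = 12) -> List[Box]:
    """Repeatedly merge any pair of boxes closer than max_gap."""
    boxes = list(boxes)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) < max_gap:
                    boxes[i] = enclosing_box(boxes[i], boxes[j])
                    del boxes[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return boxes

def filter_boxes(boxes: List[Box], min_area: int = 10,
                 max_side: int = 180) -> List[Box]:
    """Drop boxes that are too small (noise) or too long (land, islands)."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        width, height = x1 - x0, y1 - y0
        if width * height >= min_area and max(width, height) <= max_side:
            kept.append((x0, y0, x1, y1))
    return kept

if __name__ == "__main__":
    candidates = [(10, 10, 30, 30), (35, 12, 40, 20), (100, 100, 120, 120)]
    print(filter_boxes(merge_close_boxes(candidates)))
```

Merging is repeated until no pair is closer than the threshold, because a merged box can come within range of a third box that neither original box was close to.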

Figure 11. Position of bounding boxes (red box: the main box; blue and green boxes: the boxes caused by sidelobe interference that need to be merged).

Effect of the Size of the Initial Labeled Training Set
In this experiment, we randomly select 3000 unlabeled slices from the dataset to form a training set $U_N$. The other slices are labeled and form the validation set. The test set comes from the other SAR images. We initialized a labeled dataset $L_k^0$ ($k$ = 50, 100, 150, 200, 250, 300) with different sizes in the training stage to analyze the effect of the size of the labeled set on the detection results. Taking a labeled training set size of 50 as an example, 50 slices are chosen from the 3000 unlabeled slices and then input into the proposed learning model. Each of the 50 slices is labeled with the class with the maximum probability. The newly labeled slices are added to the labeled dataset, and the model is trained again on the labeled set; this is repeated until all the unlabeled slices are labeled. The same process is conducted for the labeled sets of other sizes. Figures 12 and 13 show the change in the training accuracy with the initial number of labeled slices for M-LeNet and ResNet18 with embedded active learning.
The training accuracy in the initial stage for M-LeNet with embedded active learning is higher when the number of initially labeled training samples is larger. In addition, the profiles show that the initial training accuracy increases rapidly with the size of the labeled training set, and it reaches 95% when the labeled size is larger than 500, except for the case of 250 initially labeled samples in Figure 12. The accuracy tends to converge when the size of the training set is higher than 1500. A similar phenomenon is shown in Figure 13: the training accuracy in the initial stage for ResNet18 with embedded active learning is higher when the number of initially labeled training samples is increased. The difference is that ResNet18 shows better performance than M-LeNet. The ResNet network architecture was proposed by He et al. [30], and it achieved the best performance in the ILSVRC 2015 classification task. With 50 initially labeled training samples, the accuracy of ResNet18 is close to 90% at the first stage. The accuracy and convergence speed improve faster with increasing training samples than those of M-LeNet with embedded active learning. The outstanding performance also illustrates that the ResNet architecture is suitable for ship detection.
In Table 4, the training time and accuracy with different sizes of initially labeled training sets are listed. In the iterative training process of M-LeNet and ResNet18 with active learning, the accuracy of each iteration is recorded, and the minimum, maximum, and average accuracy, as well as the running time, are computed over all the iterations. Meanwhile, the supervised learning strategy, with a large labeled dataset of 2932 manually labeled training samples and 733 test samples, was used to train the M-LeNet, ResNet18, RF, SVM, and CNN methods. During the testing stage, the minimum, maximum, and average accuracy, as well as the running time, are calculated. Table 4 also shows that the training time is highly correlated with the initially labeled set size. The model training time is shorter when the initially labeled set is larger. The maximum and the average accuracy rates exceed 98% and 97%, respectively, in all the experiments with different sizes of the initial labeled training set.
Data-driven M-LeNet and ResNet18 in the supervised learning mechanisms, as well as SVM and RF in the machine learning mechanisms, are also used for evaluation. In those methods, the ratio of initial training and test samples is 8:2. The highest and the average accuracy rates of ResNet18 exceed 99% and 97%, which are better than those of ResNet18 with embedded active learning mechanisms. The highest and the average accuracy rates of M-LeNet exceed 98% and 97%, which are better than those of M-LeNet with embedded active learning mechanisms. The RF and SVM show a lower accuracy of 95%, which is less than that of the CNN model. Similar performances of RF, SVM, and CNN are also shown in [19].
In the active learning mechanism, the accuracy gradually becomes better and rapidly converges with increasing numbers of labeled samples. However, the accuracy oscillation is relatively large compared with those of other models in the M-LeNet with embedded active learning mechanisms when the number of samples initially labeled is set as 150 and 250. When the data-driven CNN is applied to classification and recognition tasks, the steps are to train the model with a great number of labeled samples, obtain feedback from the model, and then adjust the parameters, continue to label the data, or modify the model architecture according to its performance until it meets the requirements. However, active learning is to train the model during the data labeling process. Thus, the quality of the data strongly influences the model. Under the condition of initially labeled 150 and 250 training samples in M-LeNet with active learning mechanism, or initially labeled 200 and 250 training samples in ResNet18 with active learning mechanism, the accuracy increases to a certain point and then suddenly decreases, and finally the accuracy increases and converges again during the iterative process. The reason for the decrease in accuracy is that the ships and false alarms are not completely and effectively distinguished, and there are samples which are difficult to separate, resulting in labeling errors. Thus, compared with convergence when the initial label sample size settings of 50 and 100, when the number of initially labeled samples is a set of 150 and 250, performance is slow due to the quality of data for each batch. However, as the number of samples increases, the learning ability of the model becomes strong and the accuracy increases and tends to be stable. The model architecture also influences the performance of active learning. 
ResNet18 shows better performance than M-LeNet with embedded active learning, which illustrates the effect of the model architecture. Moreover, after 1000 epochs, the performance of M-LeNet and ResNet18 trained with the active learning strategy is comparable to that obtained with the supervised training strategy, while the training time is greatly reduced.
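The iterative labeling process described above can be illustrated with a minimal pool-based active learning loop. This is only a sketch under stated assumptions, not the implementation used in the experiments: the query rule (least-confidence sampling), the batch size, and the tiny logistic-regression classifier standing in for M-LeNet or ResNet18 are all assumptions, and the names `least_confident`, `fit_logreg`, `predict_proba`, `active_learning`, and `oracle` are hypothetical.

```python
import numpy as np


def least_confident(probs, k):
    """Return indices of the k samples whose top-class probability is lowest."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]


def fit_logreg(X, y, epochs=200, lr=0.5):
    """Tiny logistic-regression stand-in for the CNN classifier (assumption)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_proba(model, X):
    """Return an (n, 2) array of class probabilities: [false alarm, ship]."""
    w, b = model
    p1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.column_stack([1.0 - p1, p1])


def active_learning(X_pool, oracle, X_test, y_test,
                    n_init=50, batch=50, rounds=5, seed=0):
    """Train, evaluate, then query labels for the least confident pool samples.

    `oracle(indices)` plays the role of the human annotator and returns
    the labels of the requested pool samples.
    """
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(X_pool), size=n_init, replace=False)
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
    history = []
    for _ in range(rounds):
        model = fit_logreg(X_pool[labeled], oracle(labeled))
        pred = predict_proba(model, X_test).argmax(axis=1)
        history.append((len(labeled), float((pred == y_test).mean())))
        if len(unlabeled) == 0:
            break
        probs = predict_proba(model, X_pool[unlabeled])
        picked = unlabeled[least_confident(probs, batch)]
        labeled = np.concatenate([labeled, picked])
        unlabeled = np.setdiff1d(unlabeled, picked)
    return history


if __name__ == "__main__":
    # Synthetic two-class data, used only to exercise the loop.
    rng = np.random.default_rng(1)
    y_all = rng.integers(0, 2, size=600)
    X_all = rng.normal(size=(600, 2)) + np.where(y_all[:, None] == 1, 2.0, -2.0)
    hist = active_learning(X_all[:400], lambda idx: y_all[:400][idx],
                           X_all[400:], y_all[400:])
    for n_labeled, acc in hist:
        print(n_labeled, round(acc, 3))
```

Because each queried batch is made up of the samples the current model finds hardest, a batch dominated by ambiguous targets can temporarily lower the test accuracy, which is one plausible reading of the oscillation discussed above.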

Comparison of the Results Derived by Different Methods
We also compared the results achieved by SVM, RF, M-LeNet, and ResNet18 to demonstrate the efficiency of the improved lightweight M-LeNet and ResNet18 with the embedded active learning scheme. The ratio of training samples to test samples is 8:2 in the supervised learning scheme, with 2932 manually labeled training samples and 733 test samples. In the active learning scheme, the number of manually labeled initial samples is set to 50 and to larger values. The backgrounds of the two sub-images used for testing are complex and contain ships at different scale levels. Figures 8 and 9 show the candidate detection results of the two sub-images, respectively. A total of 322 candidate targets, containing ships and false alarms, are obtained by preprocessing in Figures 8 and 9. A total of 18 true ships in the Qiongzhou Strait area and 79 true ships in the East China Sea area are identified by Google Earth and SAR image interpretation. Figure 14 shows the areas of the ships, obtained with the LabelImg annotation tool [37]. Among those ships, the minimum size is 6 × 7 pixels, and most of the ships are smaller than 32 × 32 pixels. Targets with an area of less than 32 × 32 pixels are classified as small objects in the MS COCO natural image dataset [38]. Most of the small-sized ships are nonmetallic fishing boats, which do not readily generate a strong scattering echo because of their simple structure, material, and target wobble [21]. Small-sized ships tend to operate in the morning (02:00-11:00) and near shore [39]. Thus, most of the ships in the two SAR images, acquired at 09:53 and 10:48 in the morning, are most likely fishing boats; they appear small and their scattering intensity is weak.
The Qiongzhou Strait area, located nearshore, is relatively simple. The quantitative assessment is listed in Table 5. For M-LeNet and ResNet18 with the active learning strategy, the model with the best accuracy in the training stage is used for evaluation.
The detection results show that the highest recall, accuracy, and F1-score of 1.0, 100%, and 1.0, respectively, are achieved by M-LeNet-50, M-LeNet-100, ResNet18-50, ResNet18-100, ResNet18-250, and ResNet18-300. With the supervised learning strategy, ResNet18 and RF achieve a recall of 1.0, an accuracy of 100%, and an F1-score of 1.0. The performance of M-LeNet-150, M-LeNet-200, and M-LeNet-250 is not as good as that of M-LeNet, ResNet18, SVM, and RF with the supervised learning strategy. Figure 15 shows the best detection results achieved by the active learning and supervised strategies.
The East China Sea area, located nearshore, is relatively complex. RFI can be observed on the left of Figure 7; it has an intensity similar to that of ships and can degrade ocean interpretation [33]. A method was previously proposed to discriminate ships from RFI based on non-circularity and non-Gaussianity [32]. However, the candidate detection results show that the preprocessing reduces the effect of RFI, as shown in Figure 9. Table 6 shows the quantitative evaluation results for the active learning and supervised learning strategies. The highest accuracy and F1-score, 97.78% and 0.96, are achieved by M-LeNet-50 and M-LeNet-150. For ResNet18, the highest accuracy and F1-score, 97.41% and 0.96, are achieved by ResNet18-50. With the supervised learning strategy, M-LeNet and ResNet18 achieve an accuracy and F1-score better than 96% and 0.94, whereas RF and SVM give the worst results. Figure 16 shows the best detection results of the active learning and supervised strategies. In the CNN detector field, objects with an area smaller than 32 × 32 pixels are defined as small objects, and most ships in the test data are much smaller than 32 × 32 pixels. The result shows that 73 true ships are detected, six true ships are undetected, and no false alarm is misclassified as a ship. The six ships are missed because their RCS is weak and some ships have backscattering similar to that of the ocean [21]. The results of RF and SVM show that some false alarms, such as islands and reefs with characteristics similar to ships, are misclassified as ships.
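As a quick consistency check, the East China Sea counts reported above (73 ships detected, six missed, no false alarms) can be turned into precision, recall, and F1-score. This is a minimal sketch that assumes the standard definitions of these metrics; the function name `detection_scores` is hypothetical. Accuracy is left out because it also requires the number of correctly rejected false alarms, which is not stated in this passage.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# East China Sea test area: 73 detected ships, 0 false alarms, 6 missed ships.
precision, recall, f1 = detection_scores(tp=73, fp=0, fn=6)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints: precision=1.00 recall=0.92 F1=0.96
```

The resulting F1-score of about 0.96 agrees with the best value reported for this area.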

Discussion
In this article, we mainly discuss CNN-based ship detection in VH polarization. As an alternative to end-to-end CNN-based detectors [8,9,40–43], we proposed a two-stage ship detection method. Although similar ship detection methods were proposed in [19,20,44], they require a large number of samples to be labeled and prepared before CNN training begins. Our method is based on improved lightweight M-LeNet and ResNet18 deep learning networks with an active learning strategy, which makes the CNN model suitable for detecting small ships with a small amount of labeled sample data and reduces labor cost. In the CNN-based active learning strategies, the initially labeled sample size affects only the initial accuracy and the training time. The performance obtained with different numbers of labeled data is similar to that reported in [24]. As the sample database is updated by the active learning strategy, the accuracy of the model gradually increases and stabilizes. He et al. [45] emphasized that convergence can be accelerated in the early stage of training by using models pre-trained on ImageNet. However, pretraining on ImageNet requires a significant amount of time and computational power, which is not always feasible [10]. Transfer learning can converge in a reasonable time with small datasets, but it ignores the differences between SAR images and natural images [9]. With a suitable active learning method and an adequate number of iterations, we can achieve satisfactory convergence. However, some ship-like structures produce characteristics similar to those of ships. In addition, the effectiveness of ship detection in SAR images is influenced by many factors, including polarimetry, image resolution, incidence angle, ocean dynamics parameters, ship size, and ship orientation [46]. In some active learning settings, the model does not appear to converge well.
Thus, poor detection results are obtained in those experiments. In the future, sea state information should be considered to obtain satisfactory convergence and improve ship detection. In addition, small-sized ships are difficult to detect because of their weak scattering and the few pixels they occupy. In future work, we will consider further optimization of the model to improve the detection of weak scattering targets by combining polarization features and scattering features. Moreover, CNN-based ship detection in SAR images has become an important technology, and several SAR ship detection methods have been proposed using RADARSAT-1/2, TerraSAR-X, Sentinel-1A/B, and GF-3 datasets [7,8,14]. However, these datasets are supported neither by AIS information nor by Google Earth images, so their annotation relies heavily on the experience of experts, which likely reduces the authenticity of the datasets [14,47]. In our experiment, the ships and false alarms are annotated by visual interpretation, expert knowledge, and Google Earth images, which is an improvement. Because AIS information is lacking, the dataset may still contain wrongly labeled samples. Hence, in the future, it will be necessary to obtain AIS information corresponding to the SAR data to improve the performance of ship detection.

Conclusions
In this article, we mainly discuss CNN-based ship detection in VH polarization. As an alternative to end-to-end CNN-based detectors, a new method was proposed for ship detection in SAR images when only a small number of training samples is available. The main steps of the proposed method are candidate target detection, boundary box optimization, and ship detection. Compared with SVM, RF, M-LeNet, and ResNet18, which need a large number of labeled samples, the proposed ship detection method, based on improved lightweight M-LeNet and ResNet18 networks with an active learning strategy, can label the training data automatically and shows high reliability with only a small number of training samples. The experimental results also show that the proposed method performs well for small-sized ships nearshore. In the future, it is worth investigating how to combine polarimetric features with the CNN to further improve the detection accuracy for small-sized ships.
Author Contributions: Conceptualization, W.S.; data curation, X.G.; methodology, X.G. and L.S.; supervision, P.L. and J.Y.; validation, X.G.; writing-original draft, X.G.; and writing-review and editing, X.G. and L.Z. All authors have read and agreed to the published version of the manuscript.