Ship Classification in SAR Imagery by Shallow CNN Pre-Trained on Task-Specific Dataset with Feature Refinement

Ship classification based on high-resolution synthetic aperture radar (SAR) imagery plays an increasingly important role in various maritime affairs, such as marine transportation management, maritime emergency rescue, marine pollution prevention and control, and marine security situational awareness.

Nevertheless, they still have some limitations in real-world applications that need to be taken seriously by researchers. One is the insufficient number of SAR ship training samples, which limits the training of a satisfactory CNN with megabytes of parameters to be learned [34][35][36]. The other is the limited information that SAR imagery can provide (compared with natural images), which limits the extraction of discriminative ship representation features.
To alleviate the bottleneck caused by insufficient training data, which hinders further improvement of ship classification accuracy, existing approaches primarily leverage two kinds of strategies. A widely adopted strategy is to pre-train CNNs on a generic dataset with massive samples (such as ImageNet [37]) and fine-tune the pre-trained networks on the target dataset (i.e., a SAR dataset) with only a small number of training samples [38][39][40] (as shown in Figure 1a). However, recent studies have shown that, due to the different imaging mechanisms of SAR imagery and natural images, it is hard to guarantee that the pre-trained CNNs (even if they perform extremely well on ImageNet) can be fine-tuned by a small SAR dataset just enough to extract the discriminative features of the ships [39]. Wang et al. claim that, considering the differences between SAR imagery and natural images (e.g., imaging mechanisms and target information), features extracted from natural images via the ImageNet pre-trained model are not suitable for SAR imagery. The other strategy, termed transfer learning (TL) or domain adaptation (DA), improves classifier training on the target dataset (i.e., the SAR dataset) by transferring knowledge from a different but related source dataset (usually another ship dataset) [41][42][43][44][45][46][47][48]. This strategy requires that the source dataset (the source domain) and the target dataset (the target domain) have an intrinsic connection [41]. Therefore, the source dataset is often deliberately chosen to be another ship dataset (usually with a different modality from the target dataset, such as optical ship images) to ensure that there is transferable knowledge between the two domains [45,46,48]. A typical scheme is to extract features in each of the two domains, map them to a common feature space, and make the two consistent by adjusting the marginal and/or conditional distribution(s), so as to compensate for the shortfall of labeled data in the target domain with a large amount of labeled data from the source domain [45,46,48]. Existing methods focus on aligning the feature-level distributions of the source and target domains to minimize the domain gap. In the case of SAR imagery, however, the network finds it challenging to extract features from the grayscale images that can be used for a classification task.
On the other hand, in order to extract the most discriminative ship representation features from SAR imagery, existing methods have carried out fruitful research on network architecture design [17,19,21,22,[29][30][31], attention mechanism embedding [23,25], feature fusion [23][24][25][26]29,33], etc. Ref. [17] verifies the superior performance of the ResNet architecture on the SAR ship classification task. Ref. [21] proposes a Siamese network equipped with a pre-processing module and a feature fusion module. Ref. [30] proposes a searched binary neural network (SBNN) that uses a neural architecture search technique to obtain the optimal CNN architecture for SAR ship classification. Ref. [29] proposes a group bilinear convolutional neural network (GBCNN) model to deeply extract discriminative representations of ship targets from pairwise vertical-horizontal polarization (VH) and vertical-vertical polarization (VV) SAR imagery. To fully explore the polarization information, a multi-polarization fusion loss (MPFL) is constructed to train the proposed model for superior SAR ship representation learning. Ref. [25] proposes HOG-ShipCLSNet, which integrates four mechanisms: a multi-scale classification mechanism, a global self-attention mechanism, a fully connected balance mechanism, and a HOG feature fusion mechanism. The work in [24] discusses how to effectively fuse traditional handcrafted features and deep convolutional features in a CNN and recommends performing feature fusion in the last fully connected layer.
Ref. [26] proposes a ship classification network (PFGFE-Net) with polarization coherence feature fusion and geometric feature embedding to utilize polarization information and traditional handcrafted features. Although these efforts improve the performance of SAR ship classification to some extent, they are usually based on more complex network architectures and higher-dimensional features, accompanied by more time-consuming storage expenses. Moreover, a small SAR dataset is not enough to effectively learn the many parameters of a complex CNN, which results in overfitting, and the ship features extracted by the CNN are highly redundant, which directly impairs the discriminative ability of the features. This situation becomes more serious as the network deepens.

Motivation
Through the analysis of SAR image characteristics and the CNN feature extraction mechanism, this study puts forward three hypotheses: (1) pre-training a CNN on a task-specific dataset (specifically, on an optical remote sensing ship dataset (ORS)) may be more effective than pre-training on a generic dataset; (2) a shallow CNN may be more suitable for SAR image feature extraction than a deep one; and (3) the deep features extracted by a CNN can be further refined to improve their discriminative ability.
The first hypothesis is motivated by the following consideration: compared with ImageNet, a generic dataset with tens of thousands of categories, the vast majority of which have completely different attributes from ships, the optical remote sensing ship dataset has the same target attributes as the SAR ship dataset, making it easier to train a transferable network and improve the subsequent fine-tuning performance. The second hypothesis is motivated by the characteristics of SAR imagery and the feature extraction mechanism of CNNs. The mature and widely used CNN classifiers, such as AlexNet [49], VGGNet [50], ResNet [51], and DenseNet [52], were initially designed for natural image classification tasks. Natural images have abundant color information, while SAR image pixels represent the reflected electromagnetic wave intensity of the object. Thus, there exist enormous domain differences between natural and SAR imagery. It is a well-known fact that SAR imagery contains far less information than natural images. While a CNN classifier extracts feature maps of input images and uses the last layer of feature vectors for object classification, different feature layers of a CNN have different spatial resolutions and semantic information. For example, the lower layers capture lower-level semantic features compared with the final layers. Due to the lack of information in the SAR image itself, deeper layers cannot extract more effective discriminative features. As a result, a shallow CNN may be more suitable for SAR image feature extraction. The third hypothesis is based on the consensus that the deep features extracted by a CNN (especially fused features) are highly redundant, and it is reasonable to believe that these features can be further compressed into lower dimensions without losing their discriminative ability, or even while further improving it.
To validate these hypotheses, we propose to learn a shallow CNN that is pre-trained on a task-specific dataset, i.e., an optical remote sensing ship dataset (ORS), instead of on the widely adopted ImageNet dataset. The proposed flowchart is illustrated in Figure 1b. Compared with the widely adopted method (Figure 1a), our study makes three improvements. Firstly, we pre-train the CNN model on a task-specific dataset, i.e., an optical remote sensing ship dataset (ORS), instead of the generic ImageNet dataset. Secondly, we explore the performance of shallow CNNs in SAR ship classification tasks. For comparison purposes, we designed 28 CNN architectures by changing the arrangement of the CNN components, the size of the convolutional filters, and the pooling formulations on the basis of the VGGNet models [50], and we present a thorough evaluation. Thirdly, in order to avoid overfitting, extract more discriminative deep features, and reduce feature redundancy, we propose to refine the deep features by selecting and reserving the active convolutional kernels on the last convolutional layer of a CNN according to a coefficient of variation (CoV) sorting criterion.
Extensive experiments not only prove that the above hypotheses are valid but also prove that the shallow network learned by the proposed pre-training strategy and feature refinement method can achieve ship classification performance in SAR imagery comparable to the state-of-the-art (SOTA) methods.

Contribution
The contribution of this study is threefold: (1) Through the analysis of SAR image characteristics and the CNN feature extraction mechanism, we put forward three hypotheses and designed an experimental flowchart to prove their validity.
(2) We introduce the ORS dataset as a pre-training dataset serving the SAR ship classification task. Compared with the widely used generic ImageNet dataset, ORS is smaller but more suitable for SAR ship classification scenarios. To the best of our knowledge, no existing studies have attempted this.
(3) We propose a novel feature refinement method by extracting active convolutional filters which have a high response for the purpose of reducing feature dimensions, avoiding over-fitting, and extracting more discriminative deep features in SAR ship imagery.

Organization
The remainder of this paper is organized as follows. Section 2 introduces the CNN architectures we designed for research purposes (Section 2.1) and the details of the proposed feature refinement method (Section 2.2). In Section 3, we introduce the datasets (Section 3.1) used to conduct our experiments (Section 3.2) and the common experimental protocol (Section 3.3). Then, we analyze and discuss the experimental results in Section 4: the effectiveness of using a task-specific dataset to pre-train CNNs (Section 4.1), the feasibility of feature refinement (Section 4.2), the overall performance of the three proposed improvements (Section 4.3), and the comparison with the state-of-the-art methods (Section 4.4) are introduced sequentially. Finally, we conclude this study in Section 5.

CNN Architecture Design
A CNN can learn multi-level representations of ships from SAR imagery. A suitable CNN is crucial to the ship classification task, as it can improve prediction accuracy and reduce prediction error. So far, many CNN architectures, such as AlexNet [49], VGGNet [50], ResNet [51], and DenseNet [52], have been successfully applied to SAR ship classification tasks. A typical CNN architecture is generally composed of alternating convolution and pooling layers followed by one or more fully connected layers at the end. Furthermore, different regulatory units, such as batch normalization and dropout, are also incorporated to optimize CNN performance [53]. The mechanism and role of these components are briefly described as follows:

• Convolutional layer (C). The convolutional layer is composed of a set of convolutional kernels, where each neuron acts as a kernel. A convolutional kernel works by dividing the image into small slices, commonly known as receptive fields. The kernel convolves with the image using a specific set of weights, multiplying its elements with the corresponding elements of the receptive field. The convolution operation can be expressed as

f_l^k(p, q) = Σ_{x,y} I_c(x, y) · e_l^k(x, y),  (1)

where I_c(x, y) represents an element of the input image located at position (x, y), which is element-wise multiplied by the weight e_l^k(x, y) of the k-th convolutional kernel of the l-th layer. The output feature map of the k-th convolutional operation can be expressed as

F_l^k = [f_l^k(1, 1), ..., f_l^k(p, q), ..., f_l^k(P, Q)],  (2)

where p, q denote the row and column position in the feature matrix, and P, Q denote the total number of rows and columns of the feature matrix, respectively.

• Pooling layer (P). Pooling, or down-sampling, is a local operation that sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region.
Equation (3) shows the pooling operation:

Z_l^k = g_p(F_l^k),  (3)

where Z_l^k represents the pooled feature map of the input feature map F_l^k, and g_p(·) defines the type of pooling operation. Different pooling formulations, such as max, average, overlapping, and spatial pyramid pooling, are used in CNNs. Boureau et al. [54] performed both a theoretical comparison and an experimental validation of max and average pooling and proved that when the clutter is homogeneous and has low variance across images, average pooling performs well and is robust to intrinsic variability.

• Activation function. The activation function g_a(·) serves as a decision function and helps in learning intricate patterns. For a convolved feature map, the activation is defined as T_l^k = g_a(F_l^k), where T_l^k is the transformed feature map. Different activation functions, such as sigmoid, tanh, maxout, ReLU, and its variants (e.g., leaky ReLU), are used to introduce non-linear combinations of features. In real applications, ReLU and its variants are preferred, as they help overcome the vanishing gradient problem [49].
• Batch normalization. Batch normalization is used to address the internal covariance shift within feature maps, which slows down convergence. Furthermore, it smooths the flow of gradients and acts as a regularizing factor, thus helping to improve the generalization of the network.

• Dropout. Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. This random dropping of connections or units produces several thinned network architectures, and finally, one representative network with small weights is selected.

• Fully connected layer (FC). The fully connected layer is mostly used at the end of the network for classification. It forms a non-linear combination of the selected features, which is used for the classification of the data.

• Softmax layer. In probability theory, the output of the softmax function can be used to represent a categorical distribution. The softmax function is used as the output layer of CNN networks, and it can be expressed as

p(x_c) = e^{x_c} / Σ_{j=1}^{C} e^{x_j},

where x_c is the output for the c-th class, C is the number of classes, and p(x_c) is the probability of the c-th class.
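As a concrete illustration of the pooling and softmax operations defined above, the following NumPy sketch implements a 2 × 2 pooling g_p(·) (max or average) and the softmax output layer. It is an illustrative re-implementation of the textbook formulas, not code from the paper:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Down-sample a feature map over non-overlapping 2 x 2 windows (g_p)."""
    h, w = fmap.shape
    fmap = fmap[: h - h % 2, : w - w % 2]  # trim odd edges so windows tile evenly
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

def softmax(x):
    """Softmax output layer: p(x_c) = exp(x_c) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [0., 0., 1., 1.],
               [0., 4., 1., 1.]])
print(pool2x2(fm, "max"))   # [[4. 8.] [4. 1.]]
print(pool2x2(fm, "avg"))   # [[2.5 6.5] [1.  1. ]]
print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities summing to 1
```

Note how average pooling retains the low-variance background response that max pooling discards, which is the behavior [54] found favorable for homogeneous clutter.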
A recent survey found that among the various techniques for improving the performance of CNNs, such as the use of different activation and loss functions, parameter optimization, regularization, and architectural innovations, the most significant improvement in the representational capacity of deep CNNs has been achieved through architectural innovations [53]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus achieving enhanced performance.
This study investigates the impact of different CNN architectures, especially shallow networks, on SAR ship classification. To this end, we built 28 CNNs of different architectures based on VGGNet [50], which is widely regarded as a milestone design template for follow-up networks, by changing the network depth (i.e., the connection configuration of the convolutional and pooling layers), the convolutional kernels, and the pooling formulations. The detailed architectures of the designed CNNs and the corresponding parameter sizes are listed in Table 1. This table illustrates 28 stacks of convolutional layers (C) and pooling layers (P) with different depths in different arrangements. The fully connected layer (FC) and softmax layer following these stacks are not illustrated in this table. Specifically, in this study, we utilize three fully connected layers: the first two have 4096 channels each, and the third performs SAR ship classification and thus contains C channels, one for each ship class to be classified. The final layer is the softmax layer. As for the other basic components, i.e., the activation function (ReLU), batch normalization, and dropout, we utilize the same functions for all 28 CNNs. Among these 28 CNNs, we can find the standard VGGNets and their counterparts, i.e., CNN-XIV (VGG-8) vs. CNN-XIII, CNN-XIX (VGG-11) vs. CNN-XVIII, CNN-XXI (VGG-13) vs. CNN-XX, CNN-XXIII (VGG-16) vs. CNN-XXII, and CNN-XXV (VGG-19) vs. CNN-XXIV. The difference between the standard VGGNets and their counterparts is that we replace max pooling (P_m) with average pooling (P_a), motivated by the study of [54]. The size of the convolutional kernels is 7 × 7 in CNN-VIII and CNN-XVII and 5 × 5 in CNN-XV and CNN-XXVIII. These models are sorted according to the number of parameters they need to learn; the minimum is 0.37M for CNN-I, and the maximum is 25.61M for CNN-XXVIII.
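To see how the design choices above (depth, channel widths, kernel size) drive the parameter counts in Table 1, the cost of a convolutional stack can be computed directly. The channel widths below are hypothetical, chosen only to illustrate the scaling; they are not the paper's exact configurations:

```python
def conv_params(k, c_in, c_out):
    # A k x k kernel has k*k weights per (input, output) channel pair,
    # plus one bias per output filter.
    return k * k * c_in * c_out + c_out

# Hypothetical two-layer stack: 1 input channel -> 32 -> 64 filters, 3x3 kernels.
total_3x3 = conv_params(3, 1, 32) + conv_params(3, 32, 64)
print(total_3x3)   # 18816 parameters

# The same stack with 7x7 kernels (as in CNN-VIII/CNN-XVII) costs over 5x more.
total_7x7 = conv_params(7, 1, 32) + conv_params(7, 32, 64)
print(total_7x7)   # 102016 parameters
```

This quadratic growth in kernel size, together with the multiplicative growth in channel widths with depth, explains the wide spread between the 0.37M and 25.61M extremes reported in Table 1.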

Feature Refinement
Existing research shows that a convolutional filter produces a high response when it fires on an input [55], and that there are active convolutional kernels and attribute-centric nodes in a CNN [55,56]. This motivates us to perform feature refinement by extracting active convolutional filters with high responses, in order to reduce feature dimensions, avoid overfitting, and extract more discriminative deep features from SAR ship images. In our implementation, we perform active convolutional kernel selection on the last convolutional layer of the CNN.
To obtain the active convolutional filters from the last convolutional layer, we compute the mean (E) and standard deviation (S) of each convolutional filter's output on a 2 × 2 window. Usually, ships are very similar within a class, and a convolutional filter always has a high mean value and a low standard deviation when it "fires" on a certain ship class. Hence, we employ the coefficient of variation (COV) to determine the active convolutional kernels and select more discriminative features; the COV of the k-th filter in the last convolutional layer for class c is computed from E_c^k and S_c^k. Based on experience from pre-experiments, we select the first 50 convolutional kernels with the largest COV value for each class and extract the refined deep features from these selected kernels. The pseudo-code is described in Algorithm 1.
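The selection step can be sketched as follows. Since an active filter is described as having a high mean and a low standard deviation, the sketch scores each filter by the ratio of its per-class mean response to its standard deviation; this scoring rule and the array shapes are assumptions for illustration, and the random responses are stand-ins for real last-layer activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in responses: n_c samples of one ship class through K last-layer filters.
n_c, K = 40, 128
responses = rng.random((n_c, K)) + 0.1     # positive filter responses

E = responses.mean(axis=0)                 # E_c^k: per-filter mean for class c
S = responses.std(axis=0) + 1e-8           # S_c^k: per-filter std (eps avoids /0)
activity = E / S                           # high mean + low spread => "active"

top50 = np.argsort(activity)[::-1][:50]    # keep the 50 most active filters
refined = responses[:, top50]              # refined deep features for class c
print(refined.shape)                       # (40, 50)
```

The refined feature matrix keeps only the 50 selected filter responses per sample, which is the dimension reduction the method relies on.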

Dataset and Data Pre-Processing
In our experiments, we utilized the generic ImageNet dataset [37] and the task-specific ORS dataset [45] to pre-train the CNNs, respectively. ImageNet is one of the most famous and widely used datasets for pre-training CNNs; it contains more than 20,000 categories with roughly 500-1000 images per category [37]. In this study, we used ImageNet as the benchmark against which to validate the performance of pre-training on the ORS dataset [45]. The ORS dataset was released by [45] to provide a reliable source domain with a large number of labeled ship samples for transfer learning studies. The dataset has eight ship classes: cargo, container, oil tanker, bulk carrier, car carrier, chemical tanker, dredge, and tug. All the ship slices were segmented from Google Earth imagery with sub-meter resolution, as shown in Figure 2. On the other hand, we utilized two SAR ship datasets to fine-tune the pre-trained CNNs and test the final performance of the proposed method. The first SAR dataset (SD1) was collected by [3] from six stripmap-mode VV-polarization TerraSAR-X images with 2.0 m × 1.5 m resolution in the azimuth and range directions, respectively. The dataset contains three ship categories, i.e., carrier, container, and oil tanker, with 50 ship samples per category. The second SAR dataset (SD2) is the FUSAR dataset, which was collected by [35] from high-resolution Gaofen-3 imagery. In this study, we chose a subset containing four common ship classes, i.e., cargo, bulk carrier, container, and oil tanker, from the original dataset, and kept 50 ship samples per class to conduct our experiments. Several typical samples from the ORS, SD1, and SD2 datasets are shown in Figure 3.
Considering that pre-training a CNN requires a large number of samples, we used a data augmentation scheme involving scaling, flipping, and rotating to enrich the samples in the ORS dataset to 5000 per class. The two SAR datasets were enriched to 100 samples per class. A total of 80 samples were randomly selected from each class to build a training set for fine-tuning the CNNs pre-trained on ImageNet/ORS, and the remaining 20 samples per class were used to validate and compare the performance of the various CNNs and methods. To obtain reliable evaluation results, we repeated each experiment ten times and report the averaged classification accuracy.
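The flip/rotate part of the augmentation scheme can be sketched as below; scaling is omitted, and the small integer chip is a stand-in for a real SAR ship slice:

```python
import numpy as np

def augment(img):
    """Return the 8 rotation/flip views of a square image chip."""
    views = []
    for k in range(4):                # rotations by 0, 90, 180, 270 degrees
        rot = np.rot90(img, k)
        views.append(rot)
        views.append(np.fliplr(rot))  # horizontal flip of each rotation
    return views

chip = np.arange(16.0).reshape(4, 4)  # stand-in for a SAR ship chip
views = augment(chip)
print(len(views))                     # 8 geometric views per original sample
```

Combined with random scaling, such geometric transforms multiply each labeled sample into many training views without changing its class label.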

Experimental Content
As shown in Figure 1, this paper proposes three changes compared with the traditional method, i.e., (1) pre-training CNNs on the task-specific ORS dataset rather than on ImageNet, (2) replacing deep CNNs with shallow CNNs for SAR ship classification, and (3) adding a feature refinement operation between the last convolutional layer of the CNN and the FC layer to improve the discriminative ability of the features. To validate the feasibility of these ideas and the corresponding changes, we conducted four experiments for comparison.
• E1: CNNs were pre-trained on ImageNet and then fine-tuned on the SAR dataset without feature refinement.
• E2: CNNs were pre-trained on ImageNet and then fine-tuned on the SAR dataset with feature refinement.
• E3: CNNs were pre-trained on ORS and then fine-tuned on the SAR dataset without feature refinement.
• E4: CNNs were pre-trained on ORS and then fine-tuned on the SAR dataset with feature refinement.
Furthermore, to obtain an objective evaluation of the performance of the proposed method, we compare it with several SOTA methods: [17], which conducts ship classification based on a deep residual network; [18], which utilizes CNN embedding and metric learning; [24], which injects traditional handcrafted features into a deep CNN; and [48], which proposes a dual-branch network embedding an attention mechanism to conduct deep sub-domain adaptation. In our implementation, we reproduced these algorithms and conducted experiments on the SD1 and SD2 datasets. It should be noted that, for [17], we utilized ResNet50 as the backbone network. For [24], we injected naive geometric features (NGFs) [7] into the VGG16 network and performed feature fusion in the last FC layer by concatenation with feature normalization, as recommended by the authors. For [48], since it is a domain adaptation method, following the original literature, we utilized the ORS dataset as the source domain and SD1/SD2 as the target domain.

Experimental Protocol
In our experiments, for a fair comparison, the batch size was set to 64 and the momentum to 0.9 for all CNNs. The training was regularized by weight decay (the L2 penalty multiplier set to 5 × 10^-3) and by dropout regularization for the first two fully connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10^-3 and then decreased by a factor of 10 when the validation error stopped decreasing within 10 epochs. The training was stopped when the validation error stopped decreasing within 30 epochs.
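The learning-rate schedule and stopping rule above can be expressed as a small controller. This is a plain-Python sketch of the protocol as described, not the authors' code; in Keras, roughly the same behavior is obtained with the ReduceLROnPlateau and EarlyStopping callbacks:

```python
def run_schedule(val_errors, lr=1e-3, lr_patience=10, stop_patience=30):
    """Divide lr by 10 after lr_patience epochs without improvement;
    stop after stop_patience epochs without improvement."""
    best_err, best_epoch = float("inf"), 0
    stalled = 0  # epochs since the last improvement or lr drop
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
            stalled = 0
        else:
            stalled += 1
            if stalled >= lr_patience:  # plateau: decay the learning rate
                lr /= 10.0
                stalled = 0
        if epoch - best_epoch >= stop_patience:
            break                       # early stopping
    return lr, epoch

final_lr, last_epoch = run_schedule([0.5] * 31)
print(final_lr, last_epoch)  # lr decayed three times; training stopped at epoch 30
```

With a flat validation curve, the controller decays the learning rate every 10 epochs and halts at epoch 30, matching the stated patience values.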
All the experiments were performed on a 64-bit Core i7-6800K (3.40 GHz) computer with 64 GB of RAM and one NVIDIA GTX 1080 Ti GPU (11 GB RAM) supported by Compute Unified Device Architecture (CUDA) 8.0, and the proposed method was implemented in Keras [57] with a TensorFlow [58] backend.

Results and Discussion
We conducted all four experiments on the two datasets and list the experimental results in Table 2. It should be noted that CNN-I suffers from serious overfitting, as its overly shallow network prevents convergence during training, so no test results are reported for it. The other CNN architectures are highly robust and achieve stable results. Analyzing the experimental results, we draw the following basic understanding. The classification accuracy on SD2 is generally lower than that on SD1, mainly because SD2 is a 4-class task, which is slightly more difficult than SD1's 3-class task. The best performance is 80.50%, obtained by CNN-XXIII (E1); the second is 79.75%, obtained by CNN-XIV (E3); and the third is 79.33%, obtained by CNN-XIX (E1) and CNN-XVI (E3).
In our view, the reason behind this experimental finding may be that ImageNet contains a large number of class-rich samples, so it is sufficient to pre-train a relatively deep network and then fine-tune it on the target dataset (even one unrelated to ImageNet) to obtain satisfactory performance. In contrast, because ORS has far fewer training samples than ImageNet, it is not sufficient to pre-train a relatively deep network well. However, because the task-specific ORS dataset is intrinsically linked to the target dataset (sharing more essential common attributes), the shallow network it trains can achieve performance comparable to a deeper network trained on ImageNet. In this sense, it is entirely feasible to train a shallow network with a smaller task-oriented pre-training dataset without losing performance relative to a deeper network trained on a larger dataset such as ImageNet.

About Feature Refinement
We explore the effectiveness of the proposed feature refinement by comparing E1 with E2 (and E3 with E4). The gap between E1 (E3) and E2 (E4) (after feature refinement) on the two datasets for all CNN architectures is illustrated in Figure 5. There are two main findings. The first is that the proposed feature refinement consistently helps the CNNs improve classification performance, with only three exceptions showing negative gain. This proves that the proposed feature refinement is feasible and effective. Secondly, a closer look at Figure 5 shows that feature refinement is more effective at boosting the CNNs pre-trained on the ORS dataset than those pre-trained on ImageNet. For example, on SD1, four networks receive a gain of more than 4.00% (the maximum is 4.67%, by CNN-XX), and six networks receive a boost of more than 3.00% (see the green bars in the figure). This does not occur for the CNNs pre-trained on ImageNet, none of which achieves a gain of more than 3.00% (the maximum is 2.34%, by CNN-XIX; see the blue bars). A similar situation occurs on SD2 (compare the yellow bars with the red ones). On average, the gains of E4 over E3 on SD1 and SD2 are 1.86% and 1.56%, which outperform those of E2 over E1 (1.22% and 1.05%) by 0.64% and 0.51%, respectively.
Feature refinement plays a role similar to a principal component analysis (PCA) operation; selecting the active convolutional kernels is equivalent to selecting the principal components of the features, so it significantly reduces the feature dimension while further improving the feature's discriminative ability. The effect of this process is illustrated by Figure 6. As shown in Figure 6a, the feature map of the last layer of the CNN (CNN-VII in this demo) before feature refinement has a lot of redundancy (refer to the all-black sub-plots in the map). After feature refinement, as shown in Figure 6b, the redundant features are largely eliminated, preserving only the dimensions with useful information.

About the Shallow Network Pre-Trained on Task-Specific ORS Dataset with Feature Refinement
Looking back at Table 2, it can be found that the best classification performance on the two datasets is 87.67% (E4) and 82.33% (E4), respectively, both obtained by CNN-VII, a shallow network pre-trained on the task-specific ORS dataset with feature refinement. In contrast, it is interesting and quite coincidental that the second-best performance on both datasets (86.33% (E2) and 81.75% (E2)) is obtained by the deep network CNN-XXIII pre-trained on the ImageNet dataset with feature refinement. This result proves that, for the specific task of ship classification in SAR imagery, the proposed scheme of "shallow network + task-specific ORS pre-training + feature refinement" is completely feasible and effective, and it is fully capable of performing this task and obtaining the best results. In addition to the performance advantages, the shallow network (CNN-VII) needs to learn far fewer parameters than the deep network (CNN-XXIII): 1.55M vs. 14.7M (refer to Table 1), i.e., only about one-tenth. Feature refinement improves the discriminative ability while also reducing the dimension of the deep features, and coupled with the use of the smaller task-oriented ORS training dataset, the training expense is greatly reduced and overfitting is almost eliminated.

Comparison with State-of-the-Art Methods
We compared the performance of the shallow network CNN-VII learned by the proposed method (E4) with several state-of-the-art (SOTA) methods. The results are listed in Table 3. They show that the shallow network learned by the proposed pre-training strategy and feature refinement method achieves ship classification performance in SAR imagery comparable to the SOTA methods. The performance of CNN-VII on both the SD1 and SD2 datasets is in second place, only slightly behind the method of [48] and ahead of the other methods [17,18,24].
Considering that [17,18] and [24] follow the same "pre-train + fine-tune" scheme as the proposed method, this indicates that the proposed method is very effective within this scheme. It is worth mentioning that the proposed method has far lower network complexity and training cost than the above methods, which makes it more valuable to popularize in practical applications. Ref. [48] is essentially a transfer learning method; it utilizes the ORS dataset as the source domain for knowledge transfer instead of using it as pre-training data.

Conclusions
To alleviate the bottleneck caused by insufficient SAR training data, which hinders further performance improvement of ship classification in SAR imagery, this study makes three improvements to the widely adopted "pre-train + fine-tune" scheme and proposes a novel scheme: a shallow network with feature refinement, pre-trained on a task-specific dataset. Extensive experiments have demonstrated the effectiveness of this scheme, and in-depth analysis has revealed the reasons behind the experimental outcomes.
This paper utilizes CNNs in the form of VGGNet as the backbone networks for research, but this does not prevent the proposed scheme from being generalized to other types of network architectures, since task-specific pre-training and feature refinement are independent of the network type. Newly emerging methods aimed at improving feature discriminative ability or network performance, such as attention mechanisms and network trimming, can also be used in conjunction with the proposed scheme.

Figure 1 .
Figure 1. Comparison between a widely adopted flowchart and the proposed flowchart. (a) A widely adopted pre-trained and fine-tuned flowchart. (b) Flowchart of the proposed method. Generally, a CNN is used to extract the deep features from the input images. Then, the deep features are refined further by fully connected layers (FC). Finally, the refined features are input into the softmax classifier for the final ship classification. Compared with the widely adopted flowchart, the proposed one explores the performance of the shallow CNN (S-CNN) on SAR ship classification. In addition, we add a feature refinement operation between CNN and FC, aiming to further reduce the redundancy and increase the discriminative ability of the deep features.
where c indicates the ship class (c = 0, 1, 2, ..., C), n_c denotes all the input data belonging to class c, f_{l_last}^k(p, q) denotes the response value of the k-th filter in the last convolutional layer, and E_c^k and S_c^k are the mean value and standard deviation of the k-th filter in the last convolutional layer for ship class c.

Figure 2 .
Figure 2. Ship slices in the ORS dataset, segmented from Google Earth imagery.

Figure 4 .
Figure 4. The curves of the four experimental results on (a) SD1 and (b) SD2. The ordinate represents the classification accuracy (%), and the abscissa represents the various CNN architectures.

Figure 5 .
Figure 5. Gain obtained by feature refinement. The ordinate represents the classification accuracy (%), and the abscissa represents the various CNN architectures.

Figure 6 .
Figure 6. Feature map (partial) of the last layer of the CNN (CNN-VII in this demo) (a) before and (b) after the feature refinement.
Algorithm 1. Deep feature refinement.

Table 3 .
Comparison with the SOTA methods on the two datasets. The classification accuracy (%) averaged over ten runs is reported.