SBNN: A Searched Binary Neural Network for SAR Ship Classification

Abstract: Synthetic aperture radar (SAR) ocean surveillance missions require low-latency, lightweight inference. This paper proposes a novel small-size Searched Binary Network (SBNN), built with network architecture search (NAS), for ship classification with SAR. In SBNN, convolution operations are modified by binarization: both input feature maps and weights are quantized to 1 bit in most of the convolution computation, which significantly decreases the overall computational complexity. In addition, we propose a patch shift processing, which adjusts feature maps with learnable parameters at the spatial level. This processing enhances performance by reducing the information irrelevant to the targets. Experimental results on the OpenSARShip dataset show that the proposed SBNN outperforms both binary neural networks from computer vision and CNN-based SAR ship classification methods. In particular, SBNN shows a great advantage in computational complexity.


Introduction
Synthetic aperture radar (SAR) has been widely applied in ocean surveillance because it works in all weather conditions and can detect targets over a wide range [1]. Busy international routes have increased the pressure on marine traffic control, and environmental protection requires closer monitoring of fishing vessels. Thus, timely ship classification is important for the efficient maintenance of ocean trade routes and the sustainable development of ocean resources [2,3]. Ship classification is also an important part of maritime domain awareness (MDA), which holds great value for national security.
Traditional SAR ship classification methods mainly rely on hand-crafted feature extraction followed by classifiers such as the support vector machine (SVM) [4], decision tree [5], random forest [6], Bayesian classifier [7] and AdaBoost [8]. However, when the input SAR images involve complex ship shapes, high sea states and different types of radars, these manually designed methods give unsatisfactory results, with decreased accuracy and weak robustness.
Following the remarkable achievements of deep learning in pattern recognition and image processing [9][10][11][12], deep convolutional neural networks (CNNs) have been applied to SAR ship classification.
A study of SAR image classification based on transfer learning reveals that SAR ship data require more layers to be re-trained [13]. The classification network, adapted from VGGNet [14], contains 13 convolution layers, 5 pooling layers and 1 linear layer. Experiments show that the fine-tuned VGGNet achieves good classification accuracy, but with high complexity and weak robustness.
A Dual-Polarization SAR classification network based on deep learning and a hybrid channel feature loss is proposed in [15]. Features in two different polarized channels are

4. A patch shift processing is proposed to enhance the learning of SBNN. SAR images usually have a poorer quality than natural optical images, which makes learning the feature distribution all the more necessary for a binary network. Existing binary network technology reshapes the feature distribution channel-wise without touching the spatial level. Recent research on spatial attention has shown that processing spatial information has a positive effect on performance. Hence, we divide feature maps into several patches, and each patch of each channel is adjusted by a learnable parameter. This processing enhances the performance of our binary network by reducing local noise and information irrelevant to the targets. Experimental results show that the proposed processing further increases the accuracy of SBNN.
The rest of this paper is organized as follows: Section 2 reviews the related work. In Section 3, SBNN is presented with details. The results of experiments and ablation studies are shown in Section 4. Section 5 concludes this paper.

Network Architecture Search
NAS is an emerging research field that has grown out of the many successes of neural networks. Its purpose is to find an automatic way to design network architectures for pre-defined goals and to extend the application scope of neural networks. According to how the search space is explored, NAS falls into three main categories: reinforcement learning-based searching, evolutionary-based searching and gradient-based searching.
Reinforcement learning-based searching is the most important approach in the history of NAS. A meta controller is trained to generate network architectures. Several ideas from reinforcement learning-based searching have had a great impact on NAS as a whole, including the other approaches. To reduce the searching cost, the concept of searching for a cell architecture was proposed [17], where the target network consists of stacked searched cells. Searching for cells costs much less than searching for a whole network. Then, weight sharing was proposed to significantly reduce the search time [18]: the trained weights of a searched network are kept after evaluation, so that the next searched networks need not be retrained from scratch.
Evolutionary algorithms make network architectures evolve toward a pre-defined target through genetic operations. Xie et al. encode a network architecture into a query with genes [19]. An improved evolutionary searching algorithm was proposed to effectively resist the noise in training [20].
To further improve the efficiency of searching, gradient-based searching was proposed. The first algorithm of this kind is differentiable architecture search (DARTS), which defines NAS as a bi-level optimization problem so that the target architecture can be found by gradient descent [21]. All candidate operations are trained together in a super-net. However, DARTS is still only affordable on workstations and servers. To bring NAS to personal computers, PC-DARTS combines DARTS with partial channel connections [22]. As a result, the hardware requirement is greatly reduced and the searching efficiency is higher.

Binary Network
A binary network is the most extreme form of network quantization, in which weights and feature maps are quantized to 1 bit. Hence, the computational complexity of a binary network is very low.
The pioneer of 1-bit neural networks is BNN, which is trained with end-to-end backpropagation [23]. Although BNN has an acceptable performance, floating-point networks are considerably more accurate. To improve the accuracy, XNOR-Net assigns a scaling factor to each channel [24]. Bi-Real-Net updates the prototype floating-point network and uses shortcuts to largely enhance the representational capability [25]. ReActNet then proposes several improvements, including new block architectures, a distributional loss and channel-wise reshaping [26]. Last, AdamBNN builds on ReActNet to improve the training of binary networks [27]. As a result, the gap between 1-bit networks and their floating-point counterparts has narrowed.

Spatial Information Processing
Spatial information processing is commonly applied to enhance learning capability in computer vision. Networks with spatial information processing have set new records on many challenging datasets, especially large ones. Recent studies show that spatial information processing can operate at the pixel level and at the patch level.
The most common way to process spatial information at the pixel level is to apply self-attention to each pixel of the feature maps. A non-local sensing module is proposed in [28], which can be integrated into most common vision network architectures. In this module, weights for each pixel are calculated and added to the original feature maps. To avoid a long input sequence for Transformer encoders, a local multi-head self-attention module is proposed, which slices the feature maps into windows to limit the input length [29]. However, those methods achieve their performance at the cost of a high computational complexity.
Recently, patch-level processing has developed quickly and become popular. This approach divides feature maps into several patches. Vision Transformer encodes 16 × 16 pixels into a patch and divides an image into 196 patches in total [30]. A sinusoidal signal is added to the patches, which are then processed by Transformer encoders. Compact Convolutional Transformer employs learnable parameters instead of fixed signals, which shows that spatial addition operators with learnable parameters have a positive effect [31].

Table 1 shows the proposed SBNN. The inputs of SBNN are normalized SAR images. For a lower computational complexity, binary convolution is widely used in the network. In detail, SBNN contains one floating-point Stem convolution layer, five binary convolutional cells and one floating-point classifier. As in common classification networks, the forward propagation has 3 stages to ensure a strong feature extraction capability: the size of the feature maps is reduced twice and the number of channels is doubled twice. The convolutional cells are of two types, Normal Cell and Reduction Cell, which are introduced in Section 3.2. In addition, we use a very small initial channel number in SBNN to further guarantee low computational complexity and high memory efficiency.

Searched Normal Cell and Reduction Cell
With the development of deep learning, the gap between binary networks and their real-valued counterparts is narrowing [25,26]. We construct the binary cells for SAR ship classification in three steps. First, efficient floating-point convolutional cells, Normal Cell and Reduction Cell, are searched by NAS. Then, the searched cells are modified and binarized. Finally, our proposed patch shift processing is inserted. We believe this method keeps the advantages of NAS, namely high accuracy and small size, while gaining a much lower computational complexity from binarization.
The architectures of Normal Cell and Reduction Cell in SBNN are shown in Figure 2. Each cell has eight convolutional operations. The two cell types differ mainly in the connections and types of these operations. As shown in Figure 2, Normal Cell contains three Binary Conv 5 × 5 operations, three Binary Conv 3 × 3 operations and two Binary Dilated Conv 5 × 5 operations, while Reduction Cell contains three Binary Reduction Conv 5 × 5 operations, two Binary Conv 3 × 3 operations, one Binary Reduction Dilated Conv 5 × 5 operation, one Binary Dilated Conv 5 × 5 operation and one Binary Dilated Conv 3 × 3 operation.
In each cell, linear layers first reshape the inputs. Then, the patch shift processing enhances the representation of the feature maps. Next, the binary convolutional operations extract and output deep features. The types and connections of those binary convolutional operations follow the searched floating-point cells.

Efficient Searching
Thanks to its high flexibility, NAS can break through the limitations of subjective understanding and imagination [17,18,32]. Nowadays, NAS is considered an automatic way to reduce the time and labor cost of designing network architectures. In this paper, the prototype floating-point network is searched by PC-DARTS, which has a low searching cost and modest hardware requirements [22]. Taking the information propagated from node $i$ to node $j$ as an example, the mixture operation $f_{i,j}(\cdot)$ in the searching stage can be formulated as:

$$f_{i,j}(x_i; S_{i,j}) = \sum_{o \in O} \frac{\exp(\alpha^{o}_{i,j})}{\sum_{o' \in O} \exp(\alpha^{o'}_{i,j})} \cdot o(S_{i,j} * x_i) + (1 - S_{i,j}) * x_i \tag{1}$$

where $O$ denotes a pre-defined space of operations, and $o(\cdot)$ is a fixed operation in $O$. $x_i$ is the output of node $i$. $\alpha^{o}_{i,j}$ is a learnable parameter for weighting the candidate operation $o(\cdot)$, and $S_{i,j}$ denotes the randomly sampled channels. Sampling channels directly reduces the searching cost and improves the memory efficiency. On the downside, the searching may become unstable. Edge normalization in PC-DARTS ensures the stability of searching. The computation of $x_j$, which denotes the output of node $j$, is:

$$x_j = \sum_{i<j} \frac{\exp(\beta_{i,j})}{\sum_{i'<j} \exp(\beta_{i',j})} \cdot f_{i,j}(x_i) \tag{2}$$

where $\beta_{i,j}$ is an edge normalization coefficient.

When searching on a natural optical image dataset, such as ImageNet, the candidate operations are usually 3 × 3 Conv, 5 × 5 Conv, 3 × 3 Dil_Conv, 5 × 5 Dil_Conv, Average Pooling, Max Pooling and Skip Connect [21]. Each 3 × 3 or 5 × 5 Conv operation contains two Depthwise Separable Convolution blocks. A Dil_Conv operation has only one Depthwise Separable Convolution block, in which the first convolution is dilated [33]. According to recent research, Max Pooling performs unsatisfactorily in binary networks [24].
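The mixture operation and the edge normalization described above can be illustrated with a minimal pure-Python sketch. It is not the PC-DARTS implementation: each channel is reduced to a single scalar, and the function names, the toy operations and the sampled channel indices are assumptions made for the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def partial_channel_mixed_op(x, ops, alphas, sampled):
    """Mix candidate operations on the sampled channels only.

    x       -- one value per channel (a scalar stands in for a feature map)
    ops     -- candidate operations; each maps a list of channels to a list
    alphas  -- one learnable architecture parameter per candidate operation
    sampled -- indices of the channels sent through the operations
    """
    weights = softmax(alphas)
    selected = [x[c] for c in sampled]
    mixed = [0.0] * len(selected)
    for weight, op in zip(weights, ops):
        for k, value in enumerate(op(selected)):
            mixed[k] += weight * value
    out = list(x)  # unsampled channels bypass the operations unchanged
    for k, c in enumerate(sampled):
        out[c] = mixed[k]
    return out


def edge_normalized_sum(edge_outputs, betas):
    """Combine the outputs of all incoming edges with softmax(beta) weights."""
    weights = softmax(betas)
    width = len(edge_outputs[0])
    return [
        sum(w * out[k] for w, out in zip(weights, edge_outputs))
        for k in range(width)
    ]


# Two toy operations (hypothetical) with equal architecture parameters.
double = lambda channels: [2.0 * v for v in channels]
negate = lambda channels: [-v for v in channels]
mixed = partial_channel_mixed_op(
    [1.0, 2.0, 3.0, 4.0], [double, negate], [0.0, 0.0], sampled=[0, 2]
)
```

Only channels 0 and 2 pass through the weighted candidates here; channels 1 and 3 are copied through, which is where the memory saving of partial channel connection comes from.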
On the SAR ship dataset, which is small and poses a difficult task, PC-DARTS assigns very large values to the weight-free operations, i.e., Average Pooling, Max Pooling and Skip Connect, which makes the searching unstable. Weight-free operations give more consistent outputs, which means they are more likely to be placed in the target network. To reach a better classification accuracy, we ignore the weight-free operations of the original PC-DARTS and consider only the weight-equipped operations, i.e., 3 × 3 Conv, 5 × 5 Conv, 3 × 3 Dil_Conv and 5 × 5 Dil_Conv, as candidate operations in searching.

Modification and Binarization
Floating-point networks give an excellent performance, but they place strict requirements on hardware. Although the computing cost of binary networks is small, it is difficult to search a binary network directly because of the weak learning capability of binary convolution. Recent research on modifying and binarizing floating-point networks in computer vision reveals that the accuracy gap between a floating-point network and a binary network sharing a similar architecture has narrowed. Hence, we modify and binarize the searched floating-point network by replacing the floating-point operations with binary operations, which have a lower computational complexity. In the proposed SBNN, a binary operation consists of one ReActNet Block [26] or two stacked ReActNet Blocks. The hyperparameters (e.g., Kernel Size, Dilation, etc.) of the first convolution in each ReActNet Block are set to those of the corresponding Depthwise Separable Convolution in the floating-point operation. The details of modification and binarization are shown in Table 2, and the architectures of ReActNet Blocks are shown in Figure 3.

In ReActNet Blocks, floating-point feature maps are transformed into binary feature maps by the ReAct Sign function. The traditional Sign function may harm the training because of an unsuitable truncation threshold. When the threshold is low, a large amount of background noise makes the binarized feature maps unclear. Conversely, a high threshold leaves too little useful information in the binarized feature maps. ReAct Sign reshapes the distribution with channel-wise parameters, which is a learnable way to find suitable thresholds for binarization:

$$x^{binary}_{c} = \begin{cases} +1, & x^{float}_{c} > p_c \\ -1, & x^{float}_{c} \le p_c \end{cases} \tag{3}$$

where the superscripts binary and float refer to binary and floating-point values, respectively, $p_c$ is a learnable parameter, and $c$ indicates the index of a channel.
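A minimal sketch of the channel-wise thresholding performed by ReAct Sign follows. The function name, the nested-list layout of the feature map and the threshold values are assumptions for illustration; in the network the thresholds are learned by backpropagation.

```python
def react_sign(feature_map, thresholds):
    """Binarize each channel against its own threshold p_c.

    feature_map -- nested lists indexed as [channel][row][column]
    thresholds  -- one threshold per channel (learnable in the real network)
    """
    return [
        [[1 if value > p else -1 for value in row] for row in channel]
        for channel, p in zip(feature_map, thresholds)
    ]


# The same floating-point values binarize differently under two thresholds.
features = [[[0.2, -0.1, 0.7]], [[0.2, -0.1, 0.7]]]
binary = react_sign(features, thresholds=[0.0, 0.5])
```

With the lower threshold the first channel keeps two positive responses, while the higher threshold in the second channel keeps only the strongest one, which mirrors the noise versus information trade-off described above.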
The core of binary convolutional networks is the binary cross-correlation between 1-bit inputs and 1-bit weights. Compared with the floating-point cross-correlation, the binary form is 64× more efficient in computation and can be described as:

$$x^{binary} * w^{binary} = \mathrm{popcount}(\mathrm{XNOR}(x^{binary}, w^{binary})) \tag{4}$$

where $x$ and $w$ denote inputs and weights, respectively. SBNN has 4 types of binary convolution with different hyper-parameters, which are illustrated in Figure 4.
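The XNOR and popcount formulation can be checked with a short sketch. The bit-packing convention (+1 stored as bit 1, -1 as bit 0) and the helper names are assumptions of this example. Note that popcount(XNOR) counts agreeing positions; the signed dot product over n values of ±1 follows as 2 × popcount − n.

```python
def pack_bits(signs):
    """Pack +1/-1 values into one integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for s in signs:
        word = (word << 1) | (1 if s > 0 else 0)
    return word


def xnor_popcount(x_word, w_word, n_bits):
    """Count the bit positions where the two packed words agree."""
    mask = (1 << n_bits) - 1
    agree = ~(x_word ^ w_word) & mask
    return bin(agree).count("1")


def binary_dot(x_signs, w_signs):
    """Signed dot product of two +1/-1 vectors via XNOR and popcount."""
    n = len(x_signs)
    matches = xnor_popcount(pack_bits(x_signs), pack_bits(w_signs), n)
    return 2 * matches - n


x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
reference = sum(a * b for a, b in zip(x, w))  # ordinary multiply-add result
```

On hardware, one 64-bit XNOR followed by a population count replaces 64 multiply-adds, which is the source of the 64× figure.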
Traditional binary convolutional networks perform poorly because of their low representational capability. The shortcuts in ReActNet Blocks largely increase the number of different values a pixel can take after binary convolution and give a representational capability close to that of floating-point convolution.
Feature maps from the shortcut and the Batch Normalization [34] are summed by an addition operator. Then, ReAct PReLU shifts the distribution and activates the summed feature maps. The formula of ReAct PReLU is:

$$f(x_c) = \begin{cases} x_c - \gamma_c + \xi_c, & x_c > \gamma_c \\ \beta_c (x_c - \gamma_c) + \xi_c, & x_c \le \gamma_c \end{cases} \tag{5}$$

where $\beta$, $\xi$ and $\gamma$ are trainable parameters, and $c$ indicates the index of a channel.
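As a sketch of the shifted activation, the function below follows the RPReLU definition of ReActNet [26] for a single value of one channel. The parameter names `gamma` (input shift), `xi` (output shift) and `beta` (negative-side slope), as well as the numeric values, are assumptions for illustration.

```python
def react_prelu(x, gamma, xi, beta):
    """Shift the input by gamma, apply a PReLU with slope beta, shift by xi."""
    shifted = x - gamma
    activated = shifted if shifted > 0 else beta * shifted
    return activated + xi


# Hypothetical per-channel parameters.
above = react_prelu(2.0, gamma=0.5, xi=0.1, beta=0.25)   # input above the shift
below = react_prelu(-1.5, gamma=0.5, xi=0.1, beta=0.25)  # input below the shift
```

Setting `gamma` and `xi` to zero recovers an ordinary PReLU, so the two shifts are what let each channel move its distribution before and after the activation.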

Patch Shift Processing
SAR ship images, which contain coherent speckle noise, have a poorer quality than natural optical images. Hence, it is more difficult for a binary classification network to learn the distribution of the feature maps. ReAct Sign and ReAct PReLU reshape the distribution channel-wise. However, those functions make no adjustment at the spatial level. Inspired by recent research on spatial attention in neural networks [30,31,35], we propose a spatial representation reinforcement method named patch shift processing, which is illustrated in Figure 5. First, the proposed patch shift processing divides the input feature maps into p × p patches. Then, a learnable parameter is assigned to each patch in each channel. Finally, the features are reinforced by addition operators. This processing is trained together with the other weights in the network through backpropagation. Important areas can be highlighted, and information irrelevant to the target is reduced in the feature maps.
In Figure 5, the sample that is directly binarized without the processing retains too little useful information to be learnt. In contrast, the processed image gives a high-quality 1-bit feature map with a clear ship shape after binarization. Considering the training difficulty and the computational complexity, we set p to 12 and place the proposed patch shift processing after the linear layers in each cell.
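The patch shift processing can be sketched as a per-patch, per-channel offset added to the feature map. The sketch below is an illustration only: it assumes that p × p means p patches along each spatial axis and that the height and width are multiples of p, and the function name and toy values are hypothetical.

```python
def patch_shift(feature_map, shifts):
    """Add one offset per patch and per channel.

    feature_map -- nested lists indexed as [channel][row][column]
    shifts      -- nested lists indexed as [channel][patch_row][patch_column],
                   learnable in the real network
    """
    shifted = []
    for channel, channel_shifts in zip(feature_map, shifts):
        height, width = len(channel), len(channel[0])
        p = len(channel_shifts)
        # Assumption: height and width are multiples of p, so patches tile evenly.
        patch_h, patch_w = height // p, width // p
        shifted.append([
            [
                channel[r][c] + channel_shifts[r // patch_h][c // patch_w]
                for c in range(width)
            ]
            for r in range(height)
        ])
    return shifted


# One 4 x 4 channel split into 2 x 2 patches with hypothetical offsets.
zeros = [[[0.0] * 4 for _ in range(4)]]
offsets = [[[1.0, 2.0], [3.0, 4.0]]]
result = patch_shift(zeros, offsets)
```

Because binarization follows, a negative offset on a background patch pushes its values below the sign threshold, while a positive offset on a target patch keeps weak responses alive.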

Experiments
Our experiments are run on OpenSARShip, an open-access SAR ship image dataset [36]. A personal computer with an Intel Core i5-11600 CPU, 16 GB of memory and a single RTX 3060 GPU is used in the experiments. The operating system is Ubuntu 20.04 LTS. SBNN is implemented in Python 3. The searching and training of the floating-point network, as well as the training of SBNN, are conducted with the PyTorch neural network framework [37]. The PyTorch version used is 1.9.1+cu111, where +cu111 means that CUDA toolkit 11.1 is required for GPU acceleration.

Dataset
OpenSARShip is a widely used dataset in SAR ship classification research and was released by Huang et al. in 2018. OpenSARShip contains 11,346 SAR ship images covering many categories of ships from the Sentinel-1 satellite. The single-look complex data, which contain both VV and VH polarization images with a high range resolution for targets, constitute the final datasets in our experiments. The numbers of samples in the different ship categories of OpenSARShip vary widely. Following previous research on SAR ship classification [16,38], we set two tasks to verify the effectiveness of the proposed SBNN: the three-category SAR ship classification and the six-category Dual-Polarization SAR ship classification.
The three-category SAR ship classification task includes bulk carrier, container ship and tanker, which account for approximately 80% of international routes [39]. To mitigate the imbalance among categories, we create the dataset in the same way as HOG-ShipCLSNet [16]. We take 70% of the size of the smallest of the three categories as the number of training targets for each category. The remaining targets form the test set. In addition, the VV and VH SAR images of the same target are treated as two independent samples. Table 3 shows the sample numbers for each category. The six-category Dual-Polarization SAR ship classification task includes bulk carrier, container ship, tanker, cargo, general cargo and fishing. We take 80% of the size of the smallest of the six categories as the number of training targets for each category. Similarly, the remaining targets form the test set. In the six-category task, each sample consists of both the VV and VH SAR images of the same target. Table 4 shows the sample numbers for each category in the six-category Dual-Polarization SAR ship classification. Some samples used in the experiments are shown in Figure 6.
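The balanced split rule can be sketched as follows. The category counts are invented for the example and are not the numbers of Tables 3 and 4; rounding down is also an assumption, since the text does not state how fractional counts are handled.

```python
def balanced_split(counts, train_percent):
    """Give every category the same number of training targets.

    counts        -- mapping from category name to number of available targets
    train_percent -- share of the smallest category used for training (e.g. 70)
    """
    # Assumption: fractional counts are rounded down.
    n_train = min(counts.values()) * train_percent // 100
    train = {name: n_train for name in counts}
    test = {name: total - n_train for name, total in counts.items()}
    return train, test


# Hypothetical target counts, for illustration only.
example_counts = {"bulk carrier": 300, "container ship": 500, "tanker": 200}
train_counts, test_counts = balanced_split(example_counts, train_percent=70)
```

The training set is balanced by construction, while the test set keeps the natural imbalance of the remaining targets.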

Training
To improve learning from a small dataset, we apply a data augmentation policy, consisting of random cropping and random horizontal flipping, to expand the diversity of the data. The inputs of SBNN share the same shape of 1 × 100 × 100, where 1 indicates the VV or VH polarization.
The OpenSARShip dataset provides uint8 files, which have been processed and generated from SAR products with the related automatic identification system information. Compared with prior studies that use PCA or a sub-network in data pre-processing, SBNN requires a much simpler pre-processing in which only center cropping and resizing are employed. Considering that the targets are surrounded by a large area of clear background, each sample is processed as follows.
In the three-category classification task, targets share a similar size. To reduce the distortion effect of resizing, targets larger than 128 × 128 are center-cropped to 112 × 112, and targets smaller than 112 × 112 are resized to 112 × 112.
In the six-category dual-polarization classification task, target sizes vary over a large range, which means the distortion effect of resizing is obvious and unavoidable. All targets are resized to 128 × 128 and then center-cropped to 112 × 112.
Notice that the data pre-processing produces a size of 112 × 112, which is slightly larger than the input of SBNN. This margin allows the data augmentation policy to be applied: random cropping produces data with a size of 100 × 100.
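The relation between the 112 × 112 pre-processed size and the 100 × 100 network input can be sketched with plain nested lists. The helper names and the toy image are assumptions; a real pipeline would more likely use the cropping and flipping transforms of an image library.

```python
import random


def center_crop(image, size):
    """Cut a size x size window from the middle of a 2D image."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def random_crop(image, size, rng=random):
    """Cut a size x size window at a random position (training only)."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


# Toy 112 x 112 image whose pixel value encodes its position.
image = [[r * 112 + c for c in range(112)] for r in range(112)]
train_input = random_crop(image, 100, rng=random.Random(0))
eval_input = center_crop(image, 100)
```

The 12-pixel margin gives random cropping 13 possible offsets along each axis during training, whereas center cropping always starts at offset 6.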
Following the training of binary networks in AdamBNN [27], we use a distillation learning method [40] to train the weights in SBNN. Binary networks have a low learning capability, which means that direct training with the cross-entropy loss and hard labels could result in poor performance. The soft labels generated by a floating-point network can help the learning of binary networks. When training SBNN, the floating-point network that acts as the teacher is the CNN searched by PC-DARTS, i.e., the prototype of SBNN. The initial channel number of the teacher is 8 and its cell number is 8. SBNN is the student and is trained with the KL divergence loss, which can be computed as:

$$\mathcal{L}_{KL} = -\frac{1}{b} \sum_{class} \sum_{i=1}^{b} P^{float}_{class}(X_i) \log \frac{P^{binary}_{class}(X_i)}{P^{float}_{class}(X_i)} \tag{6}$$

where $b$ is the batch size, $class$ indicates the category of a sample, $X_i$ is the $i$-th sample in the batch, and $P$ is the SoftMax output of the floating-point network or the binary network.
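A minimal sketch of the distillation objective follows: the batch-averaged KL divergence from the teacher's SoftMax output to the student's. It is a pure-Python illustration with hypothetical logits, not the training code; the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl_loss(teacher_logits, student_logits):
    """Batch-averaged KL(teacher || student) over class probabilities.

    Both arguments are lists of per-sample logit lists of equal shape.
    """
    total = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        p_teacher = softmax(teacher_row)
        p_student = softmax(student_row)
        total += sum(
            pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student)
        )
    return total / len(teacher_logits)


# Hypothetical logits: the teacher is undecided, the student favors class 0.
loss = distillation_kl_loss([[0.0, 0.0]], [[math.log(3.0), 0.0]])
```

The loss is zero only when the student reproduces the teacher's class probabilities exactly, so the student is pushed toward the full soft label rather than just the arg-max class.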
In this paper, the Adam optimizer [41] updates the weights in SBNN. Specifically, the weight decay for the convolution and linear layers is set to 0. The weight decay for other learnable parameters, such as the weights in Batch Normalization, is set to $1 \times 10^{-5}$. Training has two stages. Each stage has 256 epochs, a batch size of 32 and a base learning rate of 0.001. We apply a linear scheduler to change the learning rate during training; the learning rate $lr_{ep}$ for each epoch can be calculated as:

$$lr_{ep} = lr_{base} \times \left(1 - \frac{ep}{256}\right) \tag{7}$$

where $lr_{base} = 0.001$ is the base learning rate and $ep$ denotes the number of trained epochs.
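The linear scheduler can be sketched in a few lines. The sketch assumes that the learning rate decays linearly from the base value to zero over the 256 epochs of a stage, which is the usual convention for binary network training but is an assumption here; the function name is hypothetical.

```python
def linear_lr(epoch, base_lr=0.001, total_epochs=256):
    """Learning rate after `epoch` trained epochs under linear decay to zero."""
    return base_lr * (1.0 - epoch / total_epochs)


# Learning rate at the start, the middle and the end of one stage.
schedule = [linear_lr(epoch) for epoch in (0, 128, 256)]
```

Since both training stages use the same policy, the schedule would restart from the base learning rate at the beginning of the second stage.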
In the first stage, all binary convolutional operators are replaced by floating-point convolutional operators, and the network learns the distribution of the feature maps. In the second stage, we load the weights from the first stage and replace the floating-point convolutional operators with binary convolutional operators. Then, the network is trained with the same policy as in the first stage.

Inference
During inference, center cropping replaces random cropping and generates appropriate inputs for SBNN. In the three-category SAR ship classification task, the category of the neuron that produces the highest probability is the prediction for a sample. In the six-category Dual-Polarization SAR ship classification, we use a simple decision fusion, since an input contains two images. In detail, the VV and VH SAR images are successively fed into SBNN. If the two polarizations of the same target give different results, the result with the higher probability is the final prediction. The additional computation of this decision fusion is negligible, and the network can be trained in the same way as in the single-polarization task.

Table 5 lists the results of SBNN and four other floating-point CNN-based SAR ship classification methods in the three-category task. The methods for comparison are: fine-tuned VGG [13], plain CNN [42], group squeeze excitation sparsely connected CNN (GSESCNNs) [43] and the combination of HOG, PCA and deep learning (HOG-ShipCLSNet) [16]. Notice that SBNN and the above methods, except HOG-ShipCLSNet, do not require additional computation outside the networks. Table 6 lists the results of SBNN with fusion and three other floating-point CNN-based Dual-Polarization SAR ship classification methods in the six-category task. The counterparts are: VGG with hybrid channel feature loss [15], the mini hourglass region extraction and dual-channel efficient fusion network [44] and the squeeze-and-excitation Laplacian pyramid network (SE-LPN-DPFF) [38].
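The decision fusion used at inference for dual-polarization inputs can be sketched as below. The function name and the probability vectors are hypothetical, and breaking ties in favor of the VV result is an assumption, since the text does not cover equal probabilities.

```python
def fuse_predictions(probs_vv, probs_vh):
    """Return the class index backed by the more confident polarization.

    probs_vv, probs_vh -- SoftMax outputs of the same network for the VV and
                          VH images of one target
    """
    best_vv = max(range(len(probs_vv)), key=probs_vv.__getitem__)
    best_vh = max(range(len(probs_vh)), key=probs_vh.__getitem__)
    # Assumption: ties go to the VV result.
    if probs_vv[best_vv] >= probs_vh[best_vh]:
        return best_vv
    return best_vh


# Hypothetical outputs where the two polarizations disagree.
fused_a = fuse_predictions([0.1, 0.7, 0.2], [0.6, 0.3, 0.1])
fused_b = fuse_predictions([0.4, 0.35, 0.25], [0.1, 0.1, 0.8])
```

When both polarizations agree, the function simply returns the shared class, so the fusion adds only a comparison to the two forward passes.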

Comparison with CNN-Based SAR Ship Classification Methods
As shown in Tables 5 and 6, our proposed SBNN outperforms all other methods and reaches the best accuracy: 80.03% and 56.73% for the two tasks, respectively. On the hardware implementation side, SBNN also has great advantages. The number of weights in SBNN is only 0.37 M. Moreover, the computational complexity of SBNN is very small, with 4.67 M MAdds, which is almost 19× lower than that of HOG-ShipCLSNet.
The confusion matrices of SBNN in the three-category task and the six-category task are shown in Tables 7 and 8, respectively. In the three-category task, SBNN has a good accuracy for each category. In the six-category task, most samples of bulk carrier, cargo, container ship and fishing are classified correctly.

Table 9 lists the results of SBNN and three other modern computer vision binary networks in the three-category task. The compared binary networks are Bi-RealNet [25], ReActNet [26] and AdamBNN [27]. In addition to the number of weights, the binary MAdds and the floating-point MAdds are counted separately. According to [25,26], the total number of MAdds is the number of floating-point MAdds plus 1/64 of the number of binary MAdds. From Table 9, SBNN has the best accuracy and the lowest number of MAdds, close to 30% of that of the other binary networks. Meanwhile, SBNN is much smaller than the other networks listed, which benefits implementation on tiny mobile devices.

We conduct an ablation experiment to verify the positive effect of deleting weight-free operations from the candidate operation list. We compare the original candidate operations with only the weight-equipped operations, i.e., 3 × 3 Conv, 5 × 5 Conv, 3 × 3 Dil_Conv and 5 × 5 Dil_Conv. To obtain fair results, the remaining configurations of searching, modifying, scaling and training are kept the same. Table 10 shows the results in the three-category task. The accuracies of the searched CNN and the corresponding binary CNN are improved by 2.4% and 1.5%, respectively, which means that our proposed search policy of deleting weight-free operations gives a better target network for SAR ship classification.
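The MAdds accounting used in the comparison with binary networks, where one binary multiply-add counts as 1/64 of a floating-point one [25,26], amounts to the small calculation below. The operation counts are made-up numbers for illustration and do not come from Table 9.

```python
def total_madds(float_madds, binary_madds):
    """Total MAdds with each binary operation weighted by 1/64."""
    return float_madds + binary_madds / 64


# Hypothetical network: 2 M floating-point and 128 M binary multiply-adds.
example_total = total_madds(float_madds=2.0e6, binary_madds=128.0e6)
```

Under this convention the floating-point stem and classifier can dominate the total even when almost all operations in the network are binary.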

Patch Shift Processing
To verify the enhancement from the proposed patch shift processing, we construct a new network by removing the patch shift processing from SBNN. The new network is trained in the same way as SBNN in the three-category task. The results with and without the patch shift processing are listed in Table 11. We find that the patch shift processing gives an improvement of nearly 1%. As a result, SBNN has an accuracy over 80%, which outperforms several full floating-point networks designed for SAR ship classification.

Conclusions
In this paper, a Searched Binary Network, SBNN, for SAR ship classification is proposed. Experimental results show that our network achieves good results in different tasks with strong robustness. The proposed SBNN outperforms other networks in model size and computational complexity. In particular, the number of MAdds in SBNN is very small, which means SBNN has strong potential for implementation on tiny mobile devices.