High-Resolution SAR Image Classification Using Multi-Scale Deep Feature Fusion and Covariance Pooling Manifold Network

The classification of high-resolution (HR) synthetic aperture radar (SAR) images is of great importance for SAR scene interpretation and application. However, the presence of intricate spatial structural patterns and the complex statistical nature of SAR data make SAR image classification a challenging task, especially in the case of limited labeled SAR data. This paper proposes a novel HR SAR image classification method using a multi-scale deep feature fusion network and covariance pooling manifold network (MFFN-CPMN). MFFN-CPMN combines the advantages of local spatial features and global statistical properties and considers the multi-feature information fusion of SAR images in representation learning. First, we propose a Gabor-filtering-based multi-scale feature fusion network (MFFN) to capture the spatial patterns and obtain discriminative features of SAR images. The MFFN is a deep convolutional neural network (CNN). To make full use of a large amount of unlabeled data, the weights of each layer of MFFN are optimized by an unsupervised denoising dual-sparse encoder. Moreover, the feature fusion strategy in MFFN can effectively exploit the complementary information between different levels and different scales. Second, we utilize a covariance pooling manifold network to further extract the global second-order statistics of SAR images over the fused feature maps. Finally, the obtained covariance descriptor is more distinctive for various land covers. Experimental results on four HR SAR images demonstrate the effectiveness of the proposed method, which achieves promising results compared with other related algorithms.


Introduction
Synthetic aperture radar (SAR) is an all-weather, all-day active microwave imaging system. Due to these special capabilities, the SAR system has become a very significant and powerful source of information for various fields, such as land-cover mapping, disaster monitoring, and urban planning [1]. Classifying and interpreting the information provided by SAR images is usually recognized as a prerequisite step for these applications. In recent years, the new generation of space- or airborne SAR sensors has been able to acquire large amounts of high-resolution (HR) SAR images [2]. These data provide sufficient spatial-context information for SAR scene understanding and interpretation. Nevertheless, HR SAR image classification still faces the following two challenges:

1.
Intricate spatial structural patterns: Due to the coherent imaging mechanism and object shadow occlusion, pixels of the same object present a high degree of variability, known as speckle [3]. Moreover, HR SAR images contain more strong scattering points, and the arrangements of the numerous and varied objects become more complicated. In this context, HR SAR images exhibit large intra-class variation and small inter-class differences between objects [4]. As shown in Figure 1a,b, we give two low-density residential areas from the same category, as well as two different categories, open land and water areas. Therefore, extracting more discriminative and precise spatial features for HR SAR image classification is still a highly challenging task.

2.
Complex statistical nature: The unique statistical characteristics of SAR data are also crucial for SAR image modeling and classification. In HR SAR images, the number of elementary scatterers present in a single resolution cell is reduced. Traditional statistical models for low- and medium-resolution SAR, such as Gamma [5], K [6], and Lognormal [7], struggle to provide a good approximation of the distribution of HR SAR data. Meanwhile, accurately modeling HR SAR data with statistical models may require designing and solving more complex parameter estimation equations. Hence, it is also a challenge to effectively capture the statistical properties contained in a SAR image to enhance the discrimination of land-cover representations.

Related Work
The mainstream methods for SAR image classification can be roughly categorized as hand-crafted feature-based methods, statistical analysis-based methods, and deep learning methods. We briefly review these related works and then discuss the inspiration from these methods.
In recent years, many hand-crafted feature descriptors have been proposed to characterize the content of SAR images, such as the multilevel local pattern histogram (MLPH) feature [8], ratio-detector-based features [9], and contextual descriptors [10]. These features exhibit better performance than GLCM [11] and Gabor features [12] in HR SAR image classification. In addition, Tombak et al. [13] investigated the use of the recently developed feature attribute profiles (FPs) for the feature extraction of SAR images. Song et al. [14] employed histogram of oriented gradients (HOG)-like features to effectively capture the main structures of targets in speckled SAR images. Guan et al. [15] used the covariance descriptor of textural features and made the feature descriptor more distinguishable for various SAR land covers. Generally, the above features are fed into a classifier such as Softmax [16] or the support vector machine (SVM) [17] for classification. To some extent, hand-crafted feature-based methods have excellent low-level feature representation capabilities for SAR images and can perform reasonably well for some specific categories with minimal amounts of training data. However, HR SAR images contain more complex structural and geometrical information, so hand-crafted feature-based methods need further improvements in robustness and generalization performance. Therefore, more abstract and discriminative features need to be extracted from the above low-level features for complex HR SAR classification tasks.
Due to the unique characteristics of coherent speckle, the distribution of pixel values within SAR images provides much valuable information, and statistically modeling the terrain distributions is an effective tool for SAR image analysis. There are already some traditional non-Gaussian models that describe the distribution characteristics of SAR images, such as the Fisher [18], generalized Gamma [19], Nakagami-Gamma [20], and heavy-tailed Rayleigh [21] distributions. To fully capture the complex content of HR SAR images, some new statistical models, such as the scale mixture of Gaussians (SMoG) [22], the generalized Gamma hybrid model [23], the lognormal mixture model [24], the beta generalized normal distribution [25], and the complex generalized Gaussian model [26], have been proposed for statistical analysis. Frery et al. [27] proposed a generalized statistical framework for HR SAR images. Generally, these models are then used within a Bayesian inference framework, such as a Markov random field [28], to realize classification. However, these statistical models generally rely on strict assumptions or are effective only for specific scenarios. Meanwhile, parameter estimation is also very important for the accurate modeling of HR SAR data. In addition, these models operate on pixel values and do not establish effective relationships with high-level features. We find that the essence of a statistical model is to capture the high-order statistics of SAR images for data representation. Therefore, the above analysis inspires us to capture statistics from the high-level features of SAR images, which may yield more efficient and discriminative feature representations.
Deep neural networks (DNNs) [29] are capable of learning high-level features of images hierarchically, and many studies have verified their powerful ability to discover significant features and semantic relationships in SAR image classification. Geng et al. [30,31] proposed a deep supervised and contractive neural network (DSCNN) for SAR feature learning. Zhao et al. [32] proposed a discriminant deep belief network (DisDBN) for HR SAR image classification. Ding et al. [33] investigated the capability of convolutional neural networks (CNNs) combined with data augmentation operations in SAR target recognition. Chen et al. [34] proposed an all-convolutional network (A-ConvNet) for SAR target recognition, which consists of only sparsely connected layers to prevent over-fitting. Li et al. [35] applied a CNN to very-high-resolution SAR image classification. However, the above-mentioned learning methods require a large amount of labeled data to obtain satisfactory results. In actual application scenarios, manually annotating SAR data is labor-intensive and time-consuming. Considering the scarcity of labeled SAR data, many schemes, such as domain adaptation [36], transfer learning [37], GANs [38], and unsupervised feature learning [39], have been proposed to solve the SAR image classification problem. Sparse unsupervised feature learning has a relatively simple structure and is a feasible way to relieve the need for labeled samples. Recently, a new unsupervised feature learning method [40] based on the dual-sparse encoder was proposed. This method optimizes a cost function driven by natural rules and performs hierarchical unsupervised learning on a CNN. However, [40] does not adequately consider the influence of coherent speckle in SAR images, and the complementarity of features between different levels is not fully utilized. Therefore, it is necessary to construct a CNN model for extracting high-level features from HR SAR images.
Such a model should make full use of a large amount of unlabeled data for feature learning and take into account the complementarity of features between different levels, so as to realize discriminative feature extraction for SAR objects.

Motivations and Contributions
Based on an overall consideration, this paper aims to combine the advantages of statistical analysis and representation learning to realize pixel-based classification of HR SAR images with a resolution equal to or even finer than 1 m. First, some previous CNN models [34,35] only use the features of the last convolutional layer for SAR image classification, without fully considering the information obtained by the additional layers. Second, to capture statistics from high-level features of SAR images, Liu et al. [41] proposed a statistical CNN (SCNN) for land-cover classification from SAR images, which characterizes the distributions of CNN features by first- and second-order statistics (mean and variance). However, the variance only considers the statistical properties of independent feature maps and does not establish interactions between the feature maps. As a second-order statistic, covariance provides a more robust representation than the mean and variance [42]. He et al. [43] proposed a method that combines multi-layer CNN feature maps and covariance pooling for optical remote sensing scene classification. Ni et al. [44] proposed a multimodal bilinear fusion network, which used the covariance matrix to fuse optical and SAR features for land-cover classification. Generally, the above methods map the covariance directly to Euclidean space through the matrix logarithm operation for classification [45]. However, they do not further extend the covariance matrix into a deep network to mine the potential discriminative features of the second-order statistics.
To tackle the above problems, we propose a novel HR SAR image classification method using a multi-scale deep feature fusion network and covariance pooling manifold network (MFFN-CPMN). MFFN-CPMN combines the advantages of local spatial features and global statistical properties and considers multi-feature information fusion in representation learning to describe a SAR image. To our knowledge, this is the first approach that integrates a CPMN with a CNN for classifying HR SAR images with a resolution equal to or even finer than 1 m. The main contributions of this paper are twofold.

1.
We propose a Gabor-filtering-based multi-scale feature fusion network (MFFN) to obtain effective spatial feature representations. MFFN combines the strengths of the unsupervised denoising dual-sparse encoder and a multi-scale CNN to learn discriminative features of HR SAR images. Meanwhile, MFFN introduces both intra-layer and inter-layer feature fusion strategies to adequately utilize the complementary information between different layers and different scales.

2.
We introduce a covariance pooling manifold network (CPMN) to capture the statistical properties of the HR SAR image in the MFFN feature space. The CPMN characterizes the distributions of spatial features by covariance-based second-order statistics and incorporates the covariance matrix into a deep network architecture to further make the global covariance statistical descriptor more discriminative for various land covers.
The rest of this paper is organized as follows. The proposed classification method MFFN-CPMN is described in Section 2. Experimental results and analysis on four real HR SAR images are presented in Section 3. Finally, the conclusion is drawn in Section 4.

Figure 2 shows the schematic of the proposed MFFN-CPMN-based classification method for the HR SAR image. In general, the proposed method consists of the following two steps: (1) Gabor-filtering-based multi-scale deep fusion feature extraction; (2) global second-order statistics extraction and classification based on the covariance pooling manifold network. The proposed method is elaborated in detail in the following subsections.

CNN can learn high-level representations from the low-level features of SAR data in a hierarchical way. Thus, the representation ability of the low-level features affects the subsequent high-level representation. The backscattering of a single-polarized HR SAR image is very sensitive to the shape and orientation of the scatterers. Moreover, complex geometrical information and coherent speckle exist in the SAR image. If only the raw image is used to optimize the first-layer parameters of the network, these factors may harm the performance of CNN in extracting SAR image features. Taking into account that the Gabor filter [46] has direction-selection characteristics, it matches the orientation sensitivity of SAR images. Gabor filtering can extract rich multi-scale and multi-direction spatial information, which may reduce the feature extraction burden of CNN.

Materials and Methods
The Gabor filter is modulated by a Gaussian function and a sinusoidal plane wave [47], whose general function can be defined as:

$$G(x, y) = \frac{f^{2}}{\pi \gamma \eta} \exp\left(-\left(\alpha^{2} x'^{2} + \beta^{2} y'^{2}\right)\right) \exp\left(j 2\pi f x'\right), \quad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta,$$

where $f$ is the central frequency of the sinusoid, $\theta$ denotes the orientation of the Gabor function, and $\alpha$ and $\beta$ are the sharpness of the Gaussian along the two axes, respectively. $\gamma = f/\alpha$ and $\eta = f/\beta$ are defined to keep the ratio between frequency and sharpness. To obtain Gabor features, a set of Gabor filters with different frequencies and orientations is required:

$$G_{u,v}(x, y) = G(x, y)\big|_{f = f_{u},\, \theta = \theta_{v}}, \quad f_{u} = \frac{f_{\max}}{(\sqrt{2})^{u}}, \quad \theta_{v} = \frac{v\pi}{V}, \quad u = 0, \ldots, U-1, \quad v = 0, \ldots, V-1.$$

Here, $f_{\max}$ is the highest peak frequency of the Gabor function, and $U$ and $V$ represent the numbers of scales and orientations of the Gabor filters, respectively. Then, Gabor features are extracted by convolving the SAR image $I(x, y)$ with every Gabor filter $G_{u,v}(x, y)$:

$$F_{u,v} = \left| I(x, y) \otimes G_{u,v}(x, y) \right|,$$

where $F_{u,v}$ denotes the Gabor features corresponding to scale $u$ and orientation $v$, and $\otimes$ and $|\cdot|$ are the convolution and absolute-value operators, respectively. By stacking the Gabor feature maps with different scales and different orientations, this step enriches the low-level features of objects used for CNN classification.
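To make the filter-bank construction concrete, the following is a minimal sketch in Python/NumPy. The kernel size, `f_max = 0.25`, and the toy image are illustrative assumptions rather than the paper's settings, and `conv_same` is a slow reference convolution written out for clarity:

```python
import numpy as np

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=15):
    """Gabor kernel with peak frequency f, orientation theta, and
    sharpness alpha = f/gamma, beta = f/eta along the two axes."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-((f / gamma) ** 2 * xr ** 2 + (f / eta) ** 2 * yr ** 2))
    return (f ** 2 / (np.pi * gamma * eta)) * envelope * np.exp(2j * np.pi * f * xr)

def conv_same(image, k):
    """Plain 'same' 2-D convolution (slow reference implementation)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=complex)
    kf = k[::-1, ::-1]                               # convolution flips the kernel
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kf)
    return out

def gabor_features(image, f_max=0.25, U=5, V=8, size=15):
    """Stack |I (*) G_{u,v}| over U scales and V orientations."""
    maps = []
    for u in range(U):
        f_u = f_max / np.sqrt(2) ** u                # frequency decays with scale
        for v in range(V):
            theta_v = v * np.pi / V                  # orientations in [0, pi)
            maps.append(np.abs(conv_same(image, gabor_kernel(f_u, theta_v, size=size))))
    return np.stack(maps, axis=-1)                   # H x W x (U*V)
```

With U = 5 and V = 8 as in the experiments, a single-channel image would yield a 40-channel Gabor feature stack.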

Multi-Scale Deep Feature Fusion Network
The three main components of a traditional CNN are the convolutional layer, the nonlinear activation layer, and the pooling layer. Formally, the forward pass of the $l$th layer of a CNN can be defined as:

$$F^{l} = \mathrm{pool}\left(\sigma\left(W^{l} \otimes F^{l-1} + b^{l}\right)\right),$$

where $F^{l-1}$ is the input feature map of the $l$th layer, and $W^{l}$ and $b^{l}$ are the weights and bias of the $l$th layer, respectively. $\sigma(\cdot)$ is the nonlinear activation function, for which the sigmoid function is used in our work, and $\mathrm{pool}(\cdot)$ denotes the pooling operation. The input features $F^{0}$ of the first layer of the CNN are the Gabor features extracted above. HR SAR images contain both complex objects and extended areas. On the one hand, a traditional CNN using a single-scale convolution kernel may not accurately capture local details of different sizes. On the other hand, our CNN model is trained in a greedy layer-wise unsupervised manner, so the complementarity of features between different layers cannot be captured due to the lack of feedback information. Moreover, the shallow features of a CNN tend to capture local spatial structural information, while the deep features contain the global spatial layout information of the objects. Based on the above analysis, we need to excavate the potential information hidden in different scales and different layers to improve the feature representation capacity. Thus, we present two fusion strategies in our multi-scale feature fusion network (MFFN) to integrate local and global features between different scales and layers.
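A minimal sketch of one such layer, assuming 'same' convolution, the sigmoid activation used in this work, and average pooling as the (unspecified) pooling operator; all shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_same(F, Wt, b):
    """'Same' convolution: F is H x W x C_in, Wt is k x k x C_in x C_out."""
    k = Wt.shape[0]
    p = k // 2
    Fp = np.pad(F, ((p, p), (p, p), (0, 0)))
    H, W = F.shape[:2]
    out = np.zeros((H, W, Wt.shape[3]))
    for i in range(H):
        for j in range(W):
            # contract the k x k x C_in patch against the kernel
            out[i, j] = np.tensordot(Fp[i:i + k, j:j + k], Wt, axes=3) + b
    return out

def avg_pool(F, size=2, stride=2):
    """Average pooling over non-overlapping windows (toy choice)."""
    H, W, C = F.shape
    oh, ow = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = F[i*stride:i*stride+size, j*stride:j*stride+size].mean(axis=(0, 1))
    return out

def cnn_layer(F_prev, Wt, b):
    """F^l = pool(sigma(W^l (*) F^{l-1} + b^l))."""
    return avg_pool(sigmoid(conv_same(F_prev, Wt, b)))
```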
The first one is intra-layer fusion, which emphasizes the fusion of various local information within each layer. Specifically, inspired by the inception module [48], we aim to capture multi-scale information from the input features of each layer. As shown in Figure 3a, the original inception module uses multiple convolutional layers with different kernel sizes to extract multi-scale features, and the outputs are concatenated as the final output. In our experiments, we found that the 1 × 1 convolution kernel does not bring a meaningful improvement in accuracy. This may be because the 1 × 1 convolution focuses on the location itself and cannot obtain enough neighborhood information. Thus, as shown in Figure 3b, we construct a multi-scale convolution module using filters of sizes 3, 5, and 7. In addition, although the unsupervised learning method we adopt can efficiently handle models with large numbers of input feature channels, feature concatenation would significantly increase the number of model parameters and the computational burden. Considering the balance between accuracy and computational cost, we adopt a sum-based fusion mechanism to reduce the feature dimensions and improve efficiency. The accuracy and computational cost are reported in the experiments. Correspondingly, the process of intra-layer fusion can be expressed as follows:

$$F^{l}_{sum} = \sum_{s \in \{3, 5, 7\}} F^{l}_{s},$$

where $F^{l}_{sum}$ represents the fused features of the $l$th layer, $s$ denotes the convolution kernel size at the current scale, and $F^{l}_{s}$ represents the convolutional feature output of the filter of size $s$. Figure 2 shows the visualized multi-scale convolution feature maps. It can be seen that, by introducing different kinds of convolution kernels, the diversity of the extracted local features is increased. These features are conducive to further improving the feature representation ability of our model. The second strategy is inter-layer fusion.
As we know, the features of different layers contain different levels of spatial-contextual information: the shallow layers mine low-level structural information, while the deep layers generate high-level semantic features. To compensate for the loss of interaction information between layers due to the unsupervised layer-wise training, we fuse the features from different layers to capture complementary information and enhance the global feature representation. As shown by the green dashed line in Figure 2, the features of each layer in MFFN are concatenated to obtain the final fused features. We do not use summation fusion here because it struggles to retain the distinctive information of each layer's features, which may cause a loss of local structural information. Besides, since the spatial dimensions of different layers are inconsistent, we adopt an average pooling operation to transform the dimensions of all layers to be consistent. Finally, the feature fusion can then be easily performed by concatenation. The whole process can be represented as:

$$F_{fusion} = \bigcup_{l=1}^{L} g^{l}\left(F^{l}_{sum}\right),$$

where $F^{l}_{sum}$ is the output feature map of the $l$th layer, $g^{l}$ denotes the dimension-matching function based on average pooling, $\cup$ refers to the concatenation operation, and $F_{fusion}$ denotes the final fused features of MFFN. To illustrate the effect of our fusion strategy, different ways of fusing features are also verified in the experiments.
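Both fusion strategies can be sketched as follows, under illustrative assumptions (random kernels and toy shapes; `avg_pool_to` plays the role of the dimension-matching function g):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_same(F, Wt, b):
    """'Same' 2-D convolution: F is H x W x C_in, Wt is k x k x C_in x C_out."""
    k = Wt.shape[0]
    p = k // 2
    Fp = np.pad(F, ((p, p), (p, p), (0, 0)))
    H, W = F.shape[:2]
    out = np.zeros((H, W, Wt.shape[3]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(Fp[i:i + k, j:j + k], Wt, axes=3) + b
    return out

def intra_layer_fusion(F_prev, kernels):
    """Sum-based fusion over kernel sizes s in {3, 5, 7}:
    F_sum = sum_s sigma(W_s (*) F + b_s)."""
    return sum(sigmoid(conv_same(F_prev, Wt, b)) for Wt, b in kernels)

def avg_pool_to(F, target_hw):
    """Dimension-matching function g: average-pool F (H x W x C) down to the
    target spatial size (assumes an integer downsampling ratio)."""
    H, W, C = F.shape
    th, tw = target_hw
    return F.reshape(th, H // th, tw, W // tw, C).mean(axis=(1, 3))

def inter_layer_fusion(layer_feats):
    """Concatenate all layers' features along channels after matching the
    spatial size of the deepest (smallest) layer."""
    th, tw = layer_feats[-1].shape[:2]
    return np.concatenate([avg_pool_to(F, (th, tw)) for F in layer_feats], axis=-1)
```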
In the orange part of Figure 2, a two-layer MFFN model is presented. First, the whole model takes Gabor features as input and obtains multi-scale feature maps through multi-scale convolution. Then, the nonlinear activation and summation fusion of the features are carried out. The final multi-scale fused feature output is obtained by concatenating the multi-layer features. Note that our pooling step uses a 3 × 3 pooling region with pad = 2 instead of a 2 × 2 pooling size to reduce the grid effect. This is mainly because 2 × 2 pooling regions are disjoint, so more information is lost in each pooling layer. Overlapping pooling can reduce the spatial information loss of each layer to a certain extent, thereby improving the performance of the model [49].
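The effect of overlapping pooling can be illustrated with a generic pooling routine; pad = 1 is used in the toy example below purely to keep the output grids comparable (the paper itself uses pad = 2):

```python
import numpy as np

def pool2d(F, size, stride, pad=0):
    """Average pooling with optional zero padding (channels-last layout)."""
    if pad:
        F = np.pad(F, ((pad, pad), (pad, pad), (0, 0)))
    H, W, C = F.shape
    oh = (H - size) // stride + 1
    ow = (W - size) // stride + 1
    out = np.zeros((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = F[i*stride:i*stride+size, j*stride:j*stride+size].mean(axis=(0, 1))
    return out
```

On an 8 × 8 map, 2 × 2 pooling with stride 2 gives a 4 × 4 grid of disjoint windows; 3 × 3 pooling with stride 2 and one pixel of padding also gives a 4 × 4 grid, but adjacent windows overlap, so less spatial information is discarded per layer.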

Greedy Layer-Wise Unsupervised Feature Learning
An important aspect is how to train the weights W and biases b of the proposed MFFN model. Considering the limited labeled SAR samples, we update the parameters of MFFN by layer-wise pre-training based on an unsupervised criterion [50]. Benefiting from being free of meta-parameters and following simple rules, we introduce a new denoising dual-sparse encoder to realize the unsupervised learning of the model parameters. The blue box of the first part in Figure 2 shows the proposed denoising dual-sparse encoder; the detailed algorithm is described next. To train the parameters of the $l$th layer, a set of small unlabeled patches $D^{l-1}_{s} \in \mathbb{R}^{N \times P}$ is randomly extracted from the output feature maps of the $(l-1)$th layer as the training data, where $N$ is the number of patches, every row of $D^{l-1}_{s}$ is a vectorized patch, $P = s^{2} \cdot N^{l-1}_{h}$ is the dimension after vectorization, and $N^{l-1}_{h}$ is the output dimension of the $(l-1)$th layer. Then, inspired by the denoising autoencoder [51], we apply the denoising mechanism to the dual-sparse encoder model; we found that introducing this operation further enhances the robustness of the model to noise. Specifically, we corrupt $D^{l-1}_{s}$ into $\tilde{D}^{l-1}_{s}$ with a certain probability $\lambda$ through a stochastic mapping:

$$\tilde{D}^{l-1}_{s} \sim \varphi\left(\tilde{D}^{l-1}_{s} \mid D^{l-1}_{s}, \lambda\right),$$

where $\varphi(\cdot)$ is a type of distribution determined by the original distribution of $D^{l-1}_{s}$ and the type of random noise added to it. In general, $\varphi$ is set to a Bernoulli distribution: the elements of the input $D^{l-1}_{s}$ are randomly forced to 0 with probability $\lambda$ ($\lambda$ is set to 0.5 in our work), and the others are left untouched.
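The Bernoulli corruption step is straightforward to sketch (the function name and the seeding are illustrative):

```python
import numpy as np

def corrupt(D, lam=0.5, rng=None):
    """Randomly force each element of D to 0 with probability lam
    (Bernoulli masking); the remaining elements are left untouched."""
    rng = np.random.default_rng(rng)
    mask = rng.random(D.shape) >= lam   # keep each entry with probability 1 - lam
    return D * mask
```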
For the $l$th layer, the feature output is computed as:

$$H^{l}_{s} = \sigma\left(\tilde{D}^{l-1}_{s} W^{l}_{s} + b^{l}_{s}\right),$$

where $H^{l}_{s} \in \mathbb{R}^{N \times N^{l}_{h}}$ is the feature output matrix of the $l$th layer, and $W^{l}_{s} \in \mathbb{R}^{P \times N^{l}_{h}}$ and $b^{l}_{s} \in \mathbb{R}^{1 \times N^{l}_{h}}$ are the weights and bias at scale $s$ of the $l$th-layer convolution kernel, respectively. Notably, $W^{l}_{s}$ here corresponds to the convolution kernel of each scale in each layer of MFFN. Thus, the trained parameter $W^{l}_{s}$ can be reshaped into the form $W^{l}_{s} \in \mathbb{R}^{s \times s \times N^{l-1}_{h} \times N^{l}_{h}}$ and applied at the location of the corresponding convolution kernel. To form a sparse optimization function, a dual-sparse encoder based on enforcing population and lifetime sparsity (EPLS) [40] is used to restrict the units of $H^{l}_{s}$ to strong dual sparsity and to build a one-hot sparse target matrix $T^{l}_{s}$. Finally, the parameters can be updated by minimizing the squared L2 norm of the difference between $H^{l}_{s}$ and $T^{l}_{s}$:

$$\min_{W^{l}_{s},\, b^{l}_{s}} \left\| H^{l}_{s} - T^{l}_{s} \right\|_{2}^{2}.$$

The model can be efficiently trained through mini-batch stochastic gradient descent with the Adadelta [52] adaptive learning rate. Figure 4 shows the model structure of the denoising dual-sparse encoder. After the model completes the training of the current layer, the weights are applied at the convolution kernel locations to obtain the output convolutional feature maps as the input of the next layer. This process is repeated until the parameters of all layers are pre-trained. The whole procedure of optimizing the parameters is purely unsupervised, and there is no need to carry out fine-tuning after the layer-wise training. Our MFFN feature extraction is summarized in detail in Algorithm 1. The superiority of the proposed MFFN method is threefold. First, the Gabor filtering can enhance the richness of low-level features and reduce the training burden of the MFFN. Second, the multi-scale convolution module based on unsupervised learning can enrich the diversity of features within each layer and make full use of a large number of unlabeled patches as training data.
Last but not least, two different fusion strategies are adopted, intra-layer and inter-layer, which not only strengthen the various local information at different scales but also capture the complementary and interaction information between different layers. The three advantages mentioned above make MFFN a very effective feature extraction model for HR SAR data under a relatively shallow network structure.
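As a toy sketch of one unsupervised training step in the spirit of the denoising dual-sparse encoder: the exact EPLS target assignment of [40] is more involved, the usage-penalized argmax below is a simplification that only approximates population and lifetime sparsity, and all shapes are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def epls_target(H):
    """Simplified one-hot target: each sample activates its strongest unit,
    while a usage counter discourages any unit from being chosen much more
    often than N / N_h times (a rough stand-in for lifetime sparsity)."""
    N, Nh = H.shape
    T = np.zeros_like(H)
    usage = np.zeros(Nh)
    for i in range(N):
        scores = H[i] - usage / (N / Nh)   # penalize over-used units
        j = np.argmax(scores)
        T[i, j] = 1.0
        usage[j] += 1.0
    return T

def train_step(D, W, b, lr=0.1):
    """One gradient step on ||H - T||^2 with H = sigmoid(D W + b)."""
    H = sigmoid(D @ W + b)
    T = epls_target(H)
    G = 2.0 * (H - T) * H * (1.0 - H)      # dL/d(pre-activation), sigmoid derivative
    W -= lr * D.T @ G / len(D)
    b -= lr * G.mean(axis=0)
    return W, b, np.sum((H - T) ** 2)
```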

Global Second-Order Statistics Extraction and Classification Based on CPMN
During the feature classification stage, mainstream CNNs typically use global average pooling [53] to aggregate the output features at the end of the network. However, this method can only capture the first-order information of the features, thereby ignoring the statistical properties of a SAR image across feature channels, which makes the model less adaptable to complex tasks such as SAR image classification. To make the feature representation more powerful, we adopt a second-order statistical method based on covariance analysis to extract more discriminative and valuable features.

Multilayer Feature Fusion Based on Covariance Pooling
To construct the input of CPMN, covariance pooling is used to form a covariance matrix from the output features of MFFN. Following the process in Figure 2, we take a 64 × 64 input sample as an example and set the number of features to $N_{h}$. We have the feature output $F^{1}_{sum} \in \mathbb{R}^{64 \times 64 \times N_{h}}$ of the first layer and the feature output $F^{2}_{sum} \in \mathbb{R}^{32 \times 32 \times N_{h}}$ of the second layer. Then, we apply average pooling to $F^{1}_{sum}$ to obtain the downsampled feature $\bar{F}^{1}_{sum} \in \mathbb{R}^{32 \times 32 \times N_{h}}$, making it consistent with the spatial dimension of the second-layer features $F^{2}_{sum}$. After that, we stack the features of each layer to get the fused features $F_{fusion} = [\bar{F}^{1}_{sum}; F^{2}_{sum}] \in \mathbb{R}^{32 \times 32 \times 2N_{h}}$. Finally, the covariance matrix can be computed as:

$$C = \frac{1}{n-1} \sum_{j=1}^{n} \left(f_{j} - \mu\right)\left(f_{j} - \mu\right)^{T},$$

where $f_{j}$, $j = 1, \ldots, n$, denotes the vectorization of $F_{fusion}$ along the third dimension and $\mu = (1/n)\sum_{j=1}^{n} f_{j}$. To make the covariance matrix strictly positive definite (SPD), regularization [44] is applied to $C$ by adding a multiple of the trace to its diagonal entries:

$$C_{reg} = C + \varepsilon \, \mathrm{tr}(C) \, I,$$

where $\varepsilon$ is a regularization value and $I$ is the identity matrix. Compared with first-order statistical features, the covariance matrix carries second-order statistics, which provide better regional feature description ability.
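The covariance pooling and trace regularization can be sketched as follows (the value of ε and the toy feature map are illustrative):

```python
import numpy as np

def covariance_pooling(F, eps=1e-3):
    """Covariance of fused feature maps F (H x W x C), with trace
    regularization to guarantee a strictly positive-definite matrix."""
    H, W, C = F.shape
    X = F.reshape(H * W, C)                 # n = H*W feature vectors f_j
    mu = X.mean(axis=0)
    Xc = X - mu
    Ccov = Xc.T @ Xc / (H * W - 1)          # (1/(n-1)) sum (f_j - mu)(f_j - mu)^T
    return Ccov + eps * np.trace(Ccov) * np.eye(C)
```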

Covariance Features Classification Based on Manifold Network
The covariance matrix obtained above usually resides on the Riemannian manifold of SPD matrices [45]. The standard approach is to apply a logarithmic transformation to map the Riemannian manifold structure to Euclidean space [43,54]; the upper triangular matrix is then vectorized and fed into a linear classifier to achieve classification. However, the covariance matrix obtained by the multi-layer feature fusion of CNN has large dimensions. In [43], a channel-average fusion strategy is proposed to reduce the dimensionality of CNN feature maps. Nevertheless, we find that, when applied to SAR image classification, channel-average fusion may cause a significant loss of information in some channel features, thereby degrading the performance of the covariance features. To obtain more discriminative covariance features, a Riemannian manifold network is adopted to achieve covariance-based feature classification. This network not only integrates the covariance matrix into the deep network but also reduces the dimensionality of the covariance matrix without losing its geometric structure. The three main building blocks of a manifold network [55] are the bilinear mapping (BiMap) layers, the eigenvalue rectification (ReEig) layers, and an eigenvalue logarithm (LogEig) layer. The light blue part in Figure 2 shows our manifold network classification framework.
Specifically, given a covariance matrix $C$ as input, the BiMap layer transforms the input SPD matrix into a new SPD matrix by a bilinear mapping $f_{b}$:

$$C_{k} = f_{b}\left(C_{k-1}; W_{k}\right) = W_{k} C_{k-1} W_{k}^{T},$$

where $C_{k-1}$ is the input SPD matrix of the $k$th layer, $W_{k} \in \mathbb{R}^{d_{k} \times d_{k-1}}$ is a weight matrix in the space of full-rank matrices, and $C_{k} \in \mathbb{R}^{d_{k} \times d_{k}}$ is the resulting matrix. According to manifold learning theory, retaining the original data structure is beneficial for classification. Thus, the BiMap layer reduces the dimensionality of the covariance matrix while preserving the geometric structure. Then, a non-linearity is introduced by the ReEig layer to improve discriminative performance. The ReEig layer rectifies the SPD matrix by tuning up its small positive eigenvalues:

$$C_{k} = f_{r}\left(C_{k-1}\right) = U_{k-1} \max\left(\epsilon I, \Sigma_{k-1}\right) U_{k-1}^{T},$$

where $U_{k-1}$ and $\Sigma_{k-1}$ are obtained by the eigenvalue decomposition (EIG) $C_{k-1} = U_{k-1} \Sigma_{k-1} U_{k-1}^{T}$, $\epsilon$ is a rectification threshold, and $\max(\cdot)$ is an element-wise matrix operation. Further, to enable the covariance features to be classified by a standard Euclidean-space classifier, we use the LogEig layer to map the output SPD matrices lying on the Riemannian manifold into Euclidean space. Formally, the LogEig layer applied in the $k$th layer is defined as:

$$C_{k} = f_{l}\left(C_{k-1}\right) = U_{k-1} \log\left(\Sigma_{k-1}\right) U_{k-1}^{T},$$

where $C_{k-1} = U_{k-1} \Sigma_{k-1} U_{k-1}^{T}$ is an eigenvalue decomposition and $\log(\cdot)$ is an element-wise matrix operation on the eigenvalues.
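The three layers can be sketched directly from their eigenvalue-based definitions (a forward pass only; the manifold-aware backpropagation of [55] is not shown, and the shapes are illustrative):

```python
import numpy as np

def bimap(C, Wk):
    """BiMap: C_k = W_k C_{k-1} W_k^T, with W_k a full-rank d_k x d_{k-1} matrix."""
    return Wk @ C @ Wk.T

def reeig(C, eps=1e-4):
    """ReEig: clamp eigenvalues from below at eps (non-linearity on SPD matrices)."""
    lam, U = np.linalg.eigh(C)
    return U @ np.diag(np.maximum(lam, eps)) @ U.T

def logeig(C):
    """LogEig: matrix logarithm, mapping the SPD manifold to Euclidean space."""
    lam, U = np.linalg.eigh(C)
    return U @ np.diag(np.log(lam)) @ U.T
```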
In the end, the vectorized outputs can be fed into a classical softmax layer for classification. The cross-entropy [56] is used to measure the prediction loss $L$ of the network based on the class conditional probability distribution of each sample:

$$L = -\sum_{i} \log p_{i, y_{i}}, \qquad p_{i,t} = \frac{\exp\left(w_{t}^{T} z_{i}\right)}{\sum_{t'=1}^{T} \exp\left(w_{t'}^{T} z_{i}\right)},$$

where $z_{i}$ is the vectorized feature vector of the LogEig layer for the $i$th sample, $y_{i}$ is its class label, $w_{t}$ are the softmax weights, and $T$ is the total number of classes. The matrix back-propagation methodology formulated in [55] is adopted to compute the partial derivatives with respect to the covariance matrix, and stochastic gradient descent is utilized to learn the network parameters, including the update of the BiMap layer weights on their manifold. The implementation details of optimizing the manifold network are summarized in Algorithm 2.
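A sketch of the final classification stage, assuming the common convention of vectorizing the upper triangle with √2 scaling on the off-diagonal entries so that the Frobenius inner product is preserved (the scaling is our assumption, and `Wc` is a hypothetical classifier weight matrix):

```python
import numpy as np

def vec_upper(S):
    """Vectorize the upper triangle of a symmetric matrix; off-diagonal
    entries are scaled by sqrt(2) to preserve the Frobenius norm."""
    d = S.shape[0]
    iu = np.triu_indices(d)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return S[iu] * scale

def softmax_cross_entropy(z, Wc, y):
    """Cross-entropy loss of a linear softmax classifier on feature z."""
    logits = Wc @ z
    logits = logits - logits.max()              # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y]), p
```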

Experimental Data and Settings
To validate the performance of the proposed method, four real HR SAR images obtained from different sensors, including the TerraSAR-X satellite, the Gaofen-3 SAR satellite, a Chinese airborne SAR system, and the airborne F-SAR system, were adopted. The detailed information of the four real HR SAR images is shown in Table 1. For each dataset, the ground-truth images were generated by manual annotation according to the associated optical images, which can be found in Google Earth. TerraSAR-X data: The TerraSAR-X data (http://www.intelligenceairbusds.com) cover the region of Lillestroem, Norway. They were acquired in October 2013 with X-band and HH polarization. The image is 1450 × 2760 pixels in size, and the resolution of these data is 0.5 m. The acquisition mode is staring spotlight. The original image and the ground truth are shown in Figure 5a.

Gaofen-3 data:
The Gaofen-3 SAR image records an area of Guangdong province, China, with C-band and HH polarization; it was acquired in March 2017. The size of this single-look data is 2600 × 4500 pixels with a spatial resolution of 1 m. The imaging mode is sliding spotlight. The original image and the ground truth are presented in Figure 6a.

To achieve pixel-based classification, training, validation, and test samples need to be constructed. In our experiment, all the labeled pixels together with their neighborhood image patches are extracted to form the samples. Image patches of 64 × 64 pixels were randomly selected according to the ground truth, which offers a balance between classification accuracy and computational cost. Five hundred samples of each class were randomly selected and divided into training and validation sets, accounting for 90% and 10%, respectively. The remaining labeled pixels were used for testing. In the testing phase, we used a stride greater than 1 when inferring the test samples to avoid excessive computational cost (we set the stride to 5 in this paper). The obtained class-probability map was then upsampled to the original resolution with a negligible loss in accuracy. The overall accuracy (OA), average accuracy (AA), kappa coefficient, and class-specific accuracy were used to quantitatively measure the classification performance of the proposed method. The optimal parameters of all methods were selected based on the best performance on the validation data. All results below are the mean values and standard deviations over five runs of the experiments. Furthermore, all experiments were implemented in MATLAB 2014 on a machine with an Intel i7 3.2-GHz CPU and 32 GB of memory.
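The strided test-time inference described above can be sketched as follows (illustrative; nearest-neighbour upsampling stands in for whatever interpolation the authors used to restore the original resolution):

```python
import numpy as np

def strided_patch_centers(h, w, patch=64, stride=5):
    """Centers of test patches sampled with a stride > 1 to cut inference cost."""
    half = patch // 2
    rows = range(half, h - half + 1, stride)
    cols = range(half, w - half + 1, stride)
    return [(r, c) for r in rows for c in cols]

def upsample_nearest(prob_map, stride):
    """Nearest-neighbour upsampling of the strided class-probability map."""
    return np.repeat(np.repeat(prob_map, stride, axis=0), stride, axis=1)
```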

Parameter Settings and Performance Analysis of the Proposed MFFN-CPMN model
In this section, we first measured the sensitivity of the classification results to different parameter settings and determined the optimal parameters for the proposed method. As a common parameter choice, the Gabor filters were set with five scales and eight orientations, the orientations being [0, π/8, π/4, 3π/8, π/2, 5π/8, 3π/4, 7π/8]. This maintains robust low-level feature representation capabilities for the SAR image. For the training of each layer of MFFN, 30,000 unlabeled local small patches from the feature maps were extracted to train the weight parameters. The corruption probability λ of the denoising dual-sparse encoder was set to 0.5, which obtains the best performance. If the absolute change of the loss function between two consecutive iterations was less than 10^−4, the iterative training was terminated immediately. For the training of the manifold network, the mini-batch size was set to 100 and the learning rate to 0.01. The maximum number of epochs was set to 150 experimentally. Then, the different parameter settings, including the effect of the feature number, the multi-scale convolution module, the number of layers, the feature fusion strategies, and the manifold network, were evaluated in detail as follows. Notably, the specific analysis and decisions for the TerraSAR-X image are elaborated in this section; the parameter determination and trend analysis for the Gaofen-3 and Airborne SAR images are the same as for the TerraSAR-X image. Despite some differences in the resolution of each dataset, we aim to avoid per-dataset parameter tuning and to generalize the same optimized model to other datasets. This is more suitable for application scenarios with tight time constraints and better verifies the generalization performance of the model.
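A Gabor filter bank with five scales and the eight orientations listed above can be sketched as follows. The wavelength progression and the bandwidth parameter sigma are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a Gabor filter at orientation theta and the given wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

# 5 scales x 8 orientations, as in the paper; the doubling scale step is illustrative
thetas = [k * np.pi / 8 for k in range(8)]
wavelengths = [4 * 2**s for s in range(5)]
bank = [gabor_kernel(31, lam, th, sigma=0.56 * lam)
        for lam in wavelengths for th in thetas]
```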

Effect of the Feature Number
First, we tested the impact of different feature numbers (20, 50, 70, 100, 150, 200, 250, and 300) on the classification accuracy. The number of features is related to the performance of MFFN. To compare the results conveniently, the number of features was set to be equal for each layer. Global average pooling is adopted at the end of MFFN, and the final features are fed to the Softmax classifier for evaluation. The experimental results are shown in Figure 9. It can be observed that the accuracy stabilizes when the feature number reaches 200; beyond 200, there is only a slight increase in accuracy. Intuitively, the feature number in a CNN expresses the diversity of features: too few features may not be sufficiently discriminative, while too many lead to feature redundancy and increased computational complexity. Therefore, balancing running time and classification accuracy, we set the feature number of each layer to 200 in our experiments.

Effect of Multi-Scale Convolution Kernel
Then, we tested the influence of the multi-scale convolution kernel (MCK) on the classification results. We fixed the number of layers of MFFN to 4 and the number of features to 200, and global average pooling is used at the end of MFFN to aggregate the features. Meanwhile, we also compared the impact of the Gabor features on the classification result of the MFFN model. Figure 10 shows the results with different convolution kernel sizes. We use the symbol "MCK135" to represent the multi-scale convolution module with filter sizes of 1, 3, and 5; similarly, "MCK357" represents the module with filter sizes of 3, 5, and 7. First, we can conclude that the MFFN model with Gabor features as input obtains better classification performance, because the Gabor filters enhance the richness of low-level features and improve the recognition accuracy of MFFN. Second, the proposed MFFN model has higher classification accuracy than the single-scale model, indicating that the multi-scale convolution kernel can mine information at different scales in the SAR image and thereby improve the expressive ability of the features. Besides, the "MCK357" module obtained the best accuracy. Therefore, we use the multi-scale convolution module with filter sizes of 3, 5, and 7 as the default setting for MFFN in our experiments.
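The "MCK357" module can be sketched as three parallel same-padded convolutions whose responses are stacked along the channel axis. This is a single-channel, single-filter-per-scale illustration; the actual module learns many filters at each scale.

```python
import numpy as np

def conv2d_same(x, kernel):
    """'Same'-padded 2-D convolution (cross-correlation) for one feature map."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def mck357(x, k3, k5, k7):
    """'MCK357': parallel 3/5/7 convolutions, outputs stacked on the channel axis."""
    return np.stack([conv2d_same(x, k3), conv2d_same(x, k5), conv2d_same(x, k7)])
```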

Effect of the Denoising Dual-Sparse Encoder and Depth
Next, we evaluated the impact of the denoising dual-sparse encoder algorithm and different network depths. To illustrate the effect of the denoising mechanism, we compared models trained with and without denoising in MFFN. The comparison results are shown in Table 2. It can be seen that, by introducing the denoising mechanism, the proposed model obtains better performance, which indicates that our denoising dual-sparse encoder is more robust to noise. Further, a deeper MFFN can learn higher-level abstractions of the SAR image, and the abstraction level of the features can significantly impact the classification results. From this table, we can see that the best performance is achieved by setting the network depth to 4. Note that we did not explore deeper networks because the 64 × 64-pixel input is reduced to 4 × 4 pixels through 4 layers of downsampling; deeper layers would overly contract the feature space and blur the boundaries between categories. Thus, the depth was set to 4 as the default in our experiments.
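A minimal sketch of the denoising idea: inputs are corrupted by masking with probability λ before encoding, and the encoder is trained so that the clean input can be reconstructed. The paper's dual-sparsity terms are omitted here, and the ReLU encoding with tied weights is an illustrative assumption, not the authors' exact encoder.

```python
import numpy as np

def corrupt(X, p=0.5, rng=None):
    """Masking corruption: zero each input entry with probability p (lambda in the paper)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(X.shape) >= p
    return X * mask

def dae_loss(X, W, b, b_prime, p=0.5, rng=None):
    """Reconstruction loss of a denoising encoder (generic sketch without sparsity terms)."""
    Xc = corrupt(X, p, rng)
    H = np.maximum(0.0, Xc @ W + b)        # ReLU encoding of the corrupted input
    Xr = H @ W.T + b_prime                 # tied-weight reconstruction
    return np.mean((Xr - X) ** 2)          # compare against the CLEAN input
```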

Effect of Multi-Layer Feature Fusion
To evaluate the effectiveness of the fusion strategy of MFFN, we compared it with different feature combination schemes. For convenience, the form "Intra-sum&inter-concat", for example, denotes the intra-layer summation and inter-layer concatenation scheme. We use "none" to indicate that no feature fusion is performed and only the features of the last convolutional layer are used for classification. "Sum" and "concat" denote fusing the intra-layer or inter-layer features by summation or concatenation, respectively, to obtain the final output features. Additionally, global average pooling is used to aggregate the final features for classification. Table 3 lists the different feature fusion schemes, together with the corresponding model complexity, running time, OA, and AA. It can be seen that the "Intra-concat&inter-concat" scheme achieves the highest accuracy, but its running time is about 2.5 times that of the intra-layer summation scheme. Further, we observe that the inter-layer summation scheme causes a certain decrease in classification accuracy, possibly because sum-based inter-layer fusion loses the specific local-structure information of each layer. The "Intra-sum&inter-concat" scheme provides a tradeoff between performance and running time; thus, we choose this scheme as the default setting for MFFN in our experiments.
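The chosen "Intra-sum&inter-concat" scheme can be sketched as follows (illustrative; `layers` is assumed to hold, for each layer, the stacked multi-scale branch outputs of equal shape):

```python
import numpy as np

def intra_sum_inter_concat(layers):
    """'Intra-sum&inter-concat': sum the multi-scale branches within each layer,
    then concatenate the per-layer results along the channel axis.

    layers: list of arrays with shape (n_branches, C, H, W)."""
    fused = [np.sum(branches, axis=0) for branches in layers]  # intra-layer sum
    return np.concatenate(fused, axis=0)                       # inter-layer concat
```

With 4 layers of 200 channels each, this yields an 800-channel fused map, consistent with the 800 × 800 covariance matrices discussed in the next subsection.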

Effect of Covariance Pooling Manifold Network
To verify the effectiveness of the covariance pooling manifold network, we defined nine models with different architectures to evaluate their impact on accuracy. M-1 indicates that only the last-layer features of MFFN (LLFN) were aggregated through global average pooling (GAP); it is referred to as "LLFN + GAP." Correspondingly, we define three manifold network structures based on LLFN. M-2, referred to as "LLFN + CPMN (200)", has no BiRe layer; the 200 × 200 covariance matrix is used directly for LogEig transformation and classification. M-3, referred to as "LLFN + CPMN (200-100)", has one BiRe layer, with the transformation matrix set to 200 × 100. M-4, referred to as "LLFN + CPMN (200-100-50)", includes two BiRe layers, with transformation matrices of 200 × 100 and 100 × 50, respectively. Besides, we define the model M-5 as MFFN with GAP, denoted "MFFN + GAP." Similarly, we define models M-6 to M-9 as manifold network models containing different numbers of BiRe layers; as shown in Table 4, their matrix transformation settings mirror those of the LLFN-based models above. From Table 4, we can see that the LLFN-based manifold networks obtain better OA than GAP. Moreover, the MFFN + GAP model has higher OA and kappa than the LLFN + GAP model, while its AA is close to that of LLFN + GAP. The reason is that GAP may ignore the spatial structure information of some targets, which makes the accuracy of some categories decline. Further, we can see that the manifold network based on multi-layer feature fusion obtains better accuracy. Comparing the manifold networks of different depths, the best classification performance is obtained when the transformation matrix is set to 800 × 400.
As the number of layers increases, some structural information may be lost due to the downsampling of the covariance matrix. Meanwhile, the risk of overfitting may increase, which eventually leads to a decrease in classification accuracy. Based on the above results, we chose M-7 as the classification model for subsequent experiments.
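The covariance pooling step feeding these manifold networks can be sketched as follows. The small ridge `eps` added to the diagonal, a common trick to keep the matrix strictly positive definite, is our assumption rather than a detail stated in the paper.

```python
import numpy as np

def covariance_pooling(F, eps=1e-4):
    """Global second-order pooling of a (C, H, W) feature map into a C x C
    covariance matrix; a small ridge keeps it strictly SPD."""
    C, H, W = F.shape
    X = F.reshape(C, H * W)
    X = X - X.mean(axis=1, keepdims=True)   # center each channel
    cov = (X @ X.T) / (H * W - 1)
    return cov + eps * np.eye(C)
```

For an 800-channel fused feature map, this produces the 800 × 800 input to the BiRe layers discussed above.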
To further illustrate the effectiveness of model training, Figure 11 shows the training and validation accuracy and loss of the above nine models at the point of minimum loss on the validation data. Depth can improve accuracy, but adding too many layers may cause overfitting and degrade accuracy. It can be seen that the M-7 model obtains the lowest loss on the validation set; meanwhile, Table 4 shows that M-7 also obtains the best result on the test set, which is consistent with our analysis above.

Experiments Results and Comparisons
To evaluate the performance of the proposed method, related methods are considered for comparison, comprising two groups: traditional feature extraction algorithms and deep learning models. The approaches and settings included in the comparison are summarized as follows. Gabor [16]: The means of the magnitudes of Gabor filter responses with 5 scales and 8 orientations are adopted.
Covariance of Textural Features (CoTF) [34]: The covariance descriptor based on Gabor features are calculated. Then, the covariance matrices are mapped into a reproducing kernel Hilbert space.
BOVW [13]: The same number of codebooks as MFFN is generated from the small unlabeled patches. Then, the histogram features are computed by using these codebooks to characterize each SAR sample.
DCAE [18]: A deep convolutional autoencoder (DCAE) is designed as in [18]. First, a series of filters are utilized as convolutional units to combine the GLCM and Gabor features. Then, a two-layer SAE is used to learn high-level features.
EPLS [27]: We adopt the same number of network layers and feature units as MFFN. The difference is that the original EPLS algorithm [27] is utilized for training the parameters of each layer of the CNN, and the CNN model uses only 3 × 3 convolution kernels to extract features.
Standard CNN [22]: The standard CNN (SCNN) model contains five layers, of which the first three are convolutional. The size of all convolutional kernels is 5 × 5, and the numbers of convolutional filters are 64, 128, and 256, respectively. An FC layer with 500 units and a Softmax layer are connected to the end of the CNN.
A-ConvNets [47]: To avoid the overfitting problem due to limited training data in SAR target classification, an all-convolutional network (A-ConvNets) is constructed. This CNN model only consists of five convolutional layers, without FC layers being used. All parameters are set to the default values as in [47].

MFFN-GAP: To further illustrate the difference between first-order and second-order statistics, we also compare the MFFN model based on global average pooling. This method is consistent with the M-5 model mentioned in Section 3.2.5 above.
MSCP [32]: To evaluate the effect of the manifold network, we use multi-layer stacked covariance pooling (MSCP) for SAR image classification. Due to the difference between SAR and optical imaging mechanisms, we use our MFFN instead of the VGG16 model as the feature extractor, focusing the comparison on the covariance pooling of MSCP. Although MSCP is not designed for SAR images, it can still serve as a benchmark to verify our covariance pooling manifold network algorithm.
Note that the features extracted by all the above algorithms are classified using Softmax for fair comparison. The same model structure obtained above is applied to three datasets to verify whether the model is stable enough for SAR images from different sensors.

TerraSAR-X SAR Image
In this section, experiments are conducted on the TerraSAR-X data to evaluate the performance of the different classification methods. Table 5 reports the class-specific accuracy, AA, OA, and kappa coefficient. We can see that the proposed MFFN-CPMN produces much better classification accuracies than the other methods: the overall accuracy reaches 89.33%, the average accuracy reaches 89.78%, and the kappa coefficient reaches 0.8511. Compared with Gabor and CoTF, the proposed MFFN-GAP yields superior classification accuracy to these traditional feature descriptors, indicating that MFFN-GAP learns a more discriminative feature representation from Gabor features. Compared with BoVW and DCAE, our model yields higher classification results, which shows that our deep MFFN can extract more effective features than these shallow feature learning methods. Compared with CNN models, including EPLS, SCNN, and A-ConvNet, MFFN achieves better performance in terms of OA and AA, because MFFN considers the complementarity of multi-scale and inter-layer information. It can also be seen that the recognition performance of our method is relatively stable across categories. The SCNN method has a very high recognition rate for water, while its classification accuracy for roads is low. This illustrates that our unsupervised feature learning has better generalization performance in feature extraction than a supervised network trained directly for the classification task. Moreover, compared with MFFN-GAP and MSCP, the CPMN in our MFFN-CPMN improves OA and AA, which indicates that the CPMN not only captures the correlations of MFFN features but also further improves the discriminative ability of the covariance descriptor through the manifold network.
In summary, the proposed MFFN-CPMN shows that jointly considering the deep data features and statistical properties of SAR images can effectively improve performance on complex SAR image classification tasks. The classification result maps of all methods are shown in Figure 12. It can be seen that traditional feature descriptors such as Gabor and CoTF can hardly identify road categories with structural features. Due to the influence of shadows, woodland and open land have similar scattering intensities in some areas, and methods such as BoVW, DCAE, and EPLS produce severe misclassification for these two categories. The SCNN, A-ConvNet, and MFFN-GAP models can identify and distinguish each type of target, and fewer "salt-and-pepper" noise classifications appear on their maps. Finally, compared with the ground truth, the proposed MFFN-CPMN method has smoother label consistency within each class area and a better visual classification appearance.

Gaofen-3 SAR Image
The four quantitative metrics, including the accuracy of each class, OA, AA, and kappa coefficient of the different classification methods, are listed in Table 6. As can be observed, the proposed MFFN-CPMN outperforms the other approaches, producing the highest classification accuracies: the OA, AA, and kappa reach 90.03%, 91.91%, and 0.8704, respectively. Compared with Gabor and CoTF, the proposed MFFN-GAP achieves higher accuracy than traditional features, which illustrates that our MFFN model can learn a high-level representation from low-level Gabor features. The classification accuracies of BoVW, DCAE, and EPLS are unsatisfactory, mainly because the multiplicative noise contained in the Gaofen-3 image weakens the feature expression ability of these models during feature learning. Our MFFN takes the influence of noise into account, and the introduced denoising mechanism makes the learned features more robust. Compared with SCNN and A-ConvNet, the MFFN-GAP obtains a 2%~4% improvement in AA, indicating that the multi-scale convolution module and feature fusion strategy designed in our MFFN improve feature discrimination over the single-scale convolution modules in SCNN and A-ConvNet. In addition, MFFN-CPMN outperforms MFFN-GAP by about 3% in AA, indicating that the global second-order statistics of the MFFN features enhance classification performance compared with global average pooling. Meanwhile, our MFFN-CPMN integrates the covariance matrix into a deep manifold network, which also obtains more accurate feature representations and classification performance than the pooling method proposed in MSCP. Figure 13 shows the classification result maps of the different methods on the Gaofen-3 SAR image. As we can see, the Gabor, BoVW, and EPLS methods produce more serious misclassifications between mountains and open land.
Meanwhile, buildings and woodland areas are confused by the supervised CNN methods, including SCNN and A-ConvNet. The methods based on the covariance descriptor obtain superior performance in the building areas, which shows that the covariance feature can deal with targets containing complex terrain information. Compared with the ground truth, the proposed MFFN-CPMN method maintains fewer noisy classifications in each class, which indicates that our method extracts more robust and effective features than the other methods. Hence, the MFFN-CPMN shows great efficiency for processing complex SAR image classification tasks.

Airborne SAR Image
Table 7 lists the accuracy of each class, OA, AA, and kappa coefficient of the Airborne SAR image with different methods. Furthermore, the classification maps of our method and several contrast methods are shown in Figure 14. The OA, AA, and kappa of the proposed MFFN-CPMN are much superior to those of the others, reaching 88.37%, 93.79%, and 0.7894, respectively. Compared with Gabor and CoTF, the proposed MFFN-GAP achieves higher accuracy than traditional features; due to their limited ability of feature expression, the traditional features fail to capture the structural information present in roads. Compared with BoVW, DCAE, and EPLS, our MFFN-GAP yields superior accuracies, with OA improved by over 5%, which illustrates that the proposed MFFN can learn effective spatial features to enhance classification performance. The proposed MFFN-GAP also improves average accuracy by about 5% over the supervised training methods, including SCNN and A-ConvNet, which indicates the advantages of the multi-scale and multi-layer feature fusion introduced by our model. Besides, the MFFN-CPMN acquires the best classification accuracy compared with MFFN-GAP and MSCP, which shows that the proposed covariance classification framework helps MFFN improve its accuracy in SAR image classification.
From the classification maps, it can be seen that the traditional feature methods cannot identify roads. The BoVW, DCAE, and EPLS models produce more "salt-and-pepper" noise in their classification results. For SCNN and A-ConvNet, there is confusion between woodland and residential areas. The proposed MFFN-CPMN appears more homogeneous on the classification map than the others, especially in the commercial and open-land areas. Note that there is confusion between the runway area and its adjacent open-land category; therefore, the accuracy at class boundaries still needs further improvement in our MFFN-CPMN method.

F-SAR Image
Table 8 shows the accuracy of each class, OA, AA, and kappa coefficient of the F-SAR image with different methods. These results are consistent with the observations above. It can be seen from the compared results that the proposed MFFN-CPMN achieves the highest classification accuracies: the overall accuracy of our approach reaches 96.61%, the average accuracy reaches 96.35%, and the kappa reaches 0.9399. Compared with Gabor, CoTF, BoVW, and DCAE, the proposed MFFN-GAP produces higher accuracies, which indicates that the proposed model has a superior ability to learn discriminative features. The scenes of water, open land, and vegetation in the image are relatively simple and regular, so the comparison methods can also achieve relatively high accuracy. The challenging task is to classify the residential area accurately; our MFFN-CPMN model achieves the highest classification accuracy there compared with EPLS, SCNN, and A-ConvNet. In addition, compared with MFFN-GAP, our MFFN-CPMN is more suitable for dealing with objects with complex structural information in the SAR classification task. Figure 15 depicts the classification result maps of the different methods on the F-SAR image. It can be observed that the proposed MFFN-CPMN achieves the best visual effect, with spatial label smoothness much better than that of the other methods.
Compared with deep models such as EPLS, SCNN, and A-ConvNet, our MFFN-CPMN yields better classification performance in residential areas, which is consistent with the results in Table 8. Hence, the MFFN-CPMN can greatly improve performance on complex SAR image classification tasks.

Discussion on Transferability of Pre-Trained MFFN
Another noteworthy point is to explore the transferability of the pre-trained MFFN model across different datasets. In some application scenarios with tight time constraints, it is necessary to realize fast feature extraction for new SAR datasets from different sensors or at different resolutions. To explore the transferability of the MFFN, we conducted experiments on the four real SAR datasets. Specifically, we trained the MFFN-GAP model with unlabeled data from one of the datasets and then transferred it to the other three datasets for feature extraction and classification. Tables 9-12 report the classification accuracy of each image when the pre-trained MFFN is transferred from each of the other images. From Tables 9-12, we can see that the difference between the results obtained with a transferred model and those obtained without transfer is only about 1-2%, except for the transfer from F-SAR to Airborne SAR. This shows that our model can quickly obtain a relatively reliable classification result when migrated to other datasets for feature extraction, saving network pre-training time in real-time application scenarios. Additionally, we found that better classification accuracy can be obtained by transferring the model pre-trained on the Gaofen-3 SAR image to the Airborne SAR image. A possible reason is that the Gaofen-3 SAR data contain more complex structural information than the Airborne SAR data, which yields a more effective feature extractor for Airborne SAR. In contrast, when the model pre-trained on Airborne SAR or F-SAR is applied to the other two images, classification accuracy decreases. This is mainly because the Airborne SAR and F-SAR images contain too many homogeneous patches, which do not provide sufficient discriminative information.
Thus, it is necessary to select the image with rich structure information to improve the transferability of the model.

Conclusions
In this paper, a novel HR SAR image classification method using multi-scale feature fusion and a covariance pooling manifold network (MFFN-CPMN) is presented. In the MFFN-CPMN, deep data features and global statistical properties of the SAR image are jointly considered in representation learning. Specifically, considering the scarcity of labeled SAR data, a denoising dual-sparse encoder is proposed to pre-train the parameters of the constructed MFFN. Meanwhile, to reduce the burden of MFFN training, we introduce multi-scale and multi-directional Gabor features at the input of MFFN to suppress speckle noise and provide more abundant low-level features. Further, a covariance pooling manifold network is utilized to extract the global second-order statistics of SAR images over the fused feature maps. Our MFFN-CPMN combines the advantages of multi-feature information fusion of SAR images, making it more suitable for processing complex SAR image classification tasks. Experimental results on four HR SAR images demonstrate that our proposed framework produces optimal results in both accuracy and visual appearance compared with related approaches. Besides, the experiments verify the potential transferability of the pre-trained model between SAR images from different sensors, which provides a solution for rapid SAR application scenarios.
Future work can be carried out in the following aspects. To address the problem of limited labeled samples, we intend to use new data generation techniques, such as generative adversarial networks (GANs) [38], to increase the amount of SAR data. Moreover, we will try to use the limited labeled samples to achieve end-to-end training of the entire MFFN-CPMN model.