Scene Classification from Synthetic Aperture Radar Images Using Generalized Compact Channel-Boosted High-Order Orderless Pooling Network

The convolutional neural network (CNN) has achieved great success in the field of scene classification. Nevertheless, the strong spatial information in CNN features and the irregular repetitive patterns in synthetic aperture radar (SAR) images make the feature descriptors less discriminative for scene classification. To provide more discriminative feature representations for SAR scene classification, a generalized compact channel-boosted high-order orderless pooling network (GCCH) is proposed. The GCCH network includes four parts, namely the standard convolution layer, the second-order generalized layer, the squeeze and excitation block, and the compact high-order generalized orderless pooling layer. All of the layers are trained by back-propagation, and the parameters enable end-to-end optimization. First, the second-order orderless feature representation is acquired by the parameterized locality-constrained affine subspace coding (LASC) in the second-order generalized layer, which cascades the first- and second-order orderless feature descriptors of the output of the standard convolution layer. Subsequently, the squeeze and excitation block is employed to learn the channel information of the parameterized LASC statistic representation by explicitly modelling the interdependencies between channels. Lastly, the compact high-order orderless feature descriptors are learned automatically by the kernelled outer product, which yields low-dimensional but highly discriminative feature descriptors. For validation and comparison, we conducted extensive experiments on a SAR scene classification dataset built from TerraSAR-X images. Experimental results illustrate that the GCCH network achieves more competitive performance than state-of-the-art networks in the SAR image scene classification task.


Introduction
With the rapid development of various synthetic aperture radar (SAR) sensors, large amounts of high-quality SAR remote sensing images have been produced. SAR imaging offers a certain ability to penetrate vegetation and works in all weather conditions, around the clock [1]. These images have been widely used in land cover classification [2], marine observation [3,4], and environment projection [5]; however, characterizing SAR images effectively is still one of the most challenging tasks in these applications. SAR image scene classification, which is a direct understanding and interpretation of SAR data [6,7], has become an important technique for the extraction of ground objects, target recognition and so on [8]. SAR images are more difficult to describe than high-resolution optical remote sensing images. As shown in Figure 1, speckle noise makes the extraction of discriminative descriptors difficult in SAR scene images; in particular, there are many irregular repetitive patterns which are spatially correlated in the SAR image [6]. Additionally, various kinds of objects exhibit diversified singularities with different shapes, contours and textures in the scene image. Taking all of these factors into account, SAR image scene classification is indeed a demanding task.
Remote Sens. 2019, 11, x FOR PEER REVIEW
Inspired by the great success achieved by CNN [9] in the computer vision community, the powerful feature representations learnt through CNN and the parameters enabling end-to-end optimization [10] make the characterization of SAR scene images possible. Research shows that the different layers of a CNN learn different levels of semantic information [11,12]; high-level semantic information is extracted from the edges and corners which are most salient in the deepest layer [13], and this high-level semantic information is helpful for SAR image scene classification. However, the standard convolution operation is performed by a fixed-size convolution kernel with a fixed step size over the feature map of the previous layer, and the result is placed at a fixed position of the output; the same holds for the standard pooling operation. As a result, deep feature representations carry strong global spatial information, and their geometric invariance is weak. To improve the performance of SAR remote sensing scene classification, the spatial structure should be ordered within small regions but orderless across large regions (due to the layout differences of scene images); however, the deep feature representation lacks an orderless feature descriptor [14]. The bag of visual words (BoVW) model [15][16][17], which produces orderless features and makes the features characterize intra-class differences [18], achieves better performance on scene classification. Well-known feature coding methods used in the BoVW pipeline include vector quantization (VQ) [19,20], sparse coding (SC) [21], locality-constrained linear coding (LLC) [22], super vector (SV) [23], Fisher vector (FV) [24], locality-constrained affine subspace coding (LASC) [25], and so on. All of them can generate orderless feature representations, which helps to improve the accuracy of scene classification.
Nowadays, more and more researchers focus on the BoVW model [26] and CNN [27] for remote sensing scene classification [28][29][30]. Cheng et al. [31] present a novel feature representation method for scene classification based on the bag of convolutional features (BoCF), which generates visual words from deep convolutional features using convolutional neural networks. Yuan et al. [32] obtain more discriminative deep features than direct extraction from the top convolutional layers of CNN, and the final feature representations are obtained by the LASC pooling of deep features. Meanwhile, Banerjee et al. [33] present a pattern mining-based approach for the efficient discovery of mid-level visual elements, which highlights the correlations between such local descriptors and the classes efficiently. Although these works have shown powerful feature representations in remote sensing scene classification, they still have some shortcomings: (1) only the first-order orderless feature is considered, while the high-order orderless feature representation, which can better characterize the distributions of SAR scene images, is missed; (2) the parameters of the BoVW model do not enable end-to-end optimization, which makes the model less flexible; and (3) the channel information of the feature descriptor is fixed in the BoVW model, which makes the feature descriptors less discriminative for classification.
To solve the above issues, a generalized compact channel-boosted high-order orderless pooling network (GCCH) is presented. The proposed GCCH network generates a compact channel-boosted high-order orderless feature descriptor, which is low-dimensional yet highly discriminative; additionally, its parameters enable end-to-end optimization. The overall framework of the proposed network is illustrated in Figure 2.
The main contributions of this paper are summarized below.

• We present a generalized compact channel-boosted high-order orderless pooling network (GCCH), which learns the high-order vector of the parameterized locality-constrained affine subspace coding by a kernelled outer product. It yields low-dimensional but highly discriminative feature descriptors, and the dimension of the final feature descriptors is user-defined.

• We employ the squeeze and excitation block to learn the channel information of the parameterized LASC statistic representation by explicitly modelling the interdependencies between channels; the block obtains the importance of each feature channel through an adaptive learning network.

• All of the layers can be trained by back-propagation. The influence of several important parameters of the proposed network is analyzed to obtain better experimental settings, and experiments on the SAR image scene classification dataset demonstrate its competitive performance.

The remainder of the paper is organized as follows: Section 2 discusses the traditional LASC method; the generalized compact channel-boosted high-order orderless pooling network is proposed in Section 3; comprehensive experimental results are reported in Section 4; and Section 5 concludes the paper and proposes future research work.

Traditional LASC Method
In the past decade, the bag of visual words model has played an important role in a variety of vision tasks. The standard BoVW pipeline [34,35] consists of feature extraction, dictionary learning, feature coding and spatial aggregation, and classification. Locality-constrained affine subspace coding (LASC), which is part of the BoVW model, performs first-order coding over a dictionary of affine subspaces and second-order coding based on information geometry, which has proven helpful for improving classification accuracy. Figure 3 illustrates the traditional LASC method. The features used by LASC are hand-crafted, such as the histogram of oriented gradient (HOG) [36], and the dictionary of LASC is an ensemble of low-dimensional linear subspaces attached to the affine subspaces.
In the traditional LASC method, the geometry of the feature space is represented by an ensemble S of low-dimensional subspaces S_i [25], where µ_i is the attachment point of the i-th subspace, namely the cluster center obtained by the K-means algorithm. The columns of A_i (a D × p matrix, where D is the feature dimension and p is the dimension of the subspace) form an orthogonal basis of the linear subspace. Each S_i defines a local coordinate system, and all of the S_i together constitute an approximation of the feature distribution.

The principle of LASC encoding for a feature y is to represent y over its top-k nearest affine subspaces, with a proximity constraint weighting the projection of y onto each neighboring affine subspace. The objective function of LASC is given by Equation (2), where λ > 0 is the parameter of the regularization term, N_{S_k}(y) is the set of the k subspaces closest to y, d(y, S_i) denotes the proximity measure between y and S_i, and ‖·‖_2 denotes the l2-norm. Based on the reconstruction error, d(y, S_i) is defined by Equation (3) as an exponentiated Euclidean distance, whose parameter γ is obtained by cross-validation. Solving Equation (2) yields the closed-form solution of LASC, denoted c_i. Finally, cascading the feature codes of all subspaces gives the first-order LASC vector.

In particular, the second-order coding form of LASC can be derived from the Fisher information. First, the gradient vector of the log-likelihood of z_i under p(z_i | S_i) is defined; the Fisher information metric is then obtained by taking the expectation E(·) with respect to p(z_i | S_i). Lastly, based on Fisher's information theory, the Fisher vector of z_i is defined, and the second-order LASC f_{S_i} is calculated as a weighted Fisher vector, with f_{S_i} = 0 if S_i ∉ N_{S_k}(y).

Finally, concatenating the first- and second-order LASC statistical information gives the entire LASC feature representation.
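For reference, the objective described in this section can be written out as below. This is a reconstruction from the symbol definitions in the text and the usual locality-constrained least-squares form, not a verbatim copy of Equation (2); in particular, the placement of the locality weight d(y, S_i) on the regularizer and the orthonormality of the columns of A_i used in the last step are assumptions.

```latex
% Assumed form of the LASC objective (cf. Equation (2)); reconstructed, not copied.
\min_{\{c_i\}} \; \sum_{S_i \in N_{S_k}(y)}
  \Big( \big\| (y - \mu_i) - A_i c_i \big\|_2^2
        + \lambda \, d(y, S_i) \, \| c_i \|_2^2 \Big)

% Setting the gradient with respect to each c_i to zero gives a ridge-type solution:
c_i = \big( A_i^{\top} A_i + \lambda \, d(y, S_i) \, I \big)^{-1} A_i^{\top} (y - \mu_i)

% If the columns of A_i are orthonormal (A_i^T A_i = I), this simplifies to
c_i = \frac{1}{1 + \lambda \, d(y, S_i)} \, A_i^{\top} (y - \mu_i)
```

Under these assumptions, each code c_i is the projection of y − µ_i onto the i-th subspace, shrunk more strongly the farther y lies from S_i.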

Generalized Compact Channel-Boosted High-Order Orderless Pooling Network
In this section, the proposed GCCH network is presented in detail. As illustrated in Figure 2, the proposed network includes four parts: (1) the standard convolution layers, (2) the second-order generalized layer (parameterized LASC layer), (3) the squeeze and excitation block, and (4) the compact high-order generalized orderless pooling layer. Table 1 gives the differences between the GCCH network and other approaches. The GCCH network learns a powerful high-order deep feature vector and the channel information by end-to-end training; additionally, a low-dimensional feature descriptor is produced to reduce the redundancy of the high-order feature vector.

First of all, the closed-form solution of Equation (2) involves a hard assignment coefficient; according to Equation (3), we replace it with a soft assignment coefficient [37,39], in which β is a positive constant and K is the number of visual words (cluster centers [40]). After expanding the squares, the soft assignment coefficient can be expressed through the linear term w_k^T y_i + b_k. LASC is a localized, second-order coding method and only considers the top-k nearest affine subspaces. Accordingly, the soft assignment coefficient of the LASC layer is restricted to N_T(y_i) (T nearest neighbor layer, TNN layer), which denotes the indexes of the T nearest visual words of y_i based on Equation (3). Meanwhile, the dimension of the dictionary A is reduced by the affine subspace layer, which is based on the nearest-neighbor principal component analysis (PCA) basis; the projection matrix is denoted P_i (P_i ∈ R^{S×D}), where S is the dimensionality of the subspace.
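The step from a distance-based soft assignment to the linear term w_k^T y_i + b_k can be checked numerically. The sketch below assumes the NetVLAD-style soft assignment exp(−β‖y − µ_k‖²) normalized over the K centers, for which expanding the squares gives w_k = 2βµ_k and b_k = −β‖µ_k‖²; the function names and the top-T truncation helper are illustrative, not taken from the paper.

```python
import numpy as np

def soft_assign_distance(y, mu, beta):
    """Soft assignment from exponentiated squared distances to K centers."""
    d2 = ((y[None, :] - mu) ** 2).sum(axis=1)    # squared distances, shape (K,)
    logits = -beta * d2
    logits -= logits.max()                       # numerical stability only
    a = np.exp(logits)
    return a / a.sum()

def soft_assign_linear(y, mu, beta):
    """Same coefficients via w_k^T y + b_k followed by a softmax."""
    w = 2.0 * beta * mu                          # assumed w_k = 2 * beta * mu_k
    b = -beta * (mu ** 2).sum(axis=1)            # assumed b_k = -beta * ||mu_k||^2
    logits = w @ y + b
    logits -= logits.max()
    a = np.exp(logits)
    return a / a.sum()

def keep_t_nearest(a, t):
    """TNN-style step: keep the t largest coefficients, zero the rest, renormalize."""
    out = np.zeros_like(a)
    idx = np.argsort(a)[-t:]
    out[idx] = a[idx]
    return out / out.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=8)                           # one local descriptor (D = 8)
mu = rng.normal(size=(5, 8))                     # K = 5 hypothetical visual words
a_dist = soft_assign_distance(y, mu, beta=0.5)
a_lin = soft_assign_linear(y, mu, beta=0.5)
a_tnn = keep_t_nearest(a_lin, t=2)
```

The term −β‖y‖² is common to all K centers and cancels in the normalization, which is why the two functions agree and why the assignment can be implemented as a 1 × 1 convolution followed by a softmax-like normalization.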
The forward computation of the first-order LASC feature representation L_1 is given by Equation (13), where M denotes the number of feature descriptors. Unlike NetVLAD and LLC, LASC also utilizes second-order statistical information: starting from Equation (7), a series of simple transformations yields the forward computation of the second-order LASC statistical information L_2 in Equation (14). The final LASC feature representation is obtained by cascading L_1 and L_2; the network architecture of the parameterized LASC layer is shown in Figure 4.
Firstly, the deep feature is normalized by the l2-norm layer (element-wise normalization). According to Equation (11), w_k^T y_i + b_k can be viewed as a 1 × 1 convolution layer with weights {w} and biases {b}, as is well known in CNNs; the exponential layer is employed to calculate e^{w_k^T y_i + b_k}, which is similar to the softmax layer. The TNN layer only considers the T nearest visual words of y_i and sets the responses of all other visual words to 0; it can be implemented by a variant of the max pooling layer. The sum-norm layer is calculated by Equation (12).

The affine subspace layer is defined as a 1 × 1 convolution layer with weights {P} and biases −P × µ, which is based on P_i(y_i − µ_i) in Equation (13); the first-order LASC feature representation is obtained through the product of the soft assignment layer and the affine subspace layer. As seen in Equations (13) and (14), the second-order LASC can be viewed as a nonlinear activation function of the first-order LASC; we define the activation function as ϕ(x) = x^2 − 1 in the activation layer, whose derivative is ϕ'(x) = 2x. The second-order statistical information is thereby obtained. Lastly, we cascade the two to acquire the final LASC statistical information. All of the parameters ({w}, {b}, {P} and −P × µ) in the parameterized LASC layer can be trained by back-propagation. The algorithm steps (forward computation) of the parameterized LASC layer are as follows:

Input: the output of the convolution layer u
Output: the entire LASC statistical information
Step 1. Normalize u by the l2-norm layer.
Step 2. The first-order LASC statistical information L_1 is obtained by rescaling the output of the soft assignment layer with the affine subspace layer.
Step 3. The second-order LASC statistical information L_2 is obtained by rescaling the output of the soft assignment layer with the affine subspace layer and the activation layer.
Step 4. Cascade L_1 and L_2.
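The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' MatConvNet implementation: the function name `lasc_layer_forward`, the tensor layouts, and the exact forms of the first- and second-order statistics (a soft-assignment-weighted sum of P_k(y_i − µ_k), and of ϕ applied to it with ϕ(x) = x^2 − 1) are inferred from the text, and any normalization constants in Equations (13) and (14) are omitted.

```python
import numpy as np

def lasc_layer_forward(u, w, b, P, mu, t):
    """Forward pass sketch.
    u: (M, D) local descriptors; w: (K, D); b: (K,); P: (K, S, D); mu: (K, D);
    t: number of nearest visual words kept by the TNN step."""
    # Step 1: l2-normalize every descriptor.
    y = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-12)

    # Soft assignment: 1x1 convolution (w, b), exponential, TNN, sum-norm.
    logits = y @ w.T + b                                  # (M, K)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    kth = np.sort(e, axis=1)[:, -t][:, None]              # t-th largest per row
    e = np.where(e >= kth, e, 0.0)                        # zero all but the top t
    a = e / e.sum(axis=1, keepdims=True)                  # (M, K)

    # Affine subspace layer: P_k y_i - P_k mu_k (weights {P}, biases -P x mu).
    z = np.einsum('ksd,md->mks', P, y) - np.einsum('ksd,kd->ks', P, mu)[None]

    # Step 2: first-order statistics (assumed form).
    L1 = np.einsum('mk,mks->ks', a, z)
    # Step 3: second-order statistics through the activation x^2 - 1 (assumed form).
    L2 = np.einsum('mk,mks->ks', a, z ** 2 - 1.0)
    # Step 4: cascade both parts into one vector of length 2 * K * S.
    return np.concatenate([L1.ravel(), L2.ravel()])

# Toy sizes for illustration only (the paper uses 128 subspaces of dimension 256).
rng = np.random.default_rng(0)
M, D, K, S = 10, 6, 4, 3
u = rng.normal(size=(M, D))
w = rng.normal(size=(K, D))
b = rng.normal(size=K)
P = rng.normal(size=(K, S, D))
mu = rng.normal(size=(K, D))
out = lasc_layer_forward(u, w, b, P, mu, t=2)
```

Every operation here is differentiable almost everywhere, which is consistent with the statement that {w}, {b}, {P} and −P × µ can be trained by back-propagation.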

Squeeze and Excitation Block
The channel information of the feature descriptor is fixed in the popular BoVW model, which makes the feature descriptors less discriminative for classification. To overcome this shortcoming, the squeeze and excitation block is employed to learn the channel information of each second-order feature statistic representation [41] by explicitly modelling the interdependencies between channels; it obtains the importance of each feature channel through an adaptive learning network. On this basis, informative features are enhanced and features which are not helpful for SAR image scene classification are suppressed. Figure 5 shows the schematic of the squeeze and excitation block.

The squeeze operation generates the channel-wise statistics by applying the global average pooling operation after the convolution layer, so that the feature maps (H × W × C) become a series of real numbers of size 1 × 1 × C. Let z denote the output of the squeeze operation; the c-th element of z is given by

$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$ (15)

where u denotes the output of the convolution layer. The convolution layer can be viewed as a collection of local descriptors, and the global average pooling layer gives the descriptors a global receptive field.

The excitation operation employs a simple gating mechanism to learn the channel-wise statistics of the features automatically. It uses two fully connected (FC) layers [42] (both of which are 1 × 1 convolution layers) and a ReLU activation layer to limit the complexity and aid the generalization of the model:

$e = \sigma(W_2 \, \delta(W_1 z))$ (16)

The first FC layer is the dimensionality-reduction layer [43] with parameters W_1 ∈ R^{(C/r)×C}, where r is the ratio of dimension reduction; δ is the ReLU activation function; the second FC layer is the dimensionality-increasing layer with parameters W_2 ∈ R^{C×(C/r)} in Figure 5; and σ is the sigmoid activation function, which yields the final channel-wise statistics (1 × 1 × C). The algorithm steps (forward computation) of the squeeze and excitation block are as follows:

Input: the output of the convolution layer u
Output: the channel information of the parameterized LASC statistic representation
Step 1. The squeeze output z is calculated by Equation (15).
Step 2. The excitation output e is obtained by Equation (16).
Step 3. The output (H × W × C) is obtained by rescaling the convolution output u with e.

The squeeze and excitation block learns the channel information of the parameterized LASC statistic representation by explicitly modelling the interdependencies between channels, and helps to boost feature discriminability.
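The three steps of the block can be sketched as below. This is a minimal NumPy illustration assuming bias-free FC layers and the (H, W, C) layout used in the text; the function name `se_block` and the random toy weights are illustrative, not the trained parameters.

```python
import numpy as np

def se_block(u, W1, W2):
    """Squeeze and excitation forward pass.
    u: (H, W, C) feature maps; W1: (C // r, C); W2: (C, C // r)."""
    z = u.mean(axis=(0, 1))                       # Step 1: squeeze, shape (C,)
    hidden = np.maximum(W1 @ z, 0.0)              # reduction FC + ReLU
    e = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))      # Step 2: expansion FC + sigmoid
    return u * e, e                               # Step 3: rescale each channel

rng = np.random.default_rng(0)
H, W, C, r = 4, 4, 16, 8                          # r = 8 as in the experiments
u = rng.normal(size=(H, W, C))
W1 = 0.1 * rng.normal(size=(C // r, C))
W2 = 0.1 * rng.normal(size=(C, C // r))
out, e = se_block(u, W1, W2)
```

Because e has one entry per channel, the rescaling broadcasts over all H × W positions, so the block changes the relative weight of channels without altering the spatial layout.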


Compact High-Order Generalized Orderless Pooling Layer
The high-order orderless feature descriptor is obtained by the kernelled outer product at a single spatial location; compared with the outer product [44,45], it yields low-dimensional but highly discriminative feature descriptors. Let X = (x_1, ..., x_|S|), x_s ∈ R^c, be the input of the compact high-order generalized orderless pooling layer, where S is the set of spatial locations. The fully bilinear pooling (a c × c matrix) is given by Equation (17), where ⟨·, ·⟩ is the inner product operation; the fully bilinear descriptors correspond to a second-order polynomial kernel, so it is possible to find low-dimensional bilinear descriptors. Let k(x, y) denote the kernel function, and let φ(x) ∈ R^d, d < c^2, satisfy ⟨φ(x), φ(y)⟩ ≈ k(x, y). Equation (17) can then be approximated in terms of C(X), the compact high-order orderless feature descriptor; we utilize the random Maclaurin (RM) projection approach [46] to approximate these operations. The forward computation of the compact high-order generalized orderless pooling layer is given in the following steps:

Input: the output of the second-order generalized orderless pooling layer x
Output: the compact high-order orderless feature representation φ(x)
Step 1. Generate random W_1, W_2 ∈ R^{d×c} (the parameters to be learned), where each entry is either +1 or −1 with equal probability.

In particular, the gradients for back-propagation through the compact high-order generalized orderless pooling layer are written in terms of the following quantities: L denotes the loss function, d the projected dimension, n the index of the training sample, and s the spatial index; y_d^n denotes the output of the compact high-order generalized orderless pooling layer at dimension d for instance n; k = 1, 2 with the complementary index k̄ = 2, 1; and W_k(d) is the d-th row of W_k. The compact high-order generalized orderless pooling layer can thus be trained by the chain rule, and its parameters can be updated by back-propagation.
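A minimal sketch of the random Maclaurin projection and its gradient is given below. It assumes the standard RM form φ(x) = (W_1 x) ∘ (W_2 x) / √d summed over spatial locations, which the text does not spell out beyond Step 1; the function names are illustrative, and only the gradient with respect to the input is shown.

```python
import numpy as np

def rm_init(c, d, rng):
    """Step 1: random projections with entries +1 or -1 (equal probability)."""
    W1 = rng.choice([-1.0, 1.0], size=(d, c))
    W2 = rng.choice([-1.0, 1.0], size=(d, c))
    return W1, W2

def compact_pool_forward(X, W1, W2):
    """X: (S, c). Returns C(X) = sum_s phi(x_s), with the assumed
    phi(x) = (W1 x) * (W2 x) / sqrt(d)."""
    d = W1.shape[0]
    return ((X @ W1.T) * (X @ W2.T)).sum(axis=0) / np.sqrt(d)

def compact_pool_backward_x(X, W1, W2, grad_out):
    """dL/dX given grad_out = dL/dC(X), obtained with the product rule."""
    d = W1.shape[0]
    g1 = (grad_out * (X @ W2.T)) @ W1             # path through W1 x
    g2 = (grad_out * (X @ W1.T)) @ W2             # path through W2 x
    return (g1 + g2) / np.sqrt(d)

rng = np.random.default_rng(0)
c, d = 8, 20000
W1, W2 = rm_init(c, d, rng)
x = rng.normal(size=c); x /= np.linalg.norm(x)
y = rng.normal(size=c); y /= np.linalg.norm(y)
phi_x = compact_pool_forward(x[None, :], W1, W2)
phi_y = compact_pool_forward(y[None, :], W1, W2)
approx = phi_x @ phi_y                            # estimate of the kernel value
exact = (x @ y) ** 2                              # second-order polynomial kernel
```

With independent ±1 entries, the expectation of ⟨φ(x), φ(y)⟩ equals ⟨x, y⟩², so the d-dimensional descriptor approximates the c² entries of the full outer product; the approximation error shrinks as d grows.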

Experimental Results and Discussion
The challenging SAR dataset used for evaluating the proposed network is first described in this section. Subsequently, the experimental setup and the experimental results for SAR image scene classification are discussed.

Data Set
The GCCH network is evaluated on the TerraSAR-X database; this dataset contains 5000 scene images from 10 classes (agricultural, beach, building, forest, intersection, island, mountain, ocean, river, and runway), and each class consists of 500 images with a size of 256 × 256. Example images from the 10-class TerraSAR-X dataset are shown in Figure 6.


Experimental Setup
To demonstrate the effectiveness of the GCCH network, we truncate the VGG-16 network at the conv5_3 layer and use it as the standard convolution layers; the pretrained model can be downloaded from http://www.vlfeat.org/matconvnet/pretrained. The parameters of the LASC layer are set as follows: the dimension of each subspace (subspace dimension) is set to 256, the number of subspaces (dictionary size) is fixed at 128, and the number of nearest subspaces is k = 5. All methods were run in Matlab R2016b on a PC with an Intel Core i7-8700K @ 3.70 GHz, 16.00 GB RAM and an NVIDIA 1070 Ti GPU; the average results are given in the following section.
For the TerraSAR-X dataset, we set the ratio of the training set to 10%, 20% and 30% (randomly selected per class), with the remaining 90%, 80% and 70% used for testing.
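The per-class random split can be sketched as follows. This is an illustrative helper, not the authors' Matlab code: the function name `split_per_class`, the seed, and the use of integer labels are assumptions; only the class count, images per class and ratios come from the text.

```python
import numpy as np

def split_per_class(labels, train_ratio, rng):
    """Randomly pick train_ratio of the samples of every class for training."""
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_ratio * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# 10 classes with 500 images each, as in the dataset description.
labels = np.repeat(np.arange(10), 500)
train_idx, test_idx = split_per_class(labels, 0.10, np.random.default_rng(0))
```

At a 10% ratio this gives 50 training and 450 test images per class; repeating the split with different seeds and averaging the accuracy matches the statement that average results are reported.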

Experimental Results and Discussion
First of all, a series of experiments is conducted to analyze the influence of some important parameters on the GCCH network; we then compare it with other state-of-the-art and mid-level feature representation approaches on the TerraSAR-X dataset.

Evaluation of Projected Dimensions
The GCCH network generates high-order orderless feature representations which are parameterized by the user-defined projected dimension d. We perform experiments on the TerraSAR-X dataset to investigate the parameter d of the proposed network. The implementation details are as follows: the ratio of reduction is r = 8, the batch size is set to 16, the learning rate is fixed at 0.001, the momentum is set to 0.9, and the weight decay is 0.005. The Adam optimizer was employed throughout our experiments, and we fine-tuned the parameters of the layers from Conv4_3 and above in the proposed network. The experimental results are illustrated in Figure 7.
As shown in Figure 7, the accuracy decreases significantly with the low-dimensional representations d = 128, 256 and 512, whereas for d ≥ 1024 the proposed network achieves better performance. Additionally, d = 8192 and d = 16384 obtain the highest scores. In conclusion, considering the tradeoff between model complexity and classification accuracy, we suggest that between 1000 and 8000 feature dimensions is appropriate, and we set d = 8192 in our paper.

Evaluation of the Ratio of Reduction
The ratio of reduction r affects the feature channel learning of the parameterized LASC layer and the number of parameters of the final model; different values of r bring different model complexity. We vary the ratio of reduction from 2 to 16 in the proposed network and run experiments on the TerraSAR-X dataset. The other parameters are set as follows: the projected dimension is d = 8192, while the batch size, learning rate, momentum and weight decay are the same as in Section 4.3.1, and we fine-tune the parameters of the layers from Conv4_3 and above in the GCCH network.
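To make the dependence of model size on r concrete, the following sketch counts the weights that a squeeze and excitation block adds. It assumes the usual layout of two fully connected layers, C to C/r and C/r back to C, and ignores biases; the channel count C = 512 and the function name are hypothetical values chosen for illustration, not figures reported in the paper.

```python
def se_extra_weights(channels, r):
    """Weights added by an SE block with reduction ratio r (biases ignored)."""
    hidden = channels // r       # width of the bottleneck between the two FC layers
    squeeze_fc = channels * hidden   # first FC layer: C -> C/r
    excite_fc = hidden * channels    # second FC layer: C/r -> C
    return squeeze_fc + excite_fc

C = 512  # hypothetical channel count, for illustration only
for r in (2, 4, 8, 16):
    print(f"r={r:2d}: {se_extra_weights(C, r)} additional weights")
```

Under this assumption the added weights scale as 2C²/r, so halving r doubles the size of the block.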
In the squeeze and excitation block, the additional parameters result solely from the two FC layers and increase the total network capacity; however, these parameters also affect the final model accuracy. As shown in Table 2, r = 8, r = 4 and r = 2 achieve similar performance, which is better than that of r = 16; the accuracy for r = 16 is slightly lower. Because the width of the bottleneck between the two FC layers is inversely proportional to r, the number of additional parameters for r = 2 is approximately twice that for r = 4 and about four times that for r = 8. Here, considering the tradeoff between improved performance and increased model complexity, r = 8 is optimal in the GCCH network.

In this subsection, we make comparisons with other state-of-the-art methods on the TerraSAR-X dataset; all the competing methods are trained end-to-end (based on VGG-16 net), such as B-CNN [44], NetVLAD [37], and LSO-VLADNet [39]. For the TerraSAR-X dataset, three different training-test ratios (10%-90%, 20%-80% and 30%-70%) are set; additionally, the projected dimension is d = 8192, the ratio of reduction is r = 8, and the other parameters are the same as in Section 4.3.1. Figure 8 presents the performance comparisons, and Table 3 gives the performance results (average accuracy) compared with other state-of-the-art methods.
Table 3 (recovered fragment): GoogleNet (fine-tuning) [48]: 89.71, 90.40, 91.74; VGG-16 net (fine-tuning) [10]: 89…

Because the competing methods are all based on VGG-16 net, Figure 8 shows only the comparison with VGG-16 net. As displayed in Table 3 and Figure 8, NetVLAD is a first-order generalized VLAD layer network (feature dimension 512) and achieves better performance than VGG-16 net (fine-tuning the last FC layer). LSO-VLADNet is a localized and second-order VLAD network that extends the novel feature coding approach to the end-to-end model (feature dimension 512 × 2 = 1024), and it therefore exceeds the performance of NetVLAD and VGG-16 net. The B-CNN method acquires a global feature representation using the outer product at each location of the image (it can also be seen as a first-order network in our paper) and models local pairwise feature interactions in a translationally invariant manner (feature dimension 512 × 512 = 262144); B-CNN achieves performance similar to LSO-VLADNet, but it only considers the first-order feature representation and has a high dimension. The GCCH network enables low-dimensional but highly discriminative feature descriptors: it learns the high-order statistics of the parameterized locality-constrained affine subspace coding automatically by the kernelled outer product, and the squeeze and excitation block learns the channel information of each feature representation and helps to boost feature discriminability. Therefore, the proposed GCCH network has advantages (in both convergence and accuracy) over these state-of-the-art methods on the TerraSAR-X dataset, with an accuracy about 2%-3% higher than the well-known fine-tuned ResNet-50, GoogleNet and VGG-16 net.
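The feature dimensions quoted for the competing methods follow from simple arithmetic, and the outer-product pooling attributed to B-CNN can be sketched in a few lines. The sketch below is illustrative only: it sums the outer products of local descriptors over all locations and omits any normalization steps, and the function name and toy shapes are assumptions rather than details from the paper.

```python
import numpy as np

def bilinear_pool(x):
    """Sum over locations of the outer product x_s x_s^T, flattened to (c * c,).

    x has shape (S, c): S spatial locations with c channels each.
    """
    return (x.T @ x).reshape(-1)

c = 512  # channels of conv5_3 in VGG-16
feature_dims = {
    "NetVLAD": c,          # 512, as reported in the text
    "LSO-VLADNet": 2 * c,  # 512 * 2 = 1024
    "B-CNN": c * c,        # 512 * 512 = 262144
    "GCCH (d)": 8192,      # projected dimension chosen in the experiments
}
for name, dim in feature_dims.items():
    print(f"{name:12s} {dim:7d}")
print("B-CNN / GCCH dimension ratio:", feature_dims["B-CNN"] // feature_dims["GCCH (d)"])
```

With these numbers, the projected representation with d = 8192 is 32 times smaller than the full 512 × 512 outer-product descriptor.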

Comparison with Other Mid-Level Feature Representation Methods
In this section, Table 4 presents the performance comparisons between the proposed network and other mid-level feature representation methods, which are based on handcrafted features (the scale-invariant feature transform (SIFT) feature is used in this subsection). The experimental parameter settings are the same as in Section 4.3.3, all of the experiments are run 10 times except for the proposed network, the resulting features are classified by the linear support vector machine (SVM), and the overall accuracy (OA) is reported in Table 4.

Table 4 demonstrates that, in most cases, the larger the training ratio is, the better the classification will be. Sparse coding (SC) is one of the soft-assignment methods and decomposes each feature into a sparse, linear combination of the visual words; therefore, the SC method has better performance than the traditional BoW approach (hard-assignment) in most cases. Different from SC and LLC, VLAD is a first-order feature coding style which more fully considers the redundancy of the local geometric structure around visual words, and therefore the feature representation is more efficient than that of the zeroth-order feature coding methods. The FV method, which is a second-order feature coding style, computes gradients of log-likelihood functions of features with respect to covariance and mean vectors; thus, it obtains the highest score except for the proposed network (ours). The GCCH network learns the high-order statistics of the parameterized LASC (second-order feature coding) automatically by the kernelled outer product; additionally, the squeeze and excitation block learns the channel information of the statistic feature representation. From Table 4, we can observe that the GCCH network outperforms all the other mid-level feature representation methods, with an increase between 2% and 6.8%.
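The difference between zeroth-order hard assignment (BoW) and first-order coding (VLAD) discussed above can be illustrated with a small NumPy sketch. The codebook, the descriptors and the function names are toy assumptions; the experiments in the paper use SIFT descriptors and their own codebooks, which are not reproduced here.

```python
import numpy as np

def nearest_word(features, codebook):
    """Index of the closest visual word for every local descriptor."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_encode(features, codebook):
    """Zeroth-order coding: normalized histogram of hard assignments."""
    assign = nearest_word(features, codebook)
    hist = np.bincount(assign, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

def vlad_encode(features, codebook):
    """First-order coding: per-word sum of residuals, then l2 normalization."""
    assign = nearest_word(features, codebook)
    k, dim = codebook.shape
    v = np.zeros((k, dim))
    for j in range(k):
        members = features[assign == j]
        if members.size:
            v[j] = (members - codebook[j]).sum(axis=0)
    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy example: two visual words and three two-dimensional descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
features = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
print("BoW :", bow_encode(features, codebook))
print("VLAD:", vlad_encode(features, codebook))
```

BoW only counts how many descriptors fall on each word, whereas VLAD also keeps the direction in which the descriptors deviate from the word, which is why its dimension is the codebook size times the descriptor dimension.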
The confusion matrix reports the detailed classification results of each scene label; each column represents the predicted class, and each row represents the real class. As displayed in Figure 9, some scene categories such as agricultural, beach and runway cannot be effectively classified by the LLC+SIFT and VLAD+SIFT methods, whereas the proposed GCCH network classifies most categories correctly, with a classification accuracy higher than 90%; in particular, the agricultural and intersection classes attain a classification accuracy higher than 95%. In addition, all of the scene classes obtain a higher score than with the LLC+SIFT and VLAD+SIFT approaches. Therefore, the feature descriptors extracted by our proposed network are more discriminative than the others.
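The row and column convention of the confusion matrix, and the way the per-class and overall accuracies follow from it, can be sketched as below. The labels in the example are made up for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are the real classes, columns are the predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_accuracy(cm):
    """Diagonal entry divided by the row sum (0 for classes without samples)."""
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals, out=np.zeros(len(cm)), where=totals > 0)

def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    return np.trace(cm) / cm.sum()

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print("per-class accuracy:", per_class_accuracy(cm))
print("overall accuracy:", overall_accuracy(cm))
```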

The Ablation and Combined Experiments
We performed ablation and combined experiments to demonstrate the effectiveness of each module: the parameterized LASC layer (PLASC), the squeeze and excitation block (SE) and the compact high-order generalized orderless pooling layer (CHG). In this subsection, we truncate VGG-16 net at the conv5_3 layer, use it as the standard convolution layer, and run experiments under three different training-test ratios: 10%-90%, 20%-80% and 30%-70%.
We can observe that the VGG-16+CHG approach, which can be viewed as a first-order feature representation, achieves better performance than VGG-16 (fine tuning). Meanwhile, LASC (locality-constrained affine subspace coding) constructs a dictionary consisting of an ensemble of linear subspaces attached to representative points, and PLASC contains both the first-order and second-order feature representations; therefore, VGG-16+PLASC is better than VGG-16 (fine tuning) and VGG-16+CHG. VGG-16+PLASC+SE learns the channel information of the parameterized LASC statistic representation by explicitly modelling the interdependencies between channels and acquires a higher score than VGG-16+PLASC. Similarly, VGG-16+PLASC+CHG can learn generalized compact high-order feature descriptors, which are more discriminative than those of VGG-16+PLASC and VGG-16+CHG. The GCCH network not only learns the channel information of the parameterized LASC statistic representation, but also acquires the high-order feature descriptors of the channel-boosted parameterized LASC statistic representation; therefore, the proposed GCCH network acquires the highest score.
In addition, we compare the computational complexity of the GCCH network with that of the ablation and combined networks in Table 5. The SAR remote sensing scene image resolution is 256 × 256. The computational efficiency of each module is reported in detail. The proposed GCCH network truncates VGG-16 net at the conv5_3 layer, so it does not contain the last three fully connected layers of VGG-16 net, and replaces them with PLASC, SE and CHG (which contain fewer parameters than the last three fully connected layers of VGG-16 net); therefore, the GCCH network runs faster than the VGG-16 (fine tuning) network.
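For a sense of scale, the size of the three fully connected layers that the GCCH network drops can be counted directly. The sketch below assumes the standard VGG-16 head (a 7 × 7 × 512 input to the first FC layer, two hidden layers of width 4096 and a final classifier); these shapes are the usual published configuration and are an assumption here, since the paper does not list them. The sizes of PLASC, SE and CHG are not computed, because they depend on details not given in this section.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def vgg16_fc_params(n_classes=1000, pool5_shape=(7, 7, 512), hidden=4096):
    """Parameters of the three FC layers of a standard VGG-16 head (assumed shapes)."""
    n_in = pool5_shape[0] * pool5_shape[1] * pool5_shape[2]
    return (
        fc_params(n_in, hidden)        # fc6
        + fc_params(hidden, hidden)    # fc7
        + fc_params(hidden, n_classes) # fc8
    )

print("1000-class head:", vgg16_fc_params())
print("10-class head:  ", vgg16_fc_params(n_classes=10))
```

Under these assumptions the fully connected head accounts for well over a hundred million parameters, most of them in the first FC layer.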

Conclusions
This paper presents a novel and effective network for SAR image scene classification, named the generalized compact channel-boosted high-order orderless pooling network (GCCH), which generates compact, high-order and low-dimensional but highly discriminative feature descriptors. The proposed network learns the high-order statistics of the parameterized LASC (second-order feature coding style) automatically by the kernelled outer product, and the squeeze and excitation block is employed to learn the channel information of the statistic feature representation by explicitly modelling the interdependencies between channels. All of the layers can be trained by back-propagation, and the parameters enable end-to-end optimization. Experimental results on the TerraSAR-X dataset demonstrate that the proposed GCCH network further improves the SAR scene classification performance and performs better than the other methods.
The full statistics of convolutional activations and higher-order feature coding styles will be considered in future work. Furthermore, the encoding of high-order feature statistics on different network layers is worthy of study.

Figure 1. The synthetic aperture radar (SAR) scene images in the same class. (a) and (b) are agricultural images; (c) and (d) are building images.

Figure 2. The overall framework of the generalized compact channel-boosted high-order orderless pooling (GCCH) network. The proposed GCCH network includes the standard convolution layers, the second-order generalized layer (parameterized locality constrained affine subspace coding (LASC) layer), the squeeze and excitation block, and the compact high-order generalized orderless pooling layer. The black arrow represents the forward operation of the proposed network; all of the parameters can be updated by back-propagation in the GCCH network.

Figure 3. The illustration of the traditional LASC method. The LASC features are handcrafted, such as the histogram of oriented gradients (HOG) [36], and the dictionary of LASC is an ensemble of low-dimensional linear subspaces attached to the affine subspaces.

Figure 4. The network architecture of the parameterized LASC layer. The input of the parameterized LASC layer is the output of the standard convolution layer in the convolutional neural network (CNN), which is a D-dimensional feature vector. The dimensionality of both the first-order and second-order LASC statistical information is K × S. Finally, we cascade them.

Figure 5. The schematic of the squeeze and excitation block, including the squeeze operation and the excitation operation.

Figure 7. Influence of projected dimensions in the GCCH network.

Table 1. The differences between the proposed GCCH network and other approaches.

Table 2. Evaluation of the ratio of reduction.

Table 3. The performance results compared with other state-of-the-art methods.

Table 4. Comparison with other mid-level feature representation methods.

Table 5. The ablation and combined experiments of the GCCH network.