Concentric Circle Pooling in Deep Convolutional Networks for Remote Sensing Scene Classification

Convolutional neural networks (CNNs) have been increasingly used in remote sensing scene classification/recognition. Conventional CNNs are sensitive to the rotation of the image scene, which inevitably results in the misclassification of remote sensing scene images that belong to the same category. In this work, we equip the networks with a new pooling strategy, “concentric circle pooling”, to alleviate this problem. The new network structure, called CCP-net, can generate a concentric circle-based spatial-rotation-invariant representation of an image, hence improving the classification accuracy. A square kernel is adopted to approximate the circle kernels in concentric circle pooling, which is much more efficient and more suitable for CNNs to propagate gradients. We implement the training of the proposed network structure with standard back-propagation; thus, CCP-net is an end-to-end trainable CNN. With these advantages, CCP-net should in general improve CNN-based remote sensing scene classification methods. Experiments using two publicly available remote sensing scene datasets demonstrate that CCP-net can achieve competitive classification results compared with state-of-the-art methods.


Introduction
With the development of remote sensing technology, large amounts of high-resolution Earth-observation images are becoming increasingly available and playing an ever-more important role in remote sensing scene classification [1]. High-Resolution Satellite (HRS) images provide much of the appearance and spatial arrangement information that is useful for remote sensing scene category recognition [2]. However, it is difficult to recognize remote sensing scene categories because they usually cover multiple land use categories or ground objects [3][4][5][6][7][8][9][10], such as airports with airplanes, runways, and grass. The classification of HRS images has turned from a single remote sensing scene category-based or single object-based classification to a remote sensing scene-based semantic classification [11]. Remote sensing scene categories are largely affected and determined by human and social activities, and the recognition of remote sensing scene images is therefore based on a priori knowledge. As a result of such difficulties, traditional pixel-based [12] and low-level feature-based image classification techniques [13,14] can no longer achieve satisfactory results for remote sensing scene classification.
In the past few years, a large number of feature representation models have been proposed for scene classification. One of the most widely used models is the Bag-Of-Visual-Words (BOVW) model [15][16][17][18], which provides an efficient solution for remote sensing scene classification. The BOVW model, initially proposed for text categorization, treats an image as a collection of unordered appearance descriptors and represents the image with the frequency of "visual words" that are constructed by quantizing local features, such as the Scale-Invariant Feature Transform (SIFT) [19] or Histograms of Oriented Gradients (HOGs) [20], with a clustering method (for example, k-means) [15]. The original BOVW method discards the spatial order of the local features, which severely limits the descriptive capability of the image representation. Therefore, many variant methods [4][21][22][23] based on the BOVW model have been developed to improve the ability to depict the spatial relationships of local features. These methods are based on hand-crafted features, which rely heavily on the experience and domain knowledge of experts. Because they do not take the details of the actual data into account, it is difficult for these low-level features to attain a balance between discriminability and robustness [24]. Such features often fail to accurately characterize the complex remote sensing scenes found in HRS images [25].
Deep Learning (DL) algorithms, especially Convolutional Neural Networks (CNNs), have achieved great success in image classification [26], detection [27], and segmentation [28] on several benchmarks [29]. A CNN is a hierarchical network invariant to image translations, composed of convolutional layers, pooling layers, and fully-connected layers. The key to its success is the ability to learn increasingly complex transformations of the input and to capture invariances from large labelled datasets [30]. However, it is difficult to apply CNNs directly to remote sensing scene classification, because they have millions of parameters and the available training samples are insufficient to train them. The studies in References [31][32][33][34][35][36][37] have demonstrated that CNNs pre-trained on large natural image datasets such as ImageNet [38] act as general-purpose feature extractors that are transferable to many other domains. This is very helpful in remote sensing scene classification because of the difficulty of training a deep CNN with a small number of training samples. Many approaches achieve this transfer by using the outputs of a deep fully-connected layer as features. These methods, however, concatenate the outputs of the last convolutional layer to link with the fully-connected layer. This transformation does not capture information about the spatial layout, which limits its descriptive ability in remote sensing scene classification. The works in References [4,25,39] can capture the spatial arrangement of local features, but they are designed for hand-crafted features and are not end-to-end trainable as CNNs are. Spatial Pyramid Pooling (SPP) [40][41][42][43][44] (popularly known as Spatial Pyramid Matching or SPM [21]) can incorporate spatial information by partitioning the image into increasingly fine subregions and computing the histograms of local features found inside each subregion. He et al. [41] introduce an SPP layer, which should, in general, improve CNN-based image classification methods. Nevertheless, this pooling method was designed for natural image scene classification using ordered regular grids that incorporate spatial information into the representation, and it is therefore sensitive to the rotation of image scenes. This sensitivity inevitably causes the misclassification of scene images, especially remote sensing scene images, and degrades classification performance. The works in References [45,46] learn a rotation-invariant CNN model for object detection in remote sensing images by using data augmentation, which generates a set of new training samples through rotation transformations. These methods were designed for object detection, and the data augmentation inevitably increases the computational cost, especially on large datasets, because several transformations are required for each training sample.
In this paper, we introduce a Concentric Circle Pooling (CCP) layer to incorporate rotation-invariant spatial layout information of remote sensing scene images. The concentric circle-based partition strategy of an image has been proven effective for rotation-invariant spatial information representation in color and texture feature extraction [47,48] and in the BOVW [11] and FV representations [49]. It partitions the image into a series of annular subregions and aggregates the local features found inside each annular subregion. However, concentric circle pooling has not been considered in the context of CNNs for remote sensing images. We applied this strategy to CNN models and designed a new network structure (called CCP-net) for remote sensing scene classification. Specifically, we added a CCP layer on top of the last convolutional layer. The CCP layer pools the convolutional features and then feeds them into the fully-connected layers. Thus, by using annular spatial bins, the CCP layer pools the convolutional features into a rotation-invariant spatial representation. The experiments were conducted on two public ground truth image datasets, manually extracted from publicly available high-resolution overhead imagery. The experimental results show that the CCP layer helps CNNs to represent remote sensing scene images and achieve high classification accuracies.
The remainder of this paper is organized as follows: Section 2 introduces the proposed CCP layer for remote sensing scene classification. The experimental results and analysis are presented in Section 3, followed by a discussion in Section 4. Section 5 presents some concluding remarks and perspectives on future work.

Deep Convolutional Neural Network
The typical architecture of a CNN is composed of multiple cascaded layers, including convolutional layers, nonlinear pooling layers, and fully-connected layers. The convolutional layer is the core building block of a CNN; it detects local conjunctions of features from the previous layer and maps their appearance to a feature map. Each element of these feature maps is obtained by computing a dot product between a local region of the input feature maps and a set of weights (called filters or kernels). The pooling layer reduces the spatial size of the activation maps by downsampling along the spatial dimensions of the feature maps, computing an aggregated value over each local region. Two aggregation methods, max pooling and average pooling, are used in our experiments. The fully-connected layers on top of several stacked convolutional and pooling layers usually use nonlinear activation functions [50] based on a hyperbolic tangent or rectified linear units (ReLU) [51]. The last fully-connected layer (also called the softmax layer) computes the scores for each defined class using the softmax activation function. The parameters of CNNs are trained with Stochastic Gradient Descent (SGD) based on the backpropagation algorithm [52].
We use a popular architecture, the VGG-VD networks [53], which won the runner-up prize in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014). The VGG-VD models include two very deep CNNs, known as VGG-VD16 (containing 13 convolutional layers and 3 fully-connected layers) and VGG-VD19 (containing 16 convolutional layers and 3 fully-connected layers). Figure 1 shows an example architecture using VGG-VD16 for remote sensing scene classification. The input images for VGG-VD16 must be resized to 224 × 224 × 3 to be compatible with the fully-connected layers. Because it is difficult to train such a deep network with small remote sensing datasets, we can employ a pre-trained VGG-VD16 and fine-tune it on scene datasets [54], or simply take the pre-trained CNN as a fixed feature extractor [36].
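As a small illustration of the softmax layer mentioned above, the following sketch converts raw class scores into class probabilities. The function name and the example scores are illustrative assumptions and are not taken from the paper; the maximum score is subtracted first, which is a common way to avoid overflow.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three scene classes.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

The class with the largest score receives the largest probability, and the ordering of the scores is preserved.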

Concentric Circle Pooling Layer
Because objects in remote sensing scenes appear at arbitrary rotations, it is problematic to directly use CNNs for remote sensing scene classification. Figure 2 illustrates this problem. Take the remote sensing scene images in Figure 2a as an example; they all belong to the same scene category. The 16 × 16 grey-scale maps in Figure 2a each show one channel of the feature maps extracted by the last convolutional layer of VGG-VD16. The difference among these four feature maps is not a simple rotation. Traditional CNNs flatten the feature maps to connect them with the fully-connected layers. In Figure 2b, we split each 256-dimensional flattened representation into four parts and plot each part in a different subgraph. In each subgraph, we can observe that the dissimilarity among these four representations becomes larger. This limits the capability of CNNs in remote sensing applications.
The concentric circle structure, which has been used to improve the BOVW method for remote sensing scene classification, can maintain spatial and rotation-invariant information by pooling in local annular subregions (Figure 3a). Using this structure to extract the spatial distribution of visual words yields rotation-invariant spatial information. Therefore, we introduce concentric circle pooling to CNNs to capture rotation-invariant information for remote sensing scene classification. As shown in Figure 3, circular kernels (Figure 3a) induce rotational invariance, but square kernels (Figure 3b) are computationally more efficient at the partial expense of rotational invariance. Furthermore, square kernels are more suitable for CNNs to calculate and propagate gradients. Therefore, we adopt square kernels in our proposed network architecture.
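The contrast between flattening and ring-based pooling can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the authors' implementation: the 4 × 4 map and the helper names are hypothetical, and square rings stand in for the circular ones. It rotates the map by 90 degrees and compares the flattened vector with the per-ring maxima.

```python
def rotate90(grid):
    """Rotate a square 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def ring_index(i, j, size):
    """Index of the square ring containing cell (i, j); 0 is the outermost ring."""
    return min(i, j, size - 1 - i, size - 1 - j)

def square_ring_pool(grid, reduce=max):
    """Aggregate the values of each square ring, from outermost to innermost."""
    size = len(grid)
    rings = {}
    for i in range(size):
        for j in range(size):
            rings.setdefault(ring_index(i, j, size), []).append(grid[i][j])
    return [reduce(rings[k]) for k in sorted(rings)]

# Hypothetical single-channel 4 x 4 feature map.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
rotated = rotate90(feature_map)

flat_original = [v for row in feature_map for v in row]
flat_rotated = [v for row in rotated for v in row]
pooled_original = square_ring_pool(feature_map)
pooled_rotated = square_ring_pool(rotated)

print(flat_original == flat_rotated)      # flattened vectors differ
print(pooled_original, pooled_rotated)    # ring-pooled vectors agree
```

With square rings the pooled vector is exactly unchanged only for rotations by multiples of 90 degrees, which is one way to read the "partial expense of rotational invariance" noted above.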
Remote Sens. 2018, 10, x FOR PEER REVIEW 5 of 19
To adapt the deep network to rotation-invariant spatial information, we replaced the last pooling layer (pool5, after the last convolutional layer conv5 in VGG-VD16) with a CCP layer. Figure 4 illustrates our method. In each annular subregion, we pool the response of each filter with max or average pooling. The output size of the last convolutional layer may not divide exactly by the number of subregions, so the valid number of annular subregions is between 1 and R, where R denotes the circle number, that is, the number of subregions given as input. The outputs of the CCP layer are r × K-dimensional vectors, where r ∈ [1, R] and K denotes the number of filters in the last convolutional layer. In the next subsection, we interpret the output size of the CCP layer in detail.

Training the Network
The network with the CCP layer can be trained with standard SGD [52]. We describe our training procedure using concentric circle pooling and the output size of the CCP layer.
We consider a network with fixed-size (256 × 256) input images. To compute the concentric circle pooling, we first partition the feature maps into several bins and compute the maximum or average value (corresponding to max pooling and average pooling) for each bin. As shown in Figure 5, we assume that the feature maps after the last convolutional layer have a size of a × a × 3 (a = 16) (Figure 5a). With an n-level (n = 4) concentric circle, we adopt a sliding window pooling where the window size and stride are both s = ⌈a/(2 × n)⌉ (s = 2), with ⌈·⌉ denoting the ceiling operation. In the sliding window pooling, the "same" padding operation is used. We obtain a new feature map of size m × m × 3 (m = 8), where m = ⌈a/s⌉ (Figure 5b). Then, we can efficiently compute the maximum or mean of each annular subregion and obtain r × 3 representations, where r = m/2 (r = 4) (Figure 5c). The next fully-connected layer concatenates these r × 3 outputs. Here, r may not be equal to n because of the ceiling operation. Therefore, the outputs of the CCP layer are r × K, where r ∈ [1, R].
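The two-stage computation described above can be sketched for a single channel in plain Python. This is a minimal illustration and not the authors' Keras implementation; the function names are hypothetical. Two points are assumptions: border windows are simply truncated at the bottom and right edges as a simplification of "same" padding, and an odd m is handled by treating the centre cell as its own innermost ring (the text only states r = m/2).

```python
import math

def ccp_sizes(a, n):
    """Return (s, m, r) for an a x a feature map and n requested circles."""
    s = math.ceil(a / (2 * n))   # window size and stride
    m = math.ceil(a / s)         # side length after sliding-window pooling
    r = math.ceil(m / 2)         # assumption: an odd m leaves a 1 x 1 centre ring
    return s, m, r

def window_pool(channel, s, reduce=max):
    """Non-overlapping s x s pooling; border windows are truncated (assumption)."""
    a = len(channel)
    m = math.ceil(a / s)
    return [[reduce([channel[i][j]
                     for i in range(bi * s, min((bi + 1) * s, a))
                     for j in range(bj * s, min((bj + 1) * s, a))])
             for bj in range(m)]
            for bi in range(m)]

def ring_pool(grid, reduce=max):
    """Aggregate each square ring of an m x m grid, outermost ring first."""
    m = len(grid)
    rings = [[] for _ in range(math.ceil(m / 2))]
    for i in range(m):
        for j in range(m):
            rings[min(i, j, m - 1 - i, m - 1 - j)].append(grid[i][j])
    return [reduce(ring) for ring in rings]

def ccp_channel(channel, n, reduce=max):
    """Concentric (square-ring) pooling of one channel with n requested circles."""
    s, _, _ = ccp_sizes(len(channel), n)
    return ring_pool(window_pool(channel, s, reduce), reduce)

# Worked example from the text: a = 16, n = 4 gives s = 2, m = 8, r = 4.
channel = [[i * 16 + j for j in range(16)] for i in range(16)]
print(ccp_sizes(16, 4))
pooled = ccp_channel(channel, n=4)
print(pooled)
```

Applying `ccp_channel` to each of the K channels and concatenating the results gives the r × K vector passed to the fully-connected layer; passing a mean function as `reduce` gives the average-pooling variant of this sketch.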

Experiments and Analysis
In this section, we describe the experimental setup and discuss the results on the two public datasets. We implement our CCP-net using all the pretrained convolutional layers in the VGG-VD networks as the feature extractor. Then, a fully-connected layer on top of the last convolutional layer computes the scores for each defined class using the softmax activation function. Dropout [26] is used on the fully-connected layer to control overfitting while training with SGD. We conducted several groups of experiments to investigate the effectiveness of CCP-net for remote sensing scene classification.

Experimental Setup
We evaluated our proposed model on two public remote sensing scene datasets, which were:

• UC Merced Land Use Dataset. The UC Merced dataset (UCM) is one of the first publicly available high-resolution remote sensing imagery datasets [4]. This dataset contains 21 typical remote sensing scene categories, each of which consists of 100 images measuring 256 × 256 pixels with a pixel resolution of 30 cm in the red-green-blue color space. Figure 6 shows two examples of ground truth images from each class in this dataset. The classification of the UCM dataset is challenging because of the high inter-class similarity among categories such as medium residential and dense residential areas.

• WHU-RS Dataset. The WHU-RS dataset is a publicly available dataset in which all the images were collected from Google Earth (Google Inc.) [5]. This dataset consists of 950 images with a size of 600 × 600 pixels distributed among 19 scene classes. Examples of ground truth images are shown in Figure 7. Compared with the UCM dataset, the scene categories in the WHU-RS dataset are more complicated due to the variation in scale, resolution, and viewpoint-dependent appearance.
We randomly selected samples of each class for training the CNNs and used the rest for testing. The sampling setting, as in References [4,36,55], is as follows: 80 training samples per class for the UCM dataset and 30 training samples per class for the WHU-RS dataset. Each dataset was divided 10 times, each run with randomly selected training and testing samples, to obtain reliable results. The classification accuracy was recorded as the mean and standard deviation over the 10 runs. We used Keras [56], a high-level neural network API running on top of TensorFlow [57], to implement our network with the CCP layer. The Keras Visualization Toolkit [58] was used as a high-level toolkit for visualizing the trained Keras neural network models. Experiments in this work were implemented using PyCharm 2017.3 on Windows 10 and run on a workstation equipped with a single NVIDIA GeForce GTX 1080 Ti 12 GB GPU.
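The sampling protocol above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function names, the use of the run index as a random seed, and the choice of the sample standard deviation are all assumptions. The label list mimics the UCM layout of 21 classes with 100 images each.

```python
import random
import statistics

def split_per_class(labels, n_train, seed):
    """Randomly pick n_train indices per class for training; the rest are for testing."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test

def summarize(accuracies):
    """Mean and (sample) standard deviation of the per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# UCM layout: 21 classes x 100 images, 80 training samples per class, 10 runs.
labels = [c for c in range(21) for _ in range(100)]
splits = [split_per_class(labels, n_train=80, seed=run) for run in range(10)]
print(len(splits[0][0]), len(splits[0][1]))
```

For the WHU-RS setting, the same function would be called with `n_train=30` on a label list of 19 classes with 50 images each.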

Baseline Network Architectures
As with other pooling methods, CCP is independent of the convolutional network architecture used. The coarsest concentric circle level, which has a single annular region, is a "global pooling" operation [59]. We investigated the VGG-VD network architectures and showed that CCP improves the accuracy of these architectures. Because fewer layers are suitable for small datasets, we used the pooling layer after the last convolutional layer with a softmax layer following it. We used the original image size for the UCM and WHU-RS datasets. The last convolutional layer generates 16 × 16 feature maps with a 21-way softmax layer following it for the UCM dataset, and 37 × 37 feature maps with a 19-way softmax layer following it for the WHU-RS dataset. The baseline network architectures use two different pooling methods: global pooling and spatial pyramid pooling. The global pooling layer generates 1 × 1 feature maps, and the spatial pyramid pooling generates {1 × 1, 2 × 2}, {1 × 1, 2 × 2, 3 × 3}, and {1 × 1, 2 × 2, 3 × 3, 4 × 4} feature maps for 2-level, 3-level, and 4-level pyramids, respectively.
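To make the size of each pooled representation concrete, the sketch below counts the spatial bins of each baseline and multiplies by the number of filters K. It assumes K = 512, the width of the last convolutional layer of VGG-VD16, and r = 4 rings for the CCP layer as in the worked example for a 16 × 16 feature map; the function name is hypothetical.

```python
def spp_bins(levels):
    """Number of spatial bins in a pyramid with grids 1x1, 2x2, ..., levels x levels."""
    return sum(k * k for k in range(1, levels + 1))

K = 512  # assumed number of filters in the last convolutional layer

global_length = 1 * K
spp_lengths = {levels: spp_bins(levels) * K for levels in (2, 3, 4)}
ccp_length = 4 * K  # r = 4 annular subregions

print(global_length, spp_lengths, ccp_length)
```

Under these assumptions the 4-level pyramid yields 30 bins per filter, whereas the CCP layer yields 4, so the CCP feature vector is considerably shorter.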

Parameter Evaluation
In HRS images, remote sensing scenes show great variation in shape and scale; thus, the circle number is critical for the effective representation of complex scenes. To quantitatively evaluate the effect of the circle number, we tested circle numbers from 1 to 8. In these experiments, we trained each network for 100 epochs. To speed up training, we froze all the convolutional layers and set the initial learning rate to 1 × 10⁻⁴.
Figure 8 shows the results of CCP-net with different circle numbers and aggregation methods (max pooling and average pooling) on the UCM and WHU-RS datasets. In these networks, the convolutional layers have the same structures as the baseline models, whereas the pooling layer after the final convolutional layer is replaced with the CCP layer. Our results in Figure 8 show considerable improvement over the global pooling baselines (the classification accuracy obtained by using only one concentric circle). For the UCM dataset (Figure 8a), the classification accuracies improve gradually as the circle number increases, up to a relative saturation point at 4. The accuracies fluctuate narrowly when the circle number is greater than 4. On the one hand, the outputs of the CCP layer are identical when the circle number is between 4 and 7, because of the ceiling operation explained in Section 2.2. On the other hand, the gap between 4 and 8 is small because the rotation-invariant information is already saturated when the circle number is 4. The optimal performance of CCP-net with VGG-VD19 is slightly better than that of CCP-net with VGG-VD16. In Figure 8b, we can see that the optimal circle number is 5 for each model on the WHU-RS dataset. Generally, the average pooling method is superior to the max pooling method for each VGG-VD model. For the average pooling method, the accuracies of VGG-VD19 are slightly higher than those of VGG-VD16, but VGG-VD19 is more time-consuming due to its deeper layer structure. Therefore, we chose VGG-VD16 with the average pooling method for our CCP-net in the following sections.
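The statement that circle numbers between 4 and 7 give identical CCP outputs on the UCM dataset follows directly from the size formulas of the training subsection. The short sketch below, with a hypothetical helper name, sweeps the requested circle number n from 1 to 8 for a 16 × 16 feature map and reports the resulting number of valid rings r; m is even in all of these cases, so r = m/2 is an integer.

```python
import math

def valid_rings(a, n):
    """Number of annular subregions r for an a x a map and n requested circles."""
    s = math.ceil(a / (2 * n))   # window size and stride
    m = math.ceil(a / s)         # side length after sliding-window pooling
    return m // 2                # r = m / 2 (m is even for a = 16, n = 1..8)

rings_by_circle_number = {n: valid_rings(16, n) for n in range(1, 9)}
for n, r in rings_by_circle_number.items():
    print(f"requested circles n={n} -> valid rings r={r}")
```

For n = 4, 5, 6, and 7 the stride stays at s = 2, so the pooled map and the ring count are the same, while n = 8 reduces the stride to 1 and doubles the number of rings.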
We compare our concentric circle pooling with spatial pyramid pooling in Table 1. For the SPP-net, the optimal pyramid for VGG-VD16 is a 4-level pyramid, {1 × 1, 2 × 2, 3 × 3, 4 × 4}, and for VGG-VD19 it is a 3-level pyramid, {1 × 1, 2 × 2, 3 × 3}. The classification accuracies of CCP-net are all better than those of SPP-net. The results show that a spatial arrangement based on a regular grid is insufficient for arbitrarily oriented objects in remote sensing scenes. In Table 1, we also list the results of the traditional VGG-VD networks in Reference [36], which removes the last fully-connected layer (the softmax layer) of a pre-trained CNN and fixes the rest of the CNN. The CCP-net performs better than the traditional VGG-VD networks on these two datasets. Overall, the CCP-net, which considers rotation-invariant information, is effective for remote sensing scene classification tasks.
We evaluate the time consumption (measured in seconds) of training models with different pooling methods on the UCM and WHU-RS datasets, as shown in Figure 9. These models have almost the same computational cost on the same dataset. Because the feature vector produced by the CCP layer is shorter than that produced by the SPP layer, the CCP-net consumes slightly less time than the SPP-net.
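The difference in feature length can be illustrated with a short calculation. This sketch assumes 512 channels after the last convolutional layer (as stated in the caption of Figure 4), one pooled value per circle and channel for the CCP layer (Figure 5), and k × k bins per channel for each k × k pyramid level in the SPP layer. The helper names are hypothetical.

```python
def spp_feature_length(pyramid_levels, channels):
    """Length of an SPP feature: each k x k level gives k*k bins per channel."""
    return channels * sum(k * k for k in pyramid_levels)

def ccp_feature_length(num_circles, channels):
    """Length of a CCP feature: one pooled value per circle and channel."""
    return channels * num_circles

CHANNELS = 512  # filter number of the last convolutional layer (Figure 4)

spp_len = spp_feature_length([1, 2, 3, 4], CHANNELS)  # 30 bins * 512 = 15360
ccp_len = ccp_feature_length(4, CHANNELS)             # 4 circles * 512 = 2048
```

Under these assumptions, the 4-level pyramid yields a vector 7.5 times longer than four concentric circles, which is consistent with the slightly lower training time reported for CCP-net.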

Confusion Matrix
To obtain the optimal performance of CCP-net, we used the VGG-VD16 model and unfroze all the layers to train the entire network. We trained for 1000 epochs with a small learning rate of 5 × 10⁻⁶. Early stopping [60] was used to stop the training before the weights had converged, in order to control overfitting while training with SGD. The classification accuracies (%) of the individual classes on the UCM and WHU-RS datasets, obtained using the CCP-net with the optimal circle number described previously, are shown in Figure 10. As these confusion matrices show, our proposed method can extract meaningful information for the different categories in these two datasets.
In Figure 10, 19 of the 21 UCM remote sensing scene classes have classification accuracies exceeding 95%; the classification accuracies of the baseball diamond, forest, freeway, harbor, overpass, river, sparse residential, storage tank, and tennis court classes exceed 99%. The error rates of the easily confused classes (dense residential, medium residential, and sparse residential) are all reduced to under 2% owing to the rotation-invariant spatial arrangements captured by concentric circle pooling. The classification accuracies of 9 of the 19 remote sensing scene categories in the WHU-RS dataset are over 99%. The most confused scene classes are beach and chaparral in the UCM dataset, and river and forest in the WHU-RS dataset. This is likely to result from insufficient spatial arrangement information and the high similarity of the texture structure in these scene pairs.
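As an illustration of how per-class accuracies and the most confused class pair can be read from a confusion matrix, the sketch below uses a small made-up 3-class matrix, not the values from Figure 10. It assumes that rows are true classes, columns are predicted classes, and every class has at least one sample. The function names are hypothetical.

```python
import numpy as np

def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def most_confused_pair(confusion):
    """Return (true_class, predicted_class) of the largest off-diagonal count."""
    off_diag = np.array(confusion, dtype=float)  # copy, so the input is untouched
    np.fill_diagonal(off_diag, 0.0)
    true_idx, pred_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    return int(true_idx), int(pred_idx)

# Toy counts for three classes with 50 test images each (illustrative only).
toy = np.array([[48, 2, 0],
                [1, 49, 0],
                [0, 5, 45]])
accuracies = per_class_accuracy(toy)
pair = most_confused_pair(toy)
```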

Comparison with State-of-the-Art Methods
To illustrate the effectiveness of the CCP-net, we compared our results with various state-of-the-art methods that have reported classification accuracies on the UCM dataset. As shown in Table 2, our CCP-net largely outperforms methods that combine a sophisticated learning strategy with low-level hand-engineered features and non-linear classifiers, such as the SIFT-based BOVW and its extensions, e.g., SPM [21] and the Spatial Pyramid Co-occurrence Kernel (SPCK++) [4]. Unsupervised Feature Learning (UFL) methods [7] and their saliency-guided version, SG + UFL [8], were also included in the comparison. The results show that our proposed method outperforms these methods, and even the well-designed deep learning framework, the GoogLeNet + Fine-tune approach [59], in terms of classification accuracy. On the WHU-RS dataset, our method achieved considerably better performance (98.23 ± 0.40%) than the MS-CLBP + FV method (94.32 ± 1.2%) [61] and the SIFT + LTP-HF + Color Histogram method (93.6%) [55]. The classification accuracy of our CCP-net is slightly inferior to that of MDDC (98.27 ± 0.53%) [39] and the method presented in Reference [36] (98.64%). These two methods extract features from the convolutional layers of a pre-trained CNN and use a simple linear classifier for training and testing, which can achieve high performance with a small number of training samples. However, our method is more straightforward and end-to-end trainable, and its performance exceeds that of these two methods as the amount of data increases. Overall, CCP-net obtains remarkable classification results on the public benchmarks. The results indicate that our proposed model has great potential for the representation and classification of remote sensing scene images.

Discussion
Extensive experiments show that our CCP-net is simple but very effective for remote sensing scene classification in HRS images. We use concentric circle pooling to capture rotation-invariant spatial information in the CNN architectures. To examine this property, we create a rotated dataset based on the UCM dataset, in which each image is randomly rotated. Each class in the rotated dataset thus includes 200 images.
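A rotated dataset of this kind could be assembled roughly as follows. The text does not specify how the rotation angles are drawn, so this sketch makes a simplifying assumption: each image receives one extra copy rotated by a random multiple of 90°, which avoids interpolation and requires square images. The function name `build_rotated_dataset` and the `seed` parameter are hypothetical.

```python
import numpy as np

def build_rotated_dataset(images, labels, seed=0):
    """Append one randomly rotated copy of every image.

    images: array of shape (N, H, W, C) with H == W; labels: length-N sequence.
    Assumption: rotations are limited to 90, 180, or 270 degrees.
    """
    images = np.asarray(images)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    quarter_turns = rng.integers(1, 4, size=len(images))  # values in {1, 2, 3}
    rotated = np.stack([np.rot90(img, k=int(k), axes=(0, 1))
                        for img, k in zip(images, quarter_turns)])
    # Originals first, rotated copies second; labels are simply repeated.
    return np.concatenate([images, rotated]), np.concatenate([labels, labels])

# Tiny example: two 4 x 4 RGB "images" become four.
demo_images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)
demo_labels = [0, 1]
aug_images, aug_labels = build_rotated_dataset(demo_images, demo_labels)
```

Doubling the number of images in this way is consistent with the 200 images per class reported for the rotated dataset.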
In Table 3, we compare the CCP-net and SPP-net on the UCM and the rotated datasets. We froze all the convolutional layers and set the initial learning rate to 1 × 10⁻⁴. These networks use the optimal parameters obtained in Section 3.2.2. For both networks, the overall classification accuracies on the rotated dataset are higher than those on the UCM dataset. This may be because the rotated dataset contains additional rotated images based on the UCM dataset, which makes the sample images more abundant. With the additional rotated images, the CCP-net has a larger advantage over the SPP-net in classification accuracy, which indicates that the CCP-net is insensitive to the rotation of remote sensing scenes. The CCP-net improves more than the SPP-net in almost all categories, whereas the classification accuracies of some categories decrease by more than 1% for the SPP-net. A possible reason is that some classes are similar in part of the scene image, and the rotated images increase their similarity. For example, freeway and overpass are more easily confused, and forest is more easily recognized as sparse residential. This indicates the importance of rotation invariance for remote sensing scene images. Compared with the results on the UCM dataset, the classification accuracies on the rotated dataset are 1.7% higher for CCP-net and 0.74% higher for SPP-net; the gain of CCP-net is more than twice that of SPP-net. To sum up, our proposed method is rotation-invariant to image scenes and is effective for remote sensing scene image classification.
We use attention heatmaps and gradient-weighted class activation maps [57] to assess whether a network attends to the correct parts of the image when generating a decision. The entire models were trained as described in Section 3.3 to fine-tune the convolutional layers transferred from VGG-VD16. Figure 11 shows the visualization results of CCP-net and SPP-net on the UCM dataset. In Figure 11b,d, the heatmaps indicate that the edges and corners in the scene images contribute the most to minimizing the weighted losses for both methods. The salient regions in the heatmaps of CCP-net are relatively concentrated because of the concentric circle pooling. Owing to its multiple pyramid-level pooling, the SPP-net may attend to more regions, especially for categories with flat and similar patterns, e.g., forest and river. In Figure 11c,e, we show the gradient-based class activation maps, which produce a coarse localization map of the important regions in the image for CCP-net and SPP-net. These maps use the class-specific gradient information flowing into the final convolutional layer of the CNN. They show that these networks can learn meaningful representations for scene classification, e.g., for tennis courts and baseball diamonds. Owing to its rotation invariance, the CCP-net can focus on the distinguishing parts of scene images, for example, the intersection in the overpass and the airplane in the airport. This illustrates that rotation invariance can help CNNs to attend to the parts that discriminate between remote sensing scene categories.
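The gradient-weighted class activation map can be sketched in a few lines once the activations of the final convolutional layer and the gradients of the class score with respect to them are available. The sketch below follows the commonly used formulation (channel weights from spatially averaged gradients, a weighted sum of activation maps, then a ReLU) and assumes this matches the method of Reference [57]. The final scaling to [0, 1] is a display convenience, and the function name `grad_cam` is hypothetical.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse localization map from final-layer activations and gradients.

    activations, gradients: arrays of shape (K, H, W) for one image and class.
    """
    # Channel importance: gradient averaged over the spatial positions.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the activation maps over channels.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only positions with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example with two channels on a 2 x 2 map (illustrative values only).
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.array([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
cam = grad_cam(acts, grads)
```

In the toy example the second channel has a negative weight, so its position is suppressed by the ReLU and only the top-left position remains active.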
Several example images are presented in Figures 12 and 13 that compare the results of the networks with concentric circle pooling and spatial pyramid pooling on the UCM and WHU-RS datasets. As shown in Figure 12, remote sensing scene images containing small objects, like the baseball diamond and storage tank, can be recognized by our CCP-net model. These scenes are easily misclassified by SPP-net; when small objects occur in different subregions, a scene such as the baseball diamond is incorrectly recognized as a golf course. Owing to the rotation-invariant information, our model is more robust to cluttered backgrounds, like the tennis courts in Figure 12 and the parks in Figure 13. SPP-net, however, is more suitable for scene images with regular grid layouts, like the mobile home parks and beaches in Figure 12 and the parks in Figure 13. Overall, CCP-net outperforms SPP-net in classification accuracy because remote sensing scene images are orthographic and irregular. Rotation invariance is more important than a regular spatial arrangement for remote sensing scene classification tasks.
These experimental results demonstrate the importance of rotation-invariant information in deep CNN features for remote sensing scene datasets. Our proposed concentric circle pooling method helps CNNs to be insensitive to the rotation of scenes and to localize class-discriminative regions, thus improving classification accuracy for remote sensing scene images.

Conclusions
Concentric circle pooling is a simple but effective solution for handling rotation-invariance problems, an important issue in remote sensing scene classification using CNN architectures. We have suggested a solution to train a deep network with a concentric circle pooling layer. The resulting CCP-net achieves higher accuracy than the global pooling and spatial pyramid pooling methods in scene classification tasks. The CCP-net was evaluated on two publicly available ground truth image datasets. The experimental results show that the proposed method delivers competitive classification accuracy compared with state-of-the-art methods. In future studies, we plan to investigate a solution for handling different input sizes and multiple scales in the CCP layer and to evaluate the performance on large datasets such as NWPU-RESISC45 [29].

Figure 1. The example architecture of remote sensing scene classification using VGG-VD16. It is composed of 13 convolutional layers and 3 fully-connected layers.

Figure 2. The rotation variation problem in CNNs for remote sensing scene classification. (a) One channel of the feature maps extracted by the last convolutional layer of VGG-VD16 for each image; (b) Each feature map in (a) is flattened into a one-dimensional representation and split into four parts, and each part is plotted in a different subgraph for visualization. Each row in a subgraph corresponds to a different image in (a).

Figure 3. An example of the spatial partition of the concentric circle pooling strategy used to represent spatial information. (a) The original concentric circle structure-based pooling method; (b) The proposed concentric circle pooling method.

Figure 4. A network structure with a concentric circle pooling layer. The convolutional and pooling layers of VGG-VD16, except for the last pooling layer, are transferred to this network; 512 is the number of filters in the conv5 layer, which is the last convolutional layer.

Figure 5. An example of four-level concentric circle pooling. (a) The output feature map of the last convolutional layer; (b) The result of sliding window pooling, which transforms each circle in the output feature map of the last convolutional layer to one-pixel width; (c) The output of concentric circle pooling, which computes the maximum or mean of each annular subregion for each channel and concatenates them.

Figure 6. Two example ground truth images of each scene category in the UC Merced dataset.

Figure 7. The example ground truth images of each scene category in the WHU-RS dataset.

Figure 8. The classification accuracy of the concentric circle pooling network as a function of circle number. The AVG and MAX in the labels indicate the average aggregate and the max aggregate. (a) The UC Merced dataset; (b) The WHU-RS dataset.

Figure 9. The time consumption of training models with different pooling methods on the UCM and WHU-RS datasets. The AVG, MAX, and GLP in the labels indicate the average aggregate, the max aggregate, and the global pooling.

Figure 10. The confusion matrices showing the classification accuracies (%) for the proposed model. (a) Confusion matrix for the UC Merced dataset; (b) Confusion matrix for the WHU-RS dataset.

Figure 11. The attention heatmaps and the gradient-based class activation maps showing the important regions of the scene images in the UC Merced dataset. (a) Original scene images; (b,d) Attention heatmaps for CCP-net and SPP-net; (c,e) Gradient-based class activation maps for CCP-net and SPP-net.

Figure 12. Several example images that compare results from CCP-net and SPP-net on the UC Merced dataset. Images where SPP-net failed are shown in the first three columns, and images where CCP-net failed are shown in the last column.

Figure 13. Several example images that compare results from CCP-net and SPP-net on the WHU-RS dataset. Images where SPP-net failed are shown in the first two columns, and images where CCP-net failed are shown in the last column.

Table 1. The comparison of the classification accuracy with the spatial pyramid pooling network.

Table 2. The comparison of the classification accuracy (%) on the UC Merced dataset.

Table 3. The comparison of the CCP-net and SPP-net on the UC Merced and the rotated datasets. For each network, the two rows are the results on the UC Merced and the rotated datasets, respectively.