Remote Sensing Image Scene Classification Using CNN-CapsNet

Abstract: Remote sensing image scene classification is one of the most challenging problems in understanding high-resolution remote sensing images. Deep learning techniques, especially the convolutional neural network (CNN), have improved the performance of remote sensing image scene classification owing to their powerful feature learning and reasoning abilities. However, several fully connected layers are always added to the end of CNN models, which is not efficient in capturing the hierarchical structure of the entities in the images and does not fully consider the spatial information that is important to classification. Fortunately, the capsule network (CapsNet) has become an active area in the classification field in the past two years. It is a novel network architecture that uses a group of neurons as a capsule or vector to replace the neuron in the traditional neural network, and it can encode the properties and spatial information of features in an image to achieve equivariance. Motivated by this idea, this paper proposes an effective remote sensing image scene classification architecture named CNN-CapsNet to make full use of the merits of these two models: CNN and CapsNet. First, a CNN without fully connected layers is used as an initial feature map extractor. In detail, a deep CNN model pretrained on the ImageNet dataset is selected as the feature extractor in this paper. Then, the initial feature maps are fed into a newly designed CapsNet to obtain the final classification result. The proposed architecture is extensively evaluated on three challenging public benchmark remote sensing image datasets: the UC Merced Land-Use dataset with 21 scene categories, the AID dataset with 30 scene categories, and the NWPU-RESISC45 dataset with 45 scene categories. The experimental results demonstrate that the proposed method achieves classification performance competitive with state-of-the-art methods.


Introduction
With the development of Earth observation technology, many different types (e.g., multi/hyperspectral [1] and synthetic aperture radar [2]) of high-resolution images of the Earth's surface are readily available. Therefore, it is particularly important to effectively understand their semantic content, and more intelligent identification and classification methods of land use and land cover (LULC) are needed. Remote sensing image scene classification, which aims to automatically assign a specific semantic label to each remote sensing image scene patch according to its contents, has become an active research topic in the field of remote sensing image interpretation because of its vital applications in LULC, urban planning, land resource management, disaster monitoring, and traffic control [3][4][5][6].
During the last decades, several methods have been developed for remote sensing image scene classification. The early methods for scene classification were mainly based on low-level or hand-crafted features, which focus on designing various human-engineered features locally or globally, such as color, texture, shape, and spatial information. Representative features, including the scale invariant feature transform (SIFT), color histogram (CH), local binary pattern (LBP), Gabor filters, grey level co-occurrence matrix (GLCM), and the histogram of oriented gradients (HOG) or their combinations, are usually used for scene classification [7][8][9][10][11][12]. It is worth noting that methods relying on these low-level features perform well on some images with uniform texture or spatial arrangements, but they are still limited in distinguishing images with more challenging and complex scenes, because the involvement of humans in feature design significantly influences the representation capacity for scene images. In contrast to low-level feature-based methods, the mid-level feature approaches attempt to compute a holistic image representation formed by local visual features such as SIFT, color histogram, or LBP of local image patches. The general pipeline of building mid-level features is to extract local attributes of image patches first and then to encode them to obtain the mid-level representation of remote sensing images. The well-known bag-of-visual-words (BoVW) model is the most popular mid-level approach and has been widely adopted for remote sensing image scene classification because of its simplicity and effectiveness [13][14][15][16][17][18]. The methods based on the BoVW have improved the classification performance, but due to the limited representation capability of the BoVW model, no further breakthroughs have been achieved for remote sensing image scene classification.
Recently, with the prevalence of deep learning methods, which have achieved impressive performance on many applications including image classification [19], object recognition [20], and semantic segmentation [21], the feature representation of images has stepped into a new era. Unlike low-level and mid-level features, deep learning models can learn more powerful, abstract and discriminative features via deep-architecture neural networks without a considerable amount of engineering skill and domain expertise. All of these deep learning models, especially the convolutional neural network (CNN), are more applicable for remote sensing image scene classification and have achieved state-of-the-art results [22][23][24][25][26][27][28][29][30][31][32][33][34]. Although the CNN-based methods have dramatically improved classification accuracy, some scene classes are still easily misclassified. Taking the AID dataset as an example, the class-specific classification accuracy of 'school' is only 49% [35], and this class is usually confused with 'dense residential'. As shown in Figure 1, two images labelled 'school' and two images labelled 'dense residential' have been selected from the AID dataset. These four images have similar content, and all contain many buildings and trees. However, whereas the buildings in 'school' are arranged irregularly, the buildings in 'dense residential' are arranged closely and in an orderly manner. This difference in spatial layout is very helpful in distinguishing the two classes and should be given more consideration in the classification phase. However, the fully connected layer at the end of the CNN model compresses the two-dimensional feature map into a one-dimensional feature vector and cannot fully consider the spatial relationship, which makes it difficult to distinguish the two classes.
Remote Sens. 2018, 10, x FOR PEER REVIEW
Recently, the advent of the capsule network (CapsNet) [36], a novel architecture that encodes the properties and spatial relationships of the features in an image and is a more effective image recognition algorithm, has shown encouraging results on image classification. Although the CapsNet is still in its infancy [37], it has been successfully applied in many fields [38][39][40][41][42][43][44][45][46][47][48][49] in recent years, such as brain tumor classification, sound event detection, object segmentation, and hyperspectral image classification. The CapsNet uses a group of neurons as a capsule to replace a neuron in the traditional neural network. The capsule is a vector that represents internal properties and can be used to learn part-whole relationships between various entities, such as objects or object parts, to achieve equivariance [36]. It thereby addresses the problem that traditional neural networks using fully connected layers cannot efficiently capture the hierarchical structure of the entities in images or preserve the spatial information [50].

[Figure 1: two images labelled 'school' and two images labelled 'dense residential' from the AID dataset.]
To further improve the accuracy of remote sensing image scene classification, and motivated by the powerful feature learning ability of deep CNNs and the equivariance property of CapsNet, a new architecture named CNN-CapsNet is proposed in this paper. The proposed architecture is composed of two parts. First, a deep CNN pretrained on the ImageNet [52] dataset, such as VGG-16 [51], is selected, and its intermediate convolutional layer is used as an initial feature map extractor. Then, the initial feature maps are fed into a newly designed CapsNet to label the remote sensing image scenes. Experimental results on three challenging benchmark datasets show that the proposed architecture achieves accuracy competitive with state-of-the-art methods. In summary, the major contributions of this paper are as follows:

• To further improve classification accuracy, especially for classes that have high homogeneity in image content, a novel architecture named CNN-CapsNet is proposed for the remote sensing image scene classification problem, which can discriminate scene classes effectively.
• By combining the CNN and the CapsNet, the proposed method can obtain a superior result compared with the state-of-the-art methods on three challenging datasets without any data-augmentation operation.
• This paper also analyzes the influence of different factors in the proposed architecture on the classification result, including the routing number in the training phase, the dimension of capsules in the CapsNet and different pretrained CNN models, which can provide valuable guidance for subsequent research on remote sensing image scene classification using CapsNet.
The remainder of this paper is organized as follows. In Section 2, the materials are illustrated. Section 3 introduces the theory of CNN and CapsNet first, and then describes the proposed method in detail. Section 4 analyzes the influence of different factors, and discusses the experimental results of the proposed method. Finally, conclusions are drawn in Section 5.

Materials
Three popular remote sensing datasets (UC Merced Land-Use [14], AID [35], and NWPU-RESISC45 [53]) with different visual properties are chosen to better demonstrate the robustness and effectiveness of the proposed method. Details about the datasets are described in Sections 2.1-2.3.

UC Merced Land-Use Dataset
The UC Merced Land-Use dataset is composed of 2100 aerial scene images divided into 21 land use scene classes, as shown in Figure 2. Each class contains 100 images of 256 × 256 pixels in the red green blue (RGB) color space, with a pixel spatial resolution of 0.3 m. These images were selected from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map of the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. It is not only the diversity of land-use categories contained in the dataset that makes it challenging. Some highly overlapped classes, such as dense residential, medium residential and sparse residential, are included in this dataset; they differ mainly in the density of structures, which makes the dataset more difficult to classify. This dataset has been widely used for the task of remote sensing image scene classification [18,[23][24][25]27,28,30,32,[54][55][56][57][58].

AID Dataset
AID is a large-scale aerial image dataset collected from Google Earth imagery, and it is more challenging than the UC Merced Land-Use dataset for the following reasons. First, the AID dataset contains more scene types and images. In detail, it has 10,000 images with a fixed size of 600 × 600 pixels within 30 classes, as shown in Figure 3. Some similar classes make the interclass dissimilarity smaller, and the number of images per scene type varies from 220 to 420. Moreover, AID images were chosen at different times and seasons, under different imaging conditions, and from different countries and regions around the world, including China, the United States, England, France, Italy, Japan, and Germany, which increases the intraclass diversity. Finally, AID images are multiresolution, ranging from approximately 8 m to about half a meter.

NWPU-RESISC45 Dataset
The NWPU-RESISC45 dataset is more complex than the UC Merced Land-Use and AID datasets and consists of a total of 31,500 remote sensing images divided into 45 scene classes, as shown in Figure 4. Each class includes 700 images with a size of 256 × 256 pixels in the RGB color space. This dataset was extracted from Google Earth by experts in the field of remote sensing image interpretation. The spatial resolution varies from approximately 30 to 0.2 m per pixel. This dataset covers more than 100 countries and regions all over the world with developing, transitional, and highly developed economies.
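As a compact summary of the three benchmarks described above, the following sketch collects the stated figures in a small Python dictionary and checks that the class counts, per-class image counts and totals are mutually consistent. The dictionary layout and field names are illustrative assumptions, not part of the datasets themselves.

```python
# Figures taken from the dataset descriptions above; field names are hypothetical.
DATASETS = {
    "UC Merced Land-Use": {
        "classes": 21, "images": 2100, "image_size": 256,
        "per_class": (100, 100), "resolution_m": (0.3, 0.3),
    },
    "AID": {
        "classes": 30, "images": 10000, "image_size": 600,
        "per_class": (220, 420), "resolution_m": (0.5, 8.0),
    },
    "NWPU-RESISC45": {
        "classes": 45, "images": 31500, "image_size": 256,
        "per_class": (700, 700), "resolution_m": (0.2, 30.0),
    },
}


def is_consistent(info):
    """Check that the total image count fits the per-class range."""
    low, high = info["per_class"]
    return info["classes"] * low <= info["images"] <= info["classes"] * high


for name, info in DATASETS.items():
    print(f"{name}: {info['classes']} classes, {info['images']} images, "
          f"consistent={is_consistent(info)}")
```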

Method
In this section, CNN and CapsNet are briefly introduced first, and then the proposed architecture is detailed.

CNN
The convolutional neural network is a type of feed-forward artificial neural network, which is biologically inspired by the organization of the animal visual cortex. CNNs have wide applications in image and video recognition, recommender systems and natural language processing. As shown in Figure 5, a CNN is generally made up of two main parts: convolutional layers and pooling layers. The convolutional layer is the core building block of a CNN, which outputs feature maps by computing a dot product between the local region in the input feature maps and a filter. Each feature map is followed by a nonlinear activation function, which lets the network approximate arbitrarily complex functions and can squash outputs to within certain bounds; the rectified linear unit (ReLU) is commonly used because of its computational efficiency. The pooling layer performs a downsampling operation on feature maps by computing the maximum or average value over a sub-region. Usually, the fully connected layers follow several stacked convolutional and pooling layers, and the last fully connected layer is the softmax layer, which computes the scores for each class.
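The three operations described above (dot product with a filter, ReLU, and max pooling) can be illustrated with a minimal NumPy sketch. It assumes a single-channel input, a single filter, no padding and a stride of 1; the toy image and kernel values are arbitrary and chosen only for illustration.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a dot product at each position.

    As in most deep learning frameworks, the kernel is not flipped
    (this is cross-correlation), and no padding is applied.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


def relu(x):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 input
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])       # toy 2 x 2 filter
feature_map = relu(conv2d_valid(image, kernel))    # shape (5, 5)
pooled = max_pool(feature_map)                     # shape (2, 2)
print(feature_map.shape, pooled.shape)
```

A real CNN stacks many such filters per layer and learns the kernel values by back-propagation; the explicit loops here are for readability only.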

CapsNet
CapsNet is a novel deep learning architecture, which is robust to affine transformation [41]. In CapsNet, a capsule is defined as a vector that consists of a group of neurons, whose parameters can represent various properties of a specific type of entity that is present in an image, such as position, size, and orientation. The length of each activity vector gives the existence probability of the specific object, and its orientation indicates its properties. Figure 6 illustrates the way that CapsNet routes the information from one layer to another by a dynamic routing mechanism [36]: capsules in lower levels predict the outcome of capsules in higher levels, and higher-level capsules are activated only if these predictions agree.

Considering $u_i$ as the output of lower-level capsule $i$, its prediction for higher-level capsule $j$ is computed as:

$$\hat{u}_{j|i} = W_{ij} u_i \qquad (1)$$

where $W_{ij}$ is the weighting matrix that can be learned by back-propagation. Each capsule tries to predict the output of higher-level capsules, and if this prediction conforms to the actual output of higher-level capsules, the coupling coefficient between these two capsules increases. Based on the degree of conformation, coupling coefficients are calculated using the following softmax function:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})} \qquad (2)$$

where $b_{ij}$ is set to 0 at the beginning of the routing-by-agreement process and is the log probability of whether lower-level capsule $i$ should be coupled with higher-level capsule $j$. Then, the input vector to the higher-level capsule $j$ can be calculated as follows:

$$s_j = \sum_{i} c_{ij} \hat{u}_{j|i} \qquad (3)$$

Because the length of the output vector represents the probability of existence, the following nonlinear squash function is applied to the vector computed in Equation (3) to prevent the output vectors of capsules from exceeding one. It is an activation function that shrinks short vectors to almost zero and long vectors to a length close to one:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \qquad (4)$$

where $s_j$ and $v_j$ represent the input vector and output vector, respectively, of capsule $j$. In addition, the log probabilities $b_{ij}$ are updated in the routing process based on the agreement between $v_j$ and $\hat{u}_{j|i}$, according to the rule that if the two vectors agree, they will have a large inner product. Therefore, the agreement $a_{ij}$ for updating log probabilities $b_{ij}$ and coupling coefficients $c_{ij}$ is calculated as follows:

$$a_{ij} = v_j \cdot \hat{u}_{j|i} \qquad (5)$$

As mentioned above, Equations (2)-(5) make up one whole routing procedure for computing $v_j$. The routing algorithm consists of several iterations of the routing procedure [36], and the number of iterations is referred to as the routing number. Take the detection of the 'school' scene type as an example for a clearer explanation. Lengths of the outputs of the lower-level capsules ($u_1, u_2, \ldots, u_I$) encode the existence probability of their corresponding entities (e.g., building, tree, road, and playground). Directions of the vectors encode various properties of these entities, such as size, orientation, and position. In training, the network gradually encodes the corresponding part-whole relationship by the routing algorithm to obtain a higher-level capsule ($v_j$), which encodes the whole scene context that 'school' represents. Thus, the capsule can learn the spatial relationship between entities within an image.
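The routing procedure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the array shapes, the random toy inputs, the small `eps` term for numerical stability and the default of three iterations are all choices made here for the example.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Squash of Equation (4): short vectors shrink toward 0, long ones toward length 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(b, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_routing(u, W, num_iterations=3):
    """Route I lower-level capsules to J higher-level capsules.

    u: (I, D_in) lower-level capsule outputs.
    W: (I, J, D_out, D_in) weighting matrices W_ij.
    Returns v: (J, D_out) higher-level outputs and c: (I, J) coupling coefficients.
    """
    u_hat = np.einsum("ijod,id->ijo", W, u)        # predictions, Equation (1)
    b = np.zeros(u_hat.shape[:2])                  # log probabilities start at 0
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                     # coupling coefficients, Equation (2)
        s = np.einsum("ij,ijo->jo", c, u_hat)      # weighted sum, Equation (3)
        v = squash(s)                              # squash, Equation (4)
        a = np.einsum("ijo,jo->ij", u_hat, v)      # agreement (inner product), Equation (5)
        b = b + a                                  # update log probabilities
    return v, c


rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))                        # 6 lower-level capsules of dimension 8
W = rng.normal(size=(6, 3, 16, 8)) * 0.1           # 3 higher-level capsules of dimension 16
v, c = dynamic_routing(u, W, num_iterations=3)
print(v.shape, c.shape)
```

In a trained network `W` would be learned by back-propagation, while the coupling coefficients are recomputed by routing on every forward pass.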
Each capsule $k$ in the last layer is associated with a loss function $l_k$, which can be computed as follows:

$$l_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2 \qquad (6)$$

where $T_k$ is 1 when class $k$ is actually present and 0 otherwise, and $m^+$, $m^-$ and $\lambda$ are hyper-parameters that must be specified for training. The total loss is simply the sum of the losses of all output capsules of the last layer.

A typical CapsNet is shown in Figure 7 and contains three layers: one convolutional layer (Conv1), the PrimaryCaps layer and the FinalCaps layer. Conv1 converts the input image (raw pixels) to initial feature maps, whose size can be described as H × W × L. Then, by two reshape functions and one squash operation, the PrimaryCaps layer is computed, which contains H × W × L/S1 capsules (each capsule in the PrimaryCaps is an S1-dimensional vector and is denoted as the S1-D vector in Figure 7). The FinalCaps layer has T capsules, where T is the total number of classes to predict (each capsule in the FinalCaps is an S2-dimensional vector and is denoted as the S2-D vector in Figure 7), and each of these capsules receives input from all the capsules in the PrimaryCaps layer. The detail of FinalCaps is illustrated in Figure 8. At the end of the CapsNet, the length of each capsule in FinalCaps is computed by an $L_2$ norm function, and the scene category corresponding to the maximum length is the final classification result.
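To make the shapes and the loss concrete, the sketch below groups H × W × L feature maps into S1-dimensional primary capsules, picks the predicted class from the lengths of the final capsules, and evaluates the per-capsule loss summed over classes. Several details are assumptions made for illustration: a single reshape stands in for the two reshape functions mentioned in the text, the hyper-parameter values m+ = 0.9, m− = 0.1 and λ = 0.5 are common defaults rather than values stated here, and the final capsule vectors are hand-written toy numbers.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Scale each vector so that its length lies in [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def primary_caps(feature_maps, s1):
    """Group H x W x L feature maps into H*W*L/S1 capsules of dimension S1."""
    h, w, l = feature_maps.shape
    if l % s1 != 0:
        raise ValueError("L must be divisible by S1")
    return squash(feature_maps.reshape(h * w * l // s1, s1))


def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over classes of the per-capsule loss l_k (hyper-parameter values assumed)."""
    t = np.zeros_like(lengths)
    t[target] = 1.0
    present = t * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - t) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))


# Toy feature maps with H = W = 6, L = 512 and S1 = 8 (illustrative sizes).
rng = np.random.default_rng(0)
caps = primary_caps(rng.normal(size=(6, 6, 512)), s1=8)
print(caps.shape)                                   # (2304, 8)

# Toy FinalCaps output with T = 3 classes and S2 = 4.
final_caps = np.array([[0.9, 0.0, 0.0, 0.0],
                       [0.0, 0.2, 0.0, 0.0],
                       [0.0, 0.0, 0.05, 0.0]])
lengths = np.linalg.norm(final_caps, axis=1)        # L2 norm of each capsule
predicted = int(np.argmax(lengths))                 # class with the longest capsule
loss = margin_loss(lengths, target=0)
print(predicted, loss)
```

The loss is zero for the true class once its capsule length reaches m+, and zero for every other class once its length falls below m−, so only confident mistakes are penalized.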

Proposed Method
As illustrated in Figure 9, the proposed architecture CNN-CapsNet can be divided into two parts: CNN and CapsNet. First, a remote sensing image is fed into a CNN model, and the initial feature maps are extracted from the convolutional layers. Then, the initial feature maps are fed into CapsNet to obtain the final classification result.

As for the CNN, two representative CNN models (VGG-16 and Inception-V3) pretrained on the ImageNet dataset are used as initial feature map extractors, considering their popularity in the remote sensing field [25,27,28,35,53,56,59]. The "block4_pool" layer of VGG-16 and the "mixed7" layer of Inception-V3 are selected as the layers of initial feature maps, whose sizes are 16 × 16 × 512 and 14 × 14 × 768, respectively, if the input image size is 256 × 256 pixels. The influence of the two pretrained CNN models on the classification results is discussed in Section 4.2. A brief introduction to the two models follows.
• VGG-16: Simonyan et al. [51] presented very deep CNN models that secured the first and the second places in the localization and classification tracks, respectively, of ILSVRC 2014. The two best-performing deep models, named VGG-16 (containing 13 convolutional layers and 3 fully connected layers) and VGG-19 (containing 16 convolutional layers and 3 fully connected layers), were the basis of their team's submission, which demonstrates the importance of model depth. Rather than using relatively large receptive fields in the convolutional layers, such as 11 × 11 with stride 4 in the first convolutional layer of AlexNet [60], VGGNet uses very small 3 × 3 receptive fields throughout the whole network. VGG-16 is the most representative sequence-like CNN architecture, as shown in Figure 5 (consisting of a simple chain of blocks such as convolution and pooling layers), and has achieved great success in the field of remote sensing image scene classification.

• Inception-v3: Unlike a sequence-like CNN architecture such as VGG-16, which only increases the depth of the convolution layers, the Inception-like CNN architecture attempts to increase the width of a single convolution layer: kernels of different sizes are used within the same layer so that features at different scales can be extracted. Figure 10 shows this module, the core component of GoogLeNet [61] (also named Inception-v1). Inception-v3 [62] is an improved version of Inception-v1 and is designed on the following four principles: avoid representation bottlenecks, especially early in the network; higher dimensional representations are easier to process locally within a network; spatial aggregation can be done over lower dimensional embeddings without much or any loss in representation; and balance the width and depth of the network. Inception-v3 reached 21.2% top-1 and 5.6% top-5 error on the ILSVRC 2012 classification task.
For the CapsNet, an architecture analogous to the one shown in Figure 7 is designed, including three layers: one convolutional layer, one PrimaryCaps layer and one FinalCaps layer. A 5 × 5 convolution kernel with a stride of 2 and a ReLU activation function are used in the convolution layer. The number of output feature maps (the variable L) is set to 512. The dimensions of the capsules in the PrimaryCaps and FinalCaps layers (the variables S1 and S2) are vital parameters of the CapsNet, and their influence on the classification result is discussed in Section 4.2. The variable T is determined by the remote sensing dataset and is set to 21, 30, and 45 for the UC Merced Land-Use dataset, AID dataset and NWPU-RESISC45 dataset, respectively. In addition, 50% dropout was used between the PrimaryCaps layer and the FinalCaps layer to prevent overfitting.
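The feature-map sizes quoted for the two extractors (16 × 16 for VGG-16 and 14 × 14 for Inception-v3 with a 256 × 256 input) can be checked with simple size arithmetic. The sketch below assumes the layer configuration of the Keras reference models, in which the selected Inception-v3 layer is named "mixed7"; the lists of kernels, strides and paddings are written from that assumption and only the layers that change the spatial size are included.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid" padding

def vgg16_block4_pool(size):
    # The 3x3 'same' convolutions keep the size; each of the first four
    # blocks ends with a 2x2, stride-2 max pooling layer.
    for _ in range(4):
        size = conv_out(size, 2, 2, "valid")
    return size

def inception_v3_mixed7(size):
    # (kernel, stride, padding) of the size-changing layers up to "mixed7",
    # as assumed from the Keras reference implementation.
    layers = [
        (3, 2, "valid"),  # stem conv
        (3, 1, "valid"),  # stem conv
        (3, 1, "same"),   # stem conv
        (3, 2, "valid"),  # max pool
        (1, 1, "valid"),  # 1x1 conv
        (3, 1, "valid"),  # 3x3 conv
        (3, 2, "valid"),  # max pool
        (3, 2, "valid"),  # grid reduction before the 768-channel modules
    ]
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size

print(vgg16_block4_pool(256), inception_v3_mixed7(256))  # 16 14
```

The same helper can be applied to the 5 × 5, stride-2 convolution of the CapsNet once its padding mode is known; the text does not state it, so it is left out here.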
As shown in Figure 11, the proposed method includes two training phases. In the first training phase, the parameters of the pretrained CNN model are frozen, and the weights in the CapsNet are initialized from a Gaussian distribution with zero mean and unit variance. Then, the CapsNet weights are trained with a learning rate of lr1 to minimize the sum of the margin losses in Equation (6). When the CapsNet is fully trained, the second training phase begins with a lower learning rate lr2 to fine-tune the whole architecture until convergence. The parameters between adjacent capsule layers, except for the coupling coefficients, are updated by a gradient descent algorithm, while the coupling coefficients are determined by the iterative dynamic routing algorithm [36]. The optimal routing number in the iterative dynamic routing algorithm is discussed in Section 3.2. When the training finishes, the testing images are fed into the fully trained CNN-CapsNet architecture to evaluate the classification result.

Implementation Details
In this work, the Keras framework was used to implement the proposed method. The hyperparameters used in the training stage were set by trial and error as follows. For the Adam optimization algorithm, the batch size was set to 64 and 50 to fit the computer memory (because the number of trainable parameters differs between the two training phases); the learning rates lr1 and lr2 were set to 0.001 and 0.0002 for the two training phases, respectively. The sum of all classes' margin losses in Equation (6) was used as the loss function, with m+, m− and λ set to 0.9, 0.1 and 0.5, respectively. All models were trained until the training loss converged. For a fair comparison, the same training ratios were applied in the following experiments as in the experimental settings of [23–25,27,28,30,35,53–57,63–68]. For the UC Merced Land-Use dataset, training ratios of 80% and 50% were used. For the AID dataset, 50% and 20% of the images were randomly selected as the training samples, and the rest were left for testing. For the NWPU-RESISC45 dataset, training ratios of 20% and 10% were used. Two training ratios were considered for each of the three datasets to evaluate the proposed method comprehensively. Different ratios were used for different datasets because the datasets contain different numbers of images; a small ratio can usually satisfy the training requirement of the models when a dataset has a large amount of data. Note that all images in the AID dataset were resized from the original 600 × 600 pixels to 256 × 256 pixels because of memory overflow in the training phase. All implementations were evaluated on an Ubuntu 16.04 operating system with one 3.6 GHz 8-core i7-4790 CPU and 32 GB of memory. Additionally, an NVIDIA GTX 1070 graphics processing unit (GPU) was used to accelerate computing.
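To make the role of m+, m− and λ concrete, the following sketch evaluates the margin loss for one image in plain Python. It assumes that Equation (6) takes the standard form used with dynamic routing [36], namely T_k · max(0, m+ − ‖v_k‖)² + λ · (1 − T_k) · max(0, ‖v_k‖ − m−)² summed over the classes k; the function name and the example capsule lengths are hypothetical.

```python
def margin_loss(lengths, true_class, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses for a single image.

    lengths    -- L2 norms of the T capsules in FinalCaps
    true_class -- index of the ground-truth scene category
    """
    total = 0.0
    for k, length in enumerate(lengths):
        if k == true_class:
            # The true class is penalised when its capsule is shorter than m+.
            total += max(0.0, m_plus - length) ** 2
        else:
            # Other classes are penalised (down-weighted by lambda) when
            # their capsules are longer than m-.
            total += lam * max(0.0, length - m_minus) ** 2
    return total

# Hypothetical capsule lengths for a three-class toy problem.
loss = margin_loss([0.95, 0.05, 0.30], true_class=0)
print(round(loss, 4))  # 0.02, contributed only by the third capsule
```

With these settings a confident correct prediction (length above 0.9 for the true class, below 0.1 for the others) contributes no loss at all.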

Evaluation Protocol
The overall accuracy (OA) and the confusion matrix were computed to evaluate the experimental results and to compare them with the state-of-the-art methods. The OA is defined as the number of correctly classified images divided by the total number of test images, and it measures the performance of a classification method on the whole test set. The value of OA is in the range of 0 to 1, and a higher value indicates a better classification performance. The confusion matrix is an informative table that allows direct visualization of the performance on each class and makes it easy to analyze the errors and confusion between different classes; each column represents the instances in a predicted class and each row represents the instances in an actual class. Thus, each item x_ij in the matrix is the proportion of images that truly belong to the i-th class and are predicted to be the j-th class.
To compute the overall accuracy, the dataset was randomly divided into training and testing sets according to the ratios in Section 4.1.1, and this was repeated ten times to reduce the influence of randomness and obtain a reliable result. The mean and standard deviation of the overall accuracies on the testing sets over the individual runs are reported. Additionally, the confusion matrix was obtained from the best classification results by fixing the training ratios of the UC Merced Land-Use dataset, AID dataset and NWPU-RESISC45 dataset to 50%, 20%, and 20%, respectively.
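The evaluation protocol can be sketched in plain Python as follows. The confusion matrix here uses rows for the actual class and columns for the predicted class and is normalised per row; the labels and the per-run accuracies are hypothetical, and the use of the sample standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, stdev

def overall_accuracy(y_true, y_pred):
    """Correctly classified test images divided by all test images."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalised matrix: rows are actual classes, columns are predictions."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical labels for a three-class toy test set.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
oa = overall_accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Hypothetical OAs from repeated random splits (ten runs in the paper).
run_accuracies = [0.95, 0.96, 0.94]
summary = (mean(run_accuracies), stdev(run_accuracies))
```

In this toy case half of the images of class 0 are predicted as class 1, which shows up as the off-diagonal entry cm[0][1].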

Analysis of Experimental Parameters
In this section, three parameters, namely the routing number, the dimension of the capsules in the CapsNet, and the pretrained CNN model, were tested to analyze how they affect the classification result. The optimal parameters found here were then used in the experiments of Sections 4.2.2 and 4.2.3. Training ratios of 80%, 50% and 20% were selected for the UC Merced Land-Use dataset, AID dataset and NWPU-RESISC45 dataset, respectively, in this section's experiments.

The routing number
In the dynamic routing algorithm, the routing number is a vital parameter that determines whether the CapsNet can obtain the best coupling coefficients. Therefore, it is necessary to select an optimal routing number. The routing number was set to 1, 2, 3 and 4 in turn while the other parameters of the proposed architecture were kept the same. The pretrained VGG-16 model was selected as the primary feature extractor, and the dimensions of the capsules in the PrimaryCaps and FinalCaps layers were set to 8 and 16, respectively. As shown in Figure 12, the OAs first increased and then decreased as the routing number increased for all three datasets, and all reached their peaks at a routing number of 2. A smaller value may lead to inadequate training, and a larger value may miss the optimal fit. In addition, the larger the value is, the longer the required training time. Considering both accuracy and training time, a routing number of 2 was chosen as optimal and was applied in the remaining experiments.
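The role of the routing number can be seen in a compact NumPy sketch of routing-by-agreement. It follows the general scheme of the dynamic routing algorithm [36] and is not the authors' code: the prediction vectors u_hat are random placeholders, and the sizes (32 primary capsules, T = 21 classes, S2 = 16) are chosen only for the example.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squash nonlinearity: keeps the direction, maps the length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=2):
    """Route N lower-level capsules to T higher-level capsules.

    u_hat -- array of shape (N, T, S2): the prediction of each primary
             capsule for each FinalCaps capsule
    Returns the FinalCaps vectors (T, S2) and the coupling coefficients (N, T).
    """
    n, t, _ = u_hat.shape
    b = np.zeros((n, t))  # routing logits, start with uniform coupling
    for it in range(num_iterations):
        # Coupling coefficients: softmax of the logits over the T outputs.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of predictions, then squash.
        s = np.einsum("nt,ntd->td", c, u_hat)
        v = squash(s)
        if it < num_iterations - 1:
            # Raise the logits where prediction and output agree.
            b = b + np.einsum("ntd,td->nt", u_hat, v)
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 21, 16))  # placeholder prediction vectors
v, c = dynamic_routing(u_hat, num_iterations=2)
print(v.shape, c.shape)  # (21, 16) (32, 21)
```

With a routing number of 1 the coupling coefficients stay uniform, so no agreement information is used; each additional iteration adds another pass over all capsule pairs, which is consistent with the longer training times reported for larger routing numbers.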

The dimension of the capsule
The capsule is the core component of CapsNet and consists of many neurons, whose activities within a capsule represent the various properties of a remote sensing scene image. The primary capsules in the PrimaryCaps layer are the lower-level capsules learned from the primary feature maps extracted by the pretrained CNN models, and they can represent small entities in the remote sensing image. The capsules in the FinalCaps layer have a higher dimension, are at a higher level and represent more complex entities, such as the scene class that the image presents. Thus, the dimension of the capsules in the CapsNet matters for the final classification result. When the dimension of the capsule is low, its representation ability is weak, which leads to confusion between scene classes with highly similar image context. In contrast, a capsule with a high dimension may contain redundant information or noise, e.g., two neurons may represent very similar properties. Both cases have a negative influence on the classification result. Thus, a set of (S1, S2) values, (6,12), (8,16), (10,20) and (12,24), was used to evaluate the influence of the capsule dimension. The other parameters were fixed, with the pretrained VGG-16 model as the primary feature extractor and the routing number set to 2. The experimental results are shown in Figure 13. As expected, the OA curves for all three datasets had a single peak. The value (8,16) obtained the best performance, and thus it was used in the following experiments.

Different pretrained CNN models
As described in Section 3.3, two representative CNN architectures (VGG-16 and Inception-v3) were selected as feature extractors to evaluate the effect of the convolutional features on classification. The "block4_pool" layer of VGG-16 and the "mixed7" layer of Inception-v3 were selected as the layers of initial feature maps. The other parameters remained unchanged in the experiment. As shown in Figure 14, the Inception-v3 model achieved the highest classification accuracy on all three datasets. This can be explained by the fact that Inception-v3 consists of inception modules, which can extract multiscale features and have a stronger ability to extract effective features than VGG-16, whereas VGG-16 may lose considerable information because of its repeated pooling layers. Moreover, the OA differences between VGG-16 and Inception-v3 on the AID and NWPU-RESISC45 datasets were more conspicuous than those on the UC Merced dataset. Compared with the UC Merced dataset, the other two datasets have more classes, higher intraclass variations and smaller interclass dissimilarity. Since Inception-v3 is more effective at extracting features from more complex datasets, it was chosen as the final feature extractor for the evaluation of the proposed method.

Classification of the UC Merced Land-Use dataset
To evaluate the classification performance of the proposed method, a comparison against several state-of-the-art classification methods on the UC Merced Land-Use dataset is shown in Table 1. As seen from Table 1, the proposed architecture CNN-CapsNet using pretrained Inception-v3 as the initial feature maps extractor (denoted as Inception-v3-CapsNet) achieved the highest OAs among all methods, 99.05% and 97.59% for the 80% and 50% training ratios, respectively. The CNN-CapsNet using pretrained VGG-16 as the initial feature maps extractor (denoted as VGG-16-CapsNet) also outperformed most methods. This demonstrates that the CNN-CapsNet architecture can learn a higher-level representation of scene images by combining CNN and CapsNet.
Table 1. Overall accuracy (%) and standard deviations of the proposed method and the comparison methods under the training ratios of 80% and 50% on the UC Merced dataset.

Classification of the AID dataset
The AID dataset was also used to demonstrate the effectiveness of the proposed method in comparison with other state-of-the-art methods on the same dataset. The results are shown in Table 2. It can be seen that the Inception-v3-CapsNet model generated the best performance, with OAs of 96.32% and 93.79% when using 50% and 20% of the samples for training, respectively, except that it is approximately 0.53% lower than the GCFs + LOFs method at the 50% training ratio. This may be because downsampling the AID images from 600 × 600 to 256 × 256 in preprocessing causes some loss of important information and has a negative effect on the classification result. However, at the 20% training ratio, the proposed method outperforms GCFs + LOFs by approximately 1.31%. In addition, data augmentation was used in GCFs + LOFs. Thus, overall, the proposed method yields state-of-the-art results on the AID dataset.

As for the analysis of the confusion matrix, shown in Figure 16, 80% of the 30 categories achieved classification accuracies greater than 90%, and the 'mountain' class achieved 100% accuracy. Some categories with small interclass dissimilarity, such as 'sparse residential', 'medium residential', and 'dense residential', were also classified accurately, with 99.17%, 94.83% and 95.73%, respectively. The 'school' and 'resort' classes had relatively low classification accuracies of 67.92% and 72.84%. In detail, the 'school' class was easily confused with 'commercial' because they had the same image distribution. In addition, the 'resort' class was usually misclassified as 'park' because of analogous objects such as green belts and ponds. Even so, great …
Table 2. Overall accuracy (%) and standard deviations of the proposed method and the comparison methods under the training ratios of 50% and 20% on the AID dataset.

Classification of the NWPU-RESISC45 dataset
Table 3 compares the classification performance of the proposed architecture with the existing state-of-the-art methods on the most challenging dataset, NWPU-RESISC45. It can be observed that the Inception-v3-CapsNet model also achieved remarkable classification results, with OA improvements of 0.27% and 1.88% over the second-best model at the 20% and 10% training ratios, respectively. The good performance of the proposed method further verifies the effectiveness of combining the pretrained CNN model and CapsNet.
Figure 17 gives the confusion matrix generated from the best classification result of Inception-v3-CapsNet with the training ratio of 20%. From the confusion matrix, 36 of the 45 categories achieved classification accuracies greater than 90%. The major confusion was between 'palace' and 'church' because both have similar styles of buildings. In spite of that, substantial improvements were still achieved, with 79.3% and 68% compared with 75% and 64% in [53], respectively.

Further Explanation
In Section 4.2.2, it was found that the proposed method obtains state-of-the-art classification results. This mainly benefits from the following three factors: fine-tuning, capsules and the pretrained CNN model. In this section, further analysis is performed on how each factor contributes to the classification performance. The training ratios of the three datasets were the same as those in Section 4.2.1.

Effectiveness of fine-tuning
First, the strategy of fine-tuning was used to train the proposed architecture. To evaluate its effectiveness, the classification results with and without fine-tuning were compared. As shown in Figure 18, the methods with fine-tuning obtained a significant improvement over those without it. The reason is that the features extracted from the pretrained CNN models have a strong relationship with the original task. Fine-tuning adjusts the parameters of the pretrained CNN model to suit the current training datasets and thereby improves accuracy.

Effectiveness of capsules
In the design of the proposed architecture, the CapsNet is used as the classifier to label the remote sensing image, and it uses capsules to replace the neurons of traditional neural networks. To verify that this replacement has a positive impact on the classification results, a comparative experiment was conducted. In detail, a new CNN architecture consisting of one convolutional layer and two fully connected layers was designed as the classifier. The only difference between the new CNN architecture and the CapsNet described in Section 3.3 was the use of neurons in place of capsules; the other parameters, including the training hyperparameters, were all kept the same. The experimental results are shown in Figure 19 (VGG-16-CNN and Inception-v3-CNN in Figure 19 denote the models that use pretrained VGG-16 and Inception-v3 as feature extractors, respectively, and the newly designed CNN architecture as the classifier). For all three datasets, the models using capsules achieved better performance than those using traditional neurons. This further demonstrates that the CapsNet can learn more representative information from scene images.

Effectiveness of the pretrained CNN model
The pretrained CNN model was selected as the initial feature maps extractor instead of designing a new CNN architecture. This is also an important factor in the success of the proposed architecture. For comparison, a CNN architecture (Self-CNN) was designed that contained only four consecutive convolutional layers and whose output feature maps had a size of 16 × 16 × 512, the same as that of the pretrained VGG-16 used in this paper. The parameters of the CapsNet were the same. The new Self-CNN-CapsNet architecture was trained from scratch. The classification results are presented in Figure 20. As the figure shows, the classification accuracy of CNN-CapsNet with the pretrained CNN model as the feature extractor was much higher than that with Self-CNN. This is because the existing datasets cannot fully train such a model from scratch, which further supports the effectiveness of using pretrained CNN models as feature extractors.

Conclusions
In recent years, the prevalence of deep learning methods, especially the CNN, has brought the performance of remote sensing scene classification to the state of the art. However, scene classes with the same image distribution are still not distinguished effectively. This is mainly because fully connected layers are added to the end of the CNN, which gives less consideration to the spatial relationships that are vital to classification. To preserve the spatial information, the CapsNet architecture was proposed, which uses capsules to replace the neurons of the traditional neural network. A capsule is a vector that represents internal properties and can be used to learn part-whole relationships within an image. In this paper, inspired by the CapsNet and to further improve classification accuracy, a novel architecture named CNN-CapsNet is proposed for remote sensing image scene classification. The proposed architecture consists of two parts: CNN and CapsNet. The CNN part transforms the original remote sensing images into the original feature maps. The CapsNet part converts the original feature maps into various levels of capsules to obtain the final classification result. Experiments were performed on three challenging public datasets, and the experimental results demonstrate the effectiveness of the proposed CNN-CapsNet and show that the proposed method outperforms the current state-of-the-art methods. In future work, rather than using feature maps from only one CNN model as in this paper, feature maps from different pretrained CNN models will be merged for remote sensing image scene classification.

Figure 1. Sample images labelled school and dense residential in the AID dataset.

Figure 6. Connections between the lower level and higher level capsules.

Figure 9. The architecture of the proposed classification method.

Figure 11. The flowchart of the proposed method.

Figure 12. The influence of the routing number on the classification accuracy.

Figure 13. The influence of the dimension of the capsule on the classification accuracy.

Figure 14 .
Figure 14.The influence of pretrained CNN models on the classification accuracy.

Figure 15. Confusion matrix of the proposed method on the UC Merced Land-Use dataset by fixing the training ratio to 50%.

Figure 16. Confusion matrix of the proposed method on the AID dataset by fixing the training ratio as 20%.

Figure 17. Confusion matrix of the proposed method on the NWPU-RESISC45 dataset by fixing the training ratio as 20%.

Figure 18. Overall accuracy (%) of the proposed method with and without fine-tuning on three datasets.

Figure 19. Overall accuracy (%) of the proposed method with neuron and capsule on three datasets.

Figure 20. Overall accuracy (%) of the proposed method with the pretrained model and self-model on three datasets.

Table 3. Overall accuracy (%) and standard deviations of the proposed method and the comparison methods under the training ratios of 20% and 10% on the NWPU-RESISC45 dataset.
