Deep Convolutional Capsule Network for Hyperspectral Image Spectral and Spectral-Spatial Classification

Capsule networks can be considered the next generation of deep learning models and have recently shown their advantages in supervised classification. Instead of using scalar values to represent features, capsule networks use vectors to represent features, which enriches the feature representation capability. This paper introduces a deep capsule network for hyperspectral image (HSI) classification to improve the performance of conventional convolutional neural networks (CNNs). Furthermore, a modification of the capsule network named Conv-Capsule is proposed. Instead of using full connections, local connections and shared transform matrices, which are the core ideas of CNNs, are used in the Conv-Capsule network architecture. In Conv-Capsule, the number of trainable parameters is reduced compared to the original capsule, which potentially mitigates the overfitting issue when the number of available training samples is limited. Specifically, we propose two schemes: (1) A 1D deep capsule network is designed for spectral classification, as a combination of principal component analysis, CNN, and the Conv-Capsule network, and (2) a 3D deep capsule network is designed for spectral-spatial classification, as a combination of extended multi-attribute profiles, CNN, and the Conv-Capsule network. The proposed classifiers are tested on three widely-used hyperspectral data sets. The obtained results reveal that the proposed models provide competitive results compared to state-of-the-art methods, including kernel support vector machines, CNNs, and recurrent neural networks.


Introduction
The task of classification, when it relates to hyperspectral images (HSIs), generally refers to assigning a label to each pixel vector in the scene [1]. HSI classification is a crucial step for a plethora of applications including urban development [2][3][4], land change monitoring [5][6][7], scene interpretation [8,9], resource management [10,11], and so on. Due to the fundamental importance of this step in various applications, classification of HSI is one of the hottest topics in the remote sensing community. However, the classification of HSI is still challenging due to several factors such as high dimensionality, a limited number of training samples, and complex imaging situations [1].
During the last few decades, a huge number of methods have been proposed for HSI classification [12][13][14]. Due to the availability of abundant spectral information in HSIs, many spectral classifiers have been proposed for HSI classification, including k-nearest-neighbors, maximum likelihood, neural networks, logistic regression, and support vector machines (SVMs) [1,15,16].
Hyperspectral sensors provide rich spatial information as well, and the spatial resolution is becoming finer along with the development of sensor technologies. With the help of spatial information, classification performance can be greatly improved [17]. Among the spectral-spatial classification techniques, the generation of a morphological profile is a widely-used approach, which is usually followed by either an SVM or a random forest classifier to obtain the final classification result [18][19][20][21]. As an extension of the SVM, multiple kernel learning is another mainstream approach to spectral-spatial HSI classification, which has a powerful capability to handle the heterogeneous spectral-spatial features extracted from hyperspectral images [22].
Due to complex atmospheric conditions, scattering from neighboring objects, intra-class variability, and varying sunlight intensity, it is very important to extract invariant and robust features from HSIs for accurate classification. Deep learning uses hierarchical models to extract invariant and discriminative features from HSIs in an effective manner and usually leads to accurate classification. During the past few years, many deep learning methods have been proposed for HSI classification. Deep learning includes a broad family of models, including the stacked auto-encoder, the deep belief network, the deep convolutional neural network (CNN), and the deep recurrent neural network. All of the aforementioned deep models have been used for HSI classification [23,24].
The stacked auto-encoder was the first deep model to be investigated for HSI feature extraction and classification [25]. In [25], two stacked auto-encoders were used to hierarchically extract spectral and spatial features. The extracted invariant and discriminant features led to a better classification performance. Furthermore, the deep belief network was recently introduced for HSI feature extraction and classification [26,27].
Because of the unique and useful model architecture of CNNs (e.g., local connections and shared weights), such networks usually outperform other deep models in terms of classification accuracy. In [28], a well-designed CNN with five layers was proposed to extract spectral features for accurate classification. In [29], a CNN-based spectral classifier that elaborately uses pixel-pair information was proposed, and it was shown to obtain good classification performance under the condition of a limited number of training samples.
Most of the existing CNN-based HSI classification methods have been generalized to consider both spectral and spatial information in a single classification framework. The first spectral-spatial classifier based on a CNN was introduced in [30], which was a combination of principal component analysis (PCA), a deep CNN, and logistic regression. Due to the fact that the inputs of deep learning models are usually 3D data, it is reasonable to design 3D CNNs for HSI spectral-spatial classification [31,32]. Furthermore, the CNN can be combined with other powerful techniques to improve the classification performance. In [33], the CNN was combined with sparse representation to refine the learnt features. CNNs can also be connected with other spatial feature extraction methods, such as morphological profiles and Gabor filtering, to further improve the classification performance [34,35].
The pixel vector of an HSI can be inherently considered sequential data, and recurrent neural networks have the capability of characterizing sequential data. Therefore, in [27], a deep recurrent neural network was proposed that analyzes hyperspectral pixel vectors as sequential data and then determines information categories via network reasoning.
Although deep learning models have shown their capabilities for HSI classification, some disadvantages exist which downgrade the performance of such techniques. In general, deep models require a huge number of training samples to reliably train the large number of parameters in their networks. On the other hand, having insufficient training samples is a frequent problem in remotely sensed image classification. In 2017, Sabour et al. proposed a new idea based on capsules, which showed its advantages in coping with a limited number of training samples [36]. Furthermore, traditional CNNs usually use a pooling layer to obtain invariant features from the input data, but the pooling operation loses the precise positional relationships of features. In hyperspectral remote sensing, abundant spectral information and the positional relationships within a pixel vector are crucial factors for accurate spectral classification. Therefore, it is important to maintain the precise positional relationships in the feature extraction stage. In addition, when it comes to extracting spectral-spatial features from HSIs, it is also important to preserve the positional relationships of spectral-spatial features. Moreover, most of the existing deep methods use a scalar value to represent the intensity of a feature. In contrast, capsule networks use vectors to represent features. The use of vectors enriches the feature representation and is a more promising approach to feature learning than scalar representation [36,37]. These properties of the capsule network align well with the goals of this study and the current demands in the hyperspectral community.
Deep learning-based methods, including deep capsule networks, have a powerful feature extraction capability when the number of training samples is sufficient. Unfortunately, the availability of only a limited number of training samples is a common bottleneck in HSI classification. Deep models are often over-trained with a limited number of training samples, which downgrades the classification accuracy on test samples. In order to mitigate the overfitting problem and lessen the feature extraction workload of deep models, the idea of a local connection-based capsule network is proposed in this study. The proposed Conv-Capsule network uses local connections and shared transform matrices to reduce the number of trainable parameters compared to the original capsule, which potentially mitigates the overfitting issue when the number of available training samples is limited.
In the current study, the idea of the capsule network is modified for HSI classification. Two deep capsule classification frameworks, 1D-Capsule and 3D-Capsule, are proposed as spectral and spectral-spatial classifiers, respectively. Furthermore, two modified capsule networks, i.e., 1D-Conv-Capsule and 3D-Conv-Capsule, are proposed to further improve the classification accuracy.
The main contributions of the paper are briefly summarized as follows.
(1) A convolutional capsule (Conv-Capsule) layer is proposed, which uses local connections and shared transform matrices to reduce the number of trainable parameters and mitigate overfitting when the number of training samples is limited. (2) A 1D deep capsule network, combining PCA, a CNN, and the Conv-Capsule layer, is designed for spectral classification. (3) A 3D deep capsule network, combining extended multi-attribute profiles, a CNN, and the Conv-Capsule layer, is designed for spectral-spatial classification.
The rest of the paper is organized as follows. Section 2 presents the background of deep learning and the capsule network. Sections 3 and 4 are dedicated to the details of the proposed deep capsule network frameworks, including the spectral and spectral-spatial architectures for HSI classification. The experimental results are reported in Section 5. In Section 6, the conclusions and discussions are presented.

Convolutional Neural Networks
In general, the CNN is a special case of the deep neural network, which is loosely inspired by the biological visual system [38]. Compared with other deep learning methods, there are two unique factors in the architecture of the CNN, i.e., local connections and shared weights. Since each neuron only responds to a small region known as its receptive field, the CNN efficiently explores structural correlations. Furthermore, the CNN replicates weights and biases across the entire layer, which significantly reduces the number of parameters in the network. By using specific architectural features like local connections and shared weights, the CNN tends to provide better generalization for a wide variety of applications.
There are three main building blocks in CNNs: a convolution layer, a nonlinear transformation, and a pooling operation. By stacking several convolution layers with nonlinear operations and several pooling layers, a deep CNN can be established [39].
A convolutional layer can be defined as follows:

$x_j^l = f\left( \sum_{i=1}^{M} x_i^{l-1} * k_{ij}^l + b_j^l \right),$

where matrix $x_i^{l-1}$ is the i-th feature map of the previous (l−1)-th layer, $x_j^l$ is the j-th feature map of the current l-th layer, and M is the number of input feature maps. $k_{ij}^l$ is the convolution kernel, which is randomly initialized, and $b_j^l$ is the bias, which is initialized to zero. Furthermore, f(·) is a nonlinear function known as the rectified linear unit (ReLU), and * is the convolution operation [40].
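As a rough NumPy sketch of this layer (the array sizes and the explicit loops are illustrative, not those used in the paper's networks), the sum over input maps, the bias, and the ReLU can be written as:

```python
import numpy as np

def relu(x):
    # f(.) in the equation above: the rectified linear unit
    return np.maximum(0.0, x)

def conv_layer(feature_maps, kernels, biases):
    """One 1D convolutional layer: x_j = f(sum_i x_i * k_ij + b_j).

    feature_maps: (M, L)    -- M input maps of length L
    kernels:      (M, J, K) -- kernel k_ij of size K per input i, output j
    biases:       (J,)      -- one bias per output map
    Returns (J, L-K+1) output maps after the ReLU nonlinearity.
    """
    M, L = feature_maps.shape
    _, J, K = kernels.shape
    out = np.zeros((J, L - K + 1))
    for j in range(J):
        acc = np.zeros(L - K + 1)
        for i in range(M):
            # 'valid' sliding-window product of map i with kernel k_ij
            for t in range(L - K + 1):
                acc[t] += np.dot(feature_maps[i, t:t + K], kernels[i, j])
        out[j] = relu(acc + biases[j])
    return out
```

The triple loop is written for readability; a real implementation would use a vectorized convolution routine instead.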
The pooling operation offers invariance by reducing the resolution of the feature maps.The neuron in the pooling layer combines a small N × N (e.g., N = 2) patch of the convolution layer.The most common pooling operation is max pooling.
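A minimal sketch of the non-overlapping N × N max pooling described above, assuming a single 2D feature map:

```python
import numpy as np

def max_pool2d(feature_map, n=2):
    """Non-overlapping n x n max pooling of one 2D feature map."""
    h, w = feature_map.shape
    h, w = h - h % n, w - w % n            # drop any ragged border
    patches = feature_map[:h, :w].reshape(h // n, n, w // n, n)
    return patches.max(axis=(1, 3))        # max over each n x n patch
```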
All parameters in the deep CNN model are trained using the back-propagation algorithm.In this study, CNN is adopted as the feature extraction method, and the extracted features are fed to the deep capsule network for further processing.

Capsule Network
The capsule network is a modification of the traditional neural network, which uses a group of neurons to obtain the vector representations of a specific type of entity.
In [36], the input to a capsule, $s_j$, is a weighted sum of the prediction vectors $\hat{u}_{j|i}$ from the capsules in the previous layer:

$s_j = \sum_i c_{ij} \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i,$

where $\hat{u}_{j|i}$ is obtained by multiplying the output $u_i$ of a capsule in the previous layer by a transform matrix $W_{ij}$, and $c_{ij}$ represents the coupling coefficients determined by a process called dynamic routing [36].
The capsule uses the length of its output vector to represent the probability that the entity exists, and a nonlinear function, which we call the squash function, is used to squash the vector:

$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|},$

where $v_j$ is the output of capsule j, which is a vector representation of the input, whereas the traditional neural network uses a scalar value to give the final probability of the entity. There are some advantages to using a vector representation instead of a scalar value. The vector representation uses the length of the activity vector to obtain the probability of the entity, and it also gives the orientation of the entity. In traditional CNNs, a pooling layer is used to make the network invariant to small changes in the inputs, but its effectiveness is limited [37]. CNNs are not robust to translation, rotation, and scale changes, which usually downgrades their classification performance. In the capsule network, the output of a capsule is a vector representation of a type of entity [36]. When changes occur on the entity, the length of the corresponding output vector of the capsule may not change greatly. Through the capsule network, we can therefore obtain a more robust representation of the input.
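The squash function is straightforward to implement; a minimal sketch (the small `eps` guarding against division by zero is our own addition) is:

```python
import numpy as np

def squash(s, eps=1e-9):
    """v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||).

    Keeps the direction of s while mapping its length into [0, 1),
    so the length can act as a probability.
    """
    norm_sq = np.dot(s, s)
    return (norm_sq / (1.0 + norm_sq)) * s / (np.sqrt(norm_sq) + eps)
```

For example, squashing the vector (3, 4) (length 5) yields a vector in the same direction with length 25/26, close to but below 1.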

Capsule Network for HSI Classification
The capsule network can be combined with a traditional neural network (e.g., a CNN) to formulate a classification system for a specific task (e.g., HSI classification). In the remote sensing community, two works have already adopted capsule networks for HSI classification. Paoletti et al. [41] and Deng et al. [42] adopted the capsule network for HSI classification and achieved good classification performance. In this context, Paoletti et al. proposed a spectral-spatial capsule network to capture highly abstract features for HSI classification while reducing the network design complexity. The classification results in [41] demonstrate that the proposed method can extract more relevant and complete information about HSI data cubes. Deng et al. presented a modified two-layer capsule network capable of handling a limited number of training samples for HSI classification.
Previous capsule networks contained a fully-connected capsule layer, which led to a large number of trainable parameters. Having a large number of parameters may cause an overfitting problem when the number of training samples is limited. In this study, an improved capsule network named Conv-Capsule, which uses local connections and shared transform matrices, is proposed. Conv-Capsule dramatically reduces the number of trainable parameters and mitigates the overfitting issue in HSI classification. Furthermore, the previous capsule networks for HSI classification are spectral-spatial classifiers. In this study, a 1D capsule network is also proposed as a spectral classifier to enrich the classification techniques for HSI. The details of our proposed methods are explicitly explained in Sections 3 and 4.

One-Dimensional Convolutional Capsule
Deep learning models use multilayer neural networks to hierarchically extract features from the input data, which is the key factor behind the effectiveness of deep learning-based methods. The traditional capsule network does not contain multiple capsule layers. Therefore, it is necessary to build a multilayer capsule network.
Simply stacking capsule layers can produce a deep capsule network. However, the traditional capsule layer is fully connected and contains a huge number of trainable parameters. The problem is even worse when the number of training samples is limited. Inspired by the CNN, local connections and shared transform matrices, which are the core ideas of CNNs, are combined with the dynamic routing algorithm in the capsule layer; we call the result the convolutional capsule (Conv-Capsule) layer. In the Conv-Capsule layer, each capsule in the current layer only connects with the capsules within its local receptive field in the last capsule layer. The transform matrices in the local connections are shared across the entire layer.
In HSI classification, spectral classification is an important research direction. To develop a 1D capsule network for HSI classification, a 1D Conv-Capsule layer needs to be utilized. Here is a description of the 1D Conv-Capsule layer, which we use to shape the spectral classifier. The input of a capsule $s_j^x$ in a 1D Conv-Capsule layer is a weighted sum of the prediction vectors $\hat{u}_{j|i}^{x+p}$ from all channels of the capsules within its receptive field in the last capsule layer. Furthermore, $\hat{u}_{j|i}^{x+p}$ is obtained by multiplying $u_i^{x+p}$ from the capsule in the last layer by the corresponding transform matrix $W_{ij}^p$, which is shared across the last capsule layer. By using the squash function, the output of the capsule, $v_j^x$, can be obtained from the input $s_j^x$. The equations used here are listed as follows:

$s_j^x = \sum_{i=1}^{I} \sum_{p=1}^{P} c_{ij}^p \hat{u}_{j|i}^{x+p}, \qquad \hat{u}_{j|i}^{x+p} = W_{ij}^p u_i^{x+p}, \qquad v_j^x = \mathrm{squash}(s_j^x),$

where I and P are the number of capsule channels in the last capsule layer and the kernel size in the current 1D Conv-Capsule layer, respectively. $s_j^x$ is the input of the j-th channel capsule at position x in the current 1D Conv-Capsule layer, and $v_j^x$ is the corresponding output.

Dynamic Routing Algorithm in the 1D Conv-Capsule Layer
Between two consecutive capsule layers, we use the dynamic routing algorithm to iteratively update the coupling coefficients. The procedure of the dynamic routing algorithm in the 1D Conv-Capsule layer is described as follows. From the description of the 1D Conv-Capsule layer in the last subsection, we know that each capsule in the current 1D Conv-Capsule layer receives prediction vectors from the capsules within its receptive field in the last capsule layer. The weight of each prediction vector is represented by a coupling coefficient. The coupling coefficients between capsule $u_i^{x+p}$ in the last capsule layer and all channel capsules at the same position in the 1D Conv-Capsule layer sum to 1 and are obtained by a softmax function:

$c_{ij}^p = \frac{\exp(b_{ij}^p)}{\sum_{k} \exp(b_{ik}^p)},$

where k ranges over the capsule channels in the current layer, and $b_{ij}^p$ is initialized to 0 before the training begins and is determined by the dynamic routing algorithm.
In the dynamic routing algorithm, the coefficient $b_{ij}^p$ is iteratively refined by measuring the agreement between the prediction vector $\hat{u}_{j|i}^{x+p}$ and $v_j^x$. If strong agreement is reached, capsule $u_i^{x+p}$ makes a good prediction for capsule $v_j^x$, and the coefficient $b_{ij}^p$ is significantly increased. In our network, the agreement is quantified as the inner product of the two vectors, which is added to $b_{ij}^p$:

$b_{ij}^p \leftarrow b_{ij}^p + \hat{u}_{j|i}^{x+p} \cdot v_j^x.$

The pseudo code of the dynamic routing algorithm in the 1D Conv-Capsule layer is shown in Table 1.

Table 1. Dynamic routing algorithm in the 1D Conv-Capsule layer.
1.  procedure Routing
2.  for x in spectral dimension in current capsule layer
3.    for j-th channel capsule in current capsule layer
4.      for p across kernel size
5.        for i-th channel capsule in last capsule layer
6.          initialize coupling coefficients b^p_ij ← 0
7.  for r iterations
8.    c^p_ij ← softmax(b^p_ij) over channels j
9.    for j-th channel capsule in current capsule layer
10.     s^x_j ← Σ_i Σ_p c^p_ij û^{x+p}_{j|i}
11.     v^x_j ← squash(s^x_j)
12.   b^p_ij ← b^p_ij + û^{x+p}_{j|i} · v^x_j
13. return v^x_j
14. end
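The routing loop can be sketched in NumPy. For clarity, this simplified version treats the two capsule layers as fully connected, ignoring the 1D convolutional sharing of transform matrices; `u_hat` holds precomputed prediction vectors û_{j|i}, and the sizes are illustrative:

```python
import numpy as np

def squash(s, eps=1e-9):
    # Squash each capsule vector along the last axis.
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / (np.sqrt(norm_sq) + eps)

def dynamic_routing(u_hat, r=3):
    """Routing-by-agreement between two fully connected capsule layers.

    u_hat: (I, J, D) -- prediction vector u_hat[i, j] of input capsule i
           for output capsule j, each of dimension D.
    Returns (J, D) output capsules v_j after r routing iterations.
    """
    I, J, D = u_hat.shape
    b = np.zeros((I, J))                          # routing logits, start at 0
    for _ in range(r):
        # coupling coefficients: softmax of b over the output capsules j
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)     # weighted sum of predictions
        v = squash(s)
        b += np.einsum('ijd,jd->ij', u_hat, v)    # agreement (inner product)
    return v
```

Because of the squash function, every returned capsule has length strictly below 1, so the lengths can be read as probabilities.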

One-Dimensional Capsule Framework for HSI Classification
The main framework of the 1D-Conv-Capsule network, which is based on the integration of principal component analysis (PCA), a convolutional neural network, and the capsule network, is shown in Figure 2. We build this framework based on HSI spectral features and only use the spectral vectors of the training data to train the model. As illustrated in Figure 2, PCA is first used to reduce the dimensionality of the input data [43], which leads to fewer trainable parameters in the network. Then, m principal components of each pixel are chosen as the inputs to the network. Through the capsule network, the predicted label of each pixel can be obtained.
The proposed 1D-Conv-Capsule network contains six layers. The first layer is an input layer which has m principal components for each pixel. The second and third layers are convolutional layers, which are the same as traditional convolutional layers in a CNN. The fourth layer is the first capsule layer, with I channels of convolutional d_1-dimensional capsules, which means that each capsule contains d_1 convolutional units. The fifth layer is a 1D Conv-Capsule layer which outputs J channels of d_2-dimensional capsules. The last layer is a fully connected capsule layer that has n_class (n_class is the number of classes) d_3-dimensional capsules. Each capsule in the last layer represents one class, and the layer is called the ClassCaps layer for short. The length of the vector output of each capsule represents the probability of the input spectral vector belonging to each class. $\|\cdot\|_2$ in Figure 2 is the Euclidean norm of a vector (i.e., the length of the vector). Some details about the network are given below.
In the first two convolutional layers, which are no different from traditional convolutional layers, we use a leaky rectified linear unit (LeakyReLU) to obtain a nonlinear mapping [44]:

$f(x) = \max(\alpha x, x),$

where α is a small positive scalar value. The fourth layer is a transition layer and is also the first capsule layer. This layer translates convolutional units into capsules. Although convolution is still the fundamental operation in this layer, it has many differences from a traditional convolutional layer. In a traditional convolutional layer, each channel's convolution outputs one feature map. In the convolutional capsule layer, each channel outputs p (i.e., the number of neural units each capsule contains) feature maps. Then, the p convolutional units at the same location of the p feature maps represent one capsule. The activation of these convolutional units gives the output of each capsule using Equation (6).
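The grouping of convolutional units into capsules in this transition layer amounts to a reshape; a sketch with hypothetical sizes (8 channels, p = 8 units per capsule, 20 spectral positions, chosen only for illustration):

```python
import numpy as np

# Hypothetical transition-layer output: 8 convolution channels, each
# producing p = 8 feature maps over 20 spectral positions.
channels, p, length = 8, 8, 20
conv_out = np.random.randn(channels, p, length)   # (channel, unit, position)

# The p convolutional units sharing a channel and a position are grouped
# into one p-dimensional capsule: one capsule per channel per position.
capsules = np.transpose(conv_out, (0, 2, 1))      # -> (channels, length, p)
```

Each length-p vector along the last axis would then be passed through the squash function to produce the capsule output.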
In the second, third, and fourth layers, the convolution operation is followed by batch normalization (BN) and the LeakyReLU activation function [45]. There is no pooling operation in the proposed network.
The fifth layer is a 1D Conv-Capsule layer.Local connections and shared transform matrices are used in this layer.We use the dynamic routing algorithm described in the last section to iteratively update coupling coefficients.Then, we can get the output of the capsule in this layer.
The last layer, which we call ClassCaps layer, is a fully connected capsule layer.The dynamic routing algorithm is also used in this layer.
Each capsule in the ClassCaps layer represents one class. The probability of a pixel belonging to a class is denoted by the length of the vector output of the corresponding capsule. In our network, we use the margin loss as the loss function:

$L_j = T_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - T_j) \max(0, \|v_j\| - m^-)^2,$

where $T_j = 1$ if the pixel belongs to class j. The parameter $m^+$ means that if the length of the vector output $\|v_j\|$ is larger than $m^+$, we can be confident that the pixel belongs to class j. The parameter $m^-$ means that when $\|v_j\|$ is smaller than $m^-$, we can be confident that the pixel does not belong to class j.
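A minimal sketch of the margin loss with $T_j$, $m^+$, $m^-$, and λ as defined above:

```python
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss summed over the ClassCaps outputs.

    v_lengths: (n_class,) lengths ||v_j|| of the class capsules.
    targets:   (n_class,) one-hot label, T_j in {0, 1}.
    """
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return np.sum(present + absent)
```

With the default settings, a capsule length above 0.9 for the true class and below 0.1 for all other classes incurs zero loss.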
The loss for the class that the pixel does not belong to may stop the initial learning from shrinking the length of vector output for all capsules in the ClassCaps layer.So λ is used to down-weight it.

Three-Dimensional Convolutional Capsule
The 1D capsule network only extracts spectral features for HSI classification. To obtain an excellent classification performance, spatial information should be taken into consideration. Therefore, we further develop the 3D capsule network for HSI classification. A 3D Conv-Capsule layer is used in the 3D capsule network and is described below.
For each capsule in the 3D Conv-Capsule layer, all capsules in its receptive field make a prediction through the transform matrix. Then, the weighted sum of all prediction vectors serves as the input of the capsule. Finally, the input vector is squashed by a nonlinear function (i.e., the squash function) to generate the output of the capsule. The detailed equations are listed below:

$s_j^{xy} = \sum_{i=1}^{I} \sum_{p=1}^{P} \sum_{q=1}^{Q} c_{ij}^{pq} \hat{u}_{j|i}^{(x+p)(y+q)}, \qquad \hat{u}_{j|i}^{(x+p)(y+q)} = W_{ij}^{pq} u_i^{(x+p)(y+q)}, \qquad v_j^{xy} = \mathrm{squash}(s_j^{xy}),$

where I is the number of capsule channels in the last capsule layer, and P and Q represent the kernel size.
Furthermore, $u_i^{(x+p)(y+q)}$ is the output of the i-th channel capsule in the last capsule layer at position (x + p, y + q). In addition, $W_{ij}^{pq}$ is the shared transform matrix between the i-th channel capsule in the last capsule layer and the j-th channel capsule in the current Conv-Capsule layer, and $c_{ij}^{pq}$ represents the corresponding coupling coefficients determined by the dynamic routing algorithm. Figure 3 shows an illustration of the 3D Conv-Capsule layer. The dynamic routing algorithm in the 3D Conv-Capsule layer is similar to the one in the 1D Conv-Capsule layer. The pseudo code is shown in Table 2.

Table 2. Dynamic routing algorithm in the 3D Conv-Capsule layer.
1.  procedure Routing
2.  for x in width dimension in current capsule layer
3.  for y in height dimension in current capsule layer
4.    for j-th channel capsule in current capsule layer
5.      for p across kernel width
6.        for q across kernel height
7.          for i-th channel capsule in last capsule layer
8.            initialize coupling coefficients b^{pq}_ij ← 0
9.  for r iterations
10.   c^{pq}_ij ← softmax(b^{pq}_ij) over channels j
11.   s^{xy}_j ← Σ_i Σ_p Σ_q c^{pq}_ij û^{(x+p)(y+q)}_{j|i}
12.   v^{xy}_j ← squash(s^{xy}_j)
13.   b^{pq}_ij ← b^{pq}_ij + û^{(x+p)(y+q)}_{j|i} · v^{xy}_j
14. return v^{xy}_j
15. end
Three-Dimensional Capsule Framework for HSI Classification
The main framework of the 3D-Conv-Capsule network is shown in Figure 4. Different from the 1D-Conv-Capsule network, which extracts spectral features only, the spatial information of HSIs is also taken into consideration. From the framework shown in Figure 4, it can be seen that, first, the EMAP (Extended Multi-Attribute Profile) is used as a preprocessing technique, which significantly reduces the dimensionality of the inputs and the number of training parameters. Then, the a × a neighbors of each pixel, as input 3D images, are imported to the 3D-Conv-Capsule network. Through the network, each pixel gets n_class (i.e., the number of classes) d_3-dimensional capsules. Each capsule represents a class of entity. The length of the output vector of each capsule shows the probability that the corresponding entity exists. In other words, it represents the probability of the pixel belonging to each class. Therefore, the classification results can be obtained by calculating the lengths of the vectors.
Attribute profiles (APs), the basis of EMAP, are a generalization of the widely-used morphological profiles (MPs) [20]. EMAP uses multiple morphological attributes to replace the fixed structuring elements, which enables the EMAP to model spatial information more accurately.
In order to extract spatial information more comprehensively, different kinds of attributes can be used. In this paper, four attributes are considered: (1) a, the area of the regions; (2) d, the length of the diagonal of the box bounding the region; (3) i, the first moment of Hu [46]; and (4) s, the standard deviation. EMAPs are generated by concatenating EAPs (Extended Attribute Profiles) computed with different attributes, where the EAPs are obtained by applying APs to the principal components extracted by PCA.
Similar to the 1D-Conv-Capsule network, the 3D-Conv-Capsule network also has six layers, i.e., an input layer, two convolutional layers, and three consecutive capsule layers. The two convolutional layers serve as local feature detectors. Then, a transition layer (i.e., the first capsule layer), similar to that of the 1D-Conv-Capsule network, is adopted. In the last two capsule layers, the dynamic routing algorithm is used to calculate the capsule outputs of the Conv-Capsule layer and the ClassCaps layer. Compared to the 1D-Conv-Capsule network, the input data change from 1D spectral information to 3D spectral-spatial information, and the 1D convolution operation changes to a 2D convolution operation. The 3D-Conv-Capsule network uses the ReLU as the activation function. Batch normalization is also used to alleviate the overfitting problem and boost the classification accuracy.

Data Description
In our study, three widely-used hyperspectral data sets with different environmental settings were used to validate the effectiveness of the proposed methods.They were captured over Salinas Valley in California (Salinas), Kennedy Space Center (KSC) in Florida, and an urban site over the University of Houston campus and the neighboring area (Houston).
The first data set was captured by the 224-band AVIRIS sensor over Salinas Valley, California. After removing the low signal-to-noise ratio (SNR) bands, the available data set was composed of 204 bands with 512 × 217 pixels. The ground reference map covers 16 classes of interest. The hyperspectral image is of high spatial resolution (3.7-meter pixels). Figure 5 demonstrates the false-color composite image and the corresponding ground reference map. The number of samples in each class is listed in Table 3. The second data set, KSC, was collected by the airborne AVIRIS instrument over the Kennedy Space Center, Florida. The KSC data were acquired from an altitude of approximately 20 km, with a spatial resolution of 18 m. After removing water absorption and low-SNR bands, 176 bands with 512 × 614 pixel vectors were used for the analysis. For classification purposes, 13 classes were selected. The classes of the KSC data set and the corresponding false-color composite map are demonstrated in Figure 6. The number of samples for each class is given in Table 4. The third data set is an urban site over the University of Houston campus and neighboring area, which was collected by an ITRES-CASI 1500 sensor. The data set is of 2.5-m spatial resolution and consists of 349 × 1905 pixel vectors. The hyperspectral image is composed of 144 spectral bands ranging from 380 to 1050 nm. Fifteen different land-cover classes are provided in the ground reference map, as shown in Figure 7. The samples are listed in Table 5. For all three data sets, we split the labeled samples into three subsets, i.e., training, validation, and test samples. In our experiment, we randomly chose 200 labeled samples as the training set to train the weights and biases of each neuron and the transformation matrices between consecutive capsule layers. The proper architecture of our network was designed based on performance evaluation on 100 validation samples, which were also randomly chosen from the labeled samples. The choice of hyper-parameters, like the kernel size in the convolution operation and the dimensions of the vector output of each capsule, was also guided by the validation set. After the training was done, all remaining labeled samples served as the test set to evaluate the capability of the network and to obtain the final classification results. Three evaluation criteria were investigated: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (K).
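The three criteria can be computed from a confusion matrix; a minimal NumPy sketch (rows indexing the reference classes, columns the predicted classes):

```python
import numpy as np

def accuracy_metrics(conf):
    """OA, AA, and Kappa from a confusion matrix.

    conf[r, c] counts test samples of reference class r predicted as class c.
    """
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                        # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1)) # mean per-class accuracy
    # chance agreement from the row and column marginals
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

A perfectly diagonal confusion matrix gives OA = AA = K = 1, while a classifier no better than the class marginals gives K near 0.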

The Classification Results of the 1D Capsule Network
The 1D capsule network, which is built only on spectral features, contains two parts. One is a fully connected capsule network that uses normalized spectral vectors as input. The other is the Conv-Capsule network, which takes as input spectral features extracted by PCA. We call the two methods 1D-Capsule and 1D-Conv-Capsule for short. In the 1D-Conv-Capsule, we first used PCA to reduce the spectral dimensionality of the data. Then, we randomly chose 200 and 100 labeled samples as the training and validation data for each data set. The training samples were imported to the 1D capsule network. The number of principal components was chosen based on the classification result on the validation samples. Some other hyper-parameters (e.g., the learning rate, the convolutional kernel size, α in the LeakyReLU, etc.) were also determined using the validation set. In our method, the size of the mini-batch was 100, and the number of training epochs was set to 150. We used a decreasing learning rate, initialized to 0.01 at the beginning of the training process. The number of principal components was set to 20, 20, and 30 for the Salinas, KSC, and Houston data sets, respectively. We used α = 0.1 in the LeakyReLU function. The parameters m+, m−, and λ in the loss function were set to 0.9, 0.1, and 0.5, respectively.
The main architectures of the 1D-Conv-Capsule network for each data set are shown in Table 6. Since the same number of principal components was chosen as input, the networks for the Salinas and KSC data sets had the same architecture. In Table 6, (5 × 1 × 8) × 8 in the fourth layer (i.e., the transition layer) means that eight channels of convolution with a kernel size of 5 × 1 were used, and each channel output eight feature maps. Thus, the fourth layer output a capsule with eight channels. The fifth layer was a Conv-Capsule layer with eight channels of capsule input (i.e., the number of capsule channels output by the fourth layer) and 16 channels of capsule output. The kernel size was 5 × 1; we use (5 × 1 × 8) × 16 to represent this operation. The last layer was a fully connected capsule layer: all capsules from the fifth layer were connected with the n_class capsules in this layer. The length of the vector output by each capsule in this layer represents the probability that the network's input belongs to the corresponding class. Between consecutive capsule layers in the 1D-Conv-Capsule, three routing iterations were used to determine the coupling coefficients b_ij^p.

In this set of experiments, our methods were compared with other classical classification methods that are based only on spectral information. These methods included random forest (RF) [47], the multilayer perceptron (MLP) [48], the linear support vector machine (L-SVM), the support vector machine with the radial basis kernel function (RBF-SVM) [17], the recurrent neural network (RNN) [24], and the convolutional neural network (1D-CNN) [28]. Furthermore, 1D-PCA-CNN, which has nearly the same architecture as the 1D-Conv-Capsule (apart from the capsule layers), was also designed to give a fair comparison. The classification results are shown in Tables 7-9.

The experimental setups of the classical classification methods are as follows. For RF, a grid search and four-fold cross-validation were used to set its two key hyper-parameters (i.e., the number of features to consider when looking for the best split, F, and the number of trees, T). The search ranges of F and T were (5, 10, 15, 20) and (100, 200, 300, 400), respectively. The MLP used in this experiment was a fully connected neural network with one hidden layer containing 64 hidden units. L-SVM is a linear SVM with no kernel function, and RBF-SVM uses the radial basis function as the kernel. For L-SVM and RBF-SVM, a grid search and four-fold cross-validation were also used to find the most appropriate hyper-parameters (i.e., C for L-SVM and (C, γ) for RBF-SVM). The search ranges were exponentially growing sequences of C and γ (C = 10^-3, 10^-2, ..., 10^3; γ = 10^-3, 10^-2, ..., 10^3). A single-layer RNN with a gated recurrent unit and the tanh activation function was adopted. The architecture of the 1D-CNN was designed as in [28] and contained an input layer, a convolutional layer, a max-pooling layer, a fully connected layer, and an output layer. The convolutional kernel size and the number of kernels were 17 and 20 for all three data sets. The pooling size was 5, 5, and 4 for the Salinas, KSC, and Houston data sets, respectively.

Tables 7-9 show the classification results obtained with the aforementioned experimental settings. All experiments were run ten times with different random training samples, and the classification accuracy is given in the form of mean ± standard deviation. The 1D-Conv-Capsule network showed the best performance in terms of accuracy on all three data sets.
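The SVM hyper-parameter search described above can be sketched with scikit-learn. This is an illustrative sketch, not the paper's code: the toy feature matrix and labels are assumptions, while the exponential (C, γ) grid and the four-fold cross-validation mirror the described setup.

```python
# Hedged sketch: grid search with 4-fold CV over (C, gamma) for an RBF-SVM.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((120, 20))              # 120 training pixels, 20 PCA features (synthetic)
y = rng.integers(0, 3, size=120)       # 3 synthetic classes

param_grid = {
    "C": [10.0 ** k for k in range(-3, 4)],      # 10^-3 ... 10^3
    "gamma": [10.0 ** k for k in range(-3, 4)],  # 10^-3 ... 10^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=4)
search.fit(X, y)
best = search.best_params_             # e.g. the (C, gamma) pair with best CV accuracy
```

The same pattern applies to the linear SVM (grid over C only) and, with a different estimator and grid, to the random forest's (F, T) search.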
For all three data sets, RBF-SVM, which is well known for handling a limited number of training samples, provided competitive classification results. We take the experiments with 200 training samples as an example to discuss the results. For the Salinas data set, the 1D-Conv-Capsule exhibited the best OA, AA, and K, with improvements of 2.05%, 3.01%, and 0.023 over RBF-SVM, respectively. Our approach outperformed 1D-PCA-CNN by 1.6%, 1.62%, and 0.0179 in terms of OA, AA, and K, respectively. For the KSC data set, the OA of the 1D-Conv-Capsule was 88.22%, an increase of 1.58% and 2.2% compared with RBF-SVM and 1D-PCA-CNN, respectively. For the Houston data set, the 1D-Conv-Capsule improved the OA, AA, and K of 1D-PCA-CNN by 3.21%, 2.84%, and 0.0346, respectively. These results show that the 1D-Conv-Capsule achieved the best performance in terms of OA, AA, and K on all three data sets. In addition, experiments with 100 and 300 training samples were also conducted to demonstrate the effectiveness of the proposed methods. From the results reported in Tables 7-9, it can be seen that the 1D-Conv-Capsule outperformed the other classical classification methods, especially when the number of training samples was extremely limited (i.e., 100 training samples).
Furthermore, the 1D-Conv-Capsule was run with different numbers of principal components as input. Figure 8 shows the classification results of the 1D-Conv-Capsule on the three data sets using 200 training samples. Because only spectral information is fed to the 1D-Conv-Capsule, relatively more principal components were used to make sure that sufficient spectral information was preserved while, at the same time, maintaining low computational complexity. From Figure 8, it can be seen that when the number of selected components is either too small or too large, the classification results tend to be poor. On the one hand, when the number of principal components is low, the spectral information is not sufficiently preserved and the network cannot efficiently extract spectral features. On the other hand, when the number of principal components is high, the network is over-trained; this situation becomes worse when the number of training samples is limited. The best classification performance was achieved when the number of principal components was set to 20, 20, and 30 for the Salinas, KSC, and Houston data sets, respectively.

The Analysis of Learnt Features of the 1D Capsule
From the aforementioned description of the capsule, it can be understood that the output of a capsule is a vector representation of a type of entity. In order to demonstrate the real advantage of the capsule network on remote sensing data, we performed another experiment based on the 1D-Capsule network followed by a reconstruction network (1D-Capsule-Recon). The architecture of the reconstruction network is shown in Figure 9. According to the label of the input pixel, the representative vector of the corresponding capsule in the ClassCaps layer was imported into the reconstruction network (e.g., if the input pixel belonged to the i-th class, the vector output of the i-th capsule in the ClassCaps layer was used as input to the reconstruction network). The reconstruction network contained three fully connected (FC) layers. The first two FC layers had 128 and 256 hidden units with the ReLU activation function. The last FC layer, with a Sigmoid activation function, output the reconstructed spectra (i.e., a combination of the normalized spectral reflectance of the different bands) corresponding to the input of the 1D-Capsule-Recon. The reconstruction loss, i.e., the Euclidean distance between the input and the reconstructed spectra, was added to the margin loss described in Section 3:

L_total = L_M + ε L_R,

where L_M is the margin loss and L_R is the reconstruction loss. ε is a weight coefficient used to prevent L_R from dominating L_M during the training procedure. In the experiment, ε was set to 0.1. L_total was used as the loss function for the 1D-Capsule-Recon.

To visualize the vector representation of a capsule, we made use of the reconstruction network. After the training procedure of the 1D-Capsule-Recon was done, we randomly chose some samples from different classes and computed the representation vectors of their corresponding capsules in the ClassCaps layer. We made perturbations in different dimensions of each vector and fed the perturbed vectors to the reconstruction network. Figure 10 shows the reconstructed results of three class samples from the Salinas data set, in which two dimensions of the representation vector were tuned. In Figure 10, "original" denotes the input spectra of the 1D-Capsule-Recon. The notation v(i) + ∆ in Figure 10 means that we tuned the i-th dimension of the representation vector v with a perturbation ∆; the perturbed v was then used to reconstruct the spectra. From the results shown in Figure 10, the representation vector v can reconstruct the spectra well, which means that it captures the information in the spectra with low dimensionality. Furthermore, as shown in Figure 10, v(i) + ∆ influences the reconstruction of some specific bands, which means that v(i) has a close relationship with those bands. Since v is a vector composed of several components v(i), it is a robust and condensed representation of the spectra.
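The combined objective above can be sketched numerically. This is a hedged illustration, not the authors' implementation: the synthetic tensors stand in for real network outputs, and the squared-error form of the reconstruction term is an assumption; the margin-loss parameters (m+ = 0.9, m− = 0.1, λ = 0.5) and ε = 0.1 are the values stated in the paper.

```python
# Hedged sketch of L_total = L_M + eps * L_R for the 1D-Capsule-Recon.
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_lengths: (batch, n_class) capsule output lengths; targets: one-hot."""
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return np.sum(present + absent, axis=1).mean()

def total_loss(v_lengths, targets, spectra, recon, eps=0.1):
    """eps down-weights L_R so it does not dominate the margin loss L_M."""
    l_r = np.mean(np.sum((spectra - recon) ** 2, axis=1))  # reconstruction error
    return margin_loss(v_lengths, targets) + eps * l_r

rng = np.random.default_rng(0)
lengths = rng.random((4, 3))        # 4 samples, 3 classes (synthetic)
onehot = np.eye(3)[[0, 1, 2, 0]]
spec = rng.random((4, 20))          # input spectra
rec = rng.random((4, 20))           # reconstructed spectra
loss = total_loss(lengths, onehot, spec, rec)
```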

The Classification Results of the 3D Capsule Network
In the 3D capsule network, the network extracts both spectral and spatial features effectively, which can lead to better classification accuracy than that obtained by the 1D capsule network. As mentioned above, we proposed two 3D frameworks, i.e., the 3D-Capsule and the 3D-Conv-Capsule. Similar to the 1D framework, the 3D-Capsule is an original fully connected capsule network, while the 3D-Conv-Capsule is the convolutional capsule network. Additionally, the 3D-Capsule directly uses the original hyperspectral data as input, while the 3D-Conv-Capsule utilizes EMAP to extract features from the hyperspectral data. In the 3D-Conv-Capsule, three principal components were used, and the parameters in EMAP were set as in [21]. After the EMAP analysis, the number of spectral dimensions became 108 for all three data sets. In this set of experiments, the numbers of training and validation samples were the same as for the 1D capsule network. The mini-batch size was also 100. The number of training epochs was set to 100 with a learning rate of 0.001. The parameters in the loss function were the same as for the 1D capsule network. The details of the architecture of the 3D-Conv-Capsule network are shown in Table 10; the definitions of the parameters in Table 10 can be found in the description of the 1D-Conv-Capsule network. Batch normalization was also used to improve the performance of the network. Three routing iterations and n_class capsules with a 16-dimensional output vector were used. SVM-based and CNN-based methods were included in the experiments to give a comprehensive comparison. The classification results are shown in Tables 11-13. For the three data sets, we used the 27 × 27 neighbors of each pixel as the input 3D images in these methods.
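Forming the 27 × 27 input around each labeled pixel can be sketched as below. This is an assumption-laden illustration (the edge-padding strategy and the synthetic 108-band EMAP cube are not from the paper; only the 27 × 27 window and the 108-feature dimensionality are):

```python
# Hedged sketch: extract an a x a spatial neighborhood around each pixel
# of an EMAP feature cube to form the 3D network input.
import numpy as np

def extract_patch(cube, row, col, size=27):
    """cube: (H, W, B) feature cube; returns a (size, size, B) patch.
    Borders are handled by edge padding (one possible choice)."""
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="edge")
    # After padding, original pixel (row, col) sits at (row + half, col + half)
    return padded[row:row + size, col:col + size, :]

rng = np.random.default_rng(0)
emap_cube = rng.random((50, 60, 108))   # synthetic stand-in for 108 EMAP features
patch = extract_patch(emap_cube, row=10, col=20, size=27)   # (27, 27, 108)
```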
Due to the high classification accuracy of SVM, several SVM-based HSI classifiers were adopted for comparison. The extended morphological profile with SVM (EMP-SVM) is a widely used spectral-spatial classifier [19]. In the EMP-SVM method, morphological opening and closing operations were used to extract spatial information from the first three principal components of the HSI, computed by PCA. In the experiments, the shape of the structuring element (SE) was set to a disk, and the radius of the disk increased from two to eight with an interval of two. Therefore, 27 spatial features were generated. The learned features were fed to an RBF-SVM to obtain the final classification results. EMAP is a generalization of the EMP and can extract more informative spatial information. EMAP was also combined with the random forest classifier (EMAP-RF) [20]. For a fair comparison, the parameters in EMAP were kept the same as for the 3D-Conv-Capsule. For RBF-SVM, the optimal parameters C and γ were again obtained by grid search and four-fold cross-validation. Furthermore, CNNs were also used for comparison: we evaluated 3D-CNN, EMP-CNN, and 3D-EMAP-CNN, whose CNN architectures were the same as in [31]. To give a comprehensive comparison, a spectral-spatial residual network (SSRN) recently proposed in [49] was also adopted.

Tables 11-13 give the classification results of the proposed methods and the comparison methods on the three data sets. We again use the classification results with 200 training samples as an example. For the Salinas data set, the 3D-Conv-Capsule exhibited the highest OA, AA, and K, with improvements of 3.64%, 3.35%, and 0.0409 over 3D-EMAP-CNN, respectively. Our 3D-Capsule approach also performed better than 3D-EMAP-CNN in terms of OA, AA, and K. For the KSC data set, the 3D-Conv-Capsule improved the OA, AA, and K of EMP-CNN by 2.19%, 3.36%, and 0.0244, respectively. Our 3D-Capsule method also showed higher classification accuracy than EMP-CNN, with improvements of 0.52%, 1.12%, and 0.0057 in terms of OA, AA, and K. For the Houston data set, we obtained similar results. Experiments with 100 and 300 training samples were investigated as well; the detailed classification results are shown in Tables 11-13. Compared with the other state-of-the-art methods, the 3D-Conv-Capsule demonstrated the best performance under the different numbers of training samples.
In the experiments with the 3D-Conv-Capsule, we also explored how the number of principal components used in the EMAP analysis affects the classification results. Because spatial information is considered and the EMAP analysis significantly increases the data volume, we used relatively fewer principal components here than with the 1D-Conv-Capsule. Figure 11 shows the classification results for the 3D-Conv-Capsule. The 3D-Conv-Capsule with different numbers of principal components outperformed the other comparison experiments. Unlike with the 1D-Conv-Capsule, preserving more principal components leads to a vast data volume, which implies higher hardware requirements and a longer training time for the 3D-Conv-Capsule. Although the classification accuracy may be higher with relatively more components, we used only three principal components in the 3D-Conv-Capsule in consideration of the computational cost.

Parameter Analysis
In the 3D-Conv-Capsule, convolutional layers are used as feature extractors that convert the original input into the capsules' input. Thus, the number of convolutional layers and the convolutional kernel size used in the 3D-Conv-Capsule influence the classification performance of the model. Furthermore, because the input of the 3D-Conv-Capsule is the a × a neighborhood around a pixel, the neighborhood size is also an important factor. These factors are analyzed below.
When we explored the influence of one parameter on the classification result, the other parameters were fixed. The neighborhood size and convolution kernel size were set to 27 and 3 when we analyzed the number of convolutional layers. For the analysis of the convolution kernel size, 27 × 27 neighborhoods and two convolutional layers were used in the 3D-Conv-Capsule. Similarly, the number of convolutional layers and the convolution kernel size were set to 2 and 3 for the analysis of the neighborhood size. All the experiments in this analysis were conducted with 200 training samples. Tables 14-16 show the detailed classification results. As reported in Table 14, the use of two convolutional layers gave the best classification results: one convolutional layer could not extract features efficiently, while three layers made the model prone to overfitting. Table 15 shows the classification results with different convolution kernel sizes; the 3D-Conv-Capsule performed better when the kernel size was 3. For the neighborhood size, the 3D-Conv-Capsule obtained good classification accuracies on the Salinas and KSC data sets when the neighborhood size was relatively large, but the opposite held for the Houston data set.

Visualization of Learnt Features from the Capsule Network
Unlike traditional neural networks, which use a sequence of scalar values to represent the probability of the input belonging to different classes, capsule networks output n_class (i.e., the number of classes) capsules that represent entities of different classes. The length of the vector output by each capsule (i.e., the Euclidean norm of the vector) represents the probability that the corresponding entity exists. In HSI classification tasks, the lengths of the different capsules' output vectors can therefore be interpreted as the probabilities that the input belongs to the different classes.
We randomly chose several samples from the test data set and imported them into the trained 3D-Conv-Capsule network. The length of the vector output by each capsule in the ClassCaps layer was computed and visualized in Figure 12. From the results shown in Figure 12, it can be observed that the capsule corresponding to the true class output the longest vector. Moreover, due to the similarity between the Graminoid marsh and Spartina marsh classes, the results for three samples from the Graminoid marsh class show that the vector corresponding to this similar class was longer than those of the other classes.
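The length-as-probability reading above can be sketched in a few lines. This is a minimal illustration under assumed shapes and synthetic capsule outputs, not the network's actual values:

```python
# Hedged sketch: interpret the Euclidean norm of each ClassCaps output
# vector as the class score, and predict the class with the longest capsule.
import numpy as np

def predict_from_capsules(class_caps):
    """class_caps: (n_class, dim) output vectors; returns (lengths, predicted class)."""
    lengths = np.linalg.norm(class_caps, axis=1)
    return lengths, int(np.argmax(lengths))

caps = np.array([
    [0.05, 0.02, 0.01],   # class 0 capsule (short -> unlikely)
    [0.60, 0.55, 0.40],   # class 1 capsule (long -> predicted class)
    [0.10, 0.08, 0.12],   # class 2 capsule
])
lengths, pred = predict_from_capsules(caps)
print(pred)  # 1
```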

Time Consumption
All the experiments in this paper were conducted on a Dell laptop equipped with an Intel Core i5-7300H processor running at 2.5 GHz, 8 GB of DDR4 RAM, and an NVIDIA GeForce GTX 1050Ti graphics processing unit (GPU). The software environment consisted of Windows 10 as the operating system, CUDA 9.0 and cuDNN 7.1, the Keras framework with TensorFlow as the backend, and Python 3.6 as the programming language.

The training and test times of the different models are reported in Tables 17 and 18. The traditional RF and SVM classifiers demonstrated superior computational efficiency. As for the deep learning models, each could be trained within a few minutes owing to the limited number of training samples and the GPU's strong acceleration. The 3D-Conv-Capsule required nearly the same training time as 3D-CNN and less time than SSRN. In the experiments, it was found that the capsule network-based method converged faster than the CNN-based methods (e.g., 100 epochs for the 3D-Conv-Capsule versus 500 epochs for 3D-EMAP-CNN). In future work, more specific computing acceleration for the capsule network could further boost the computational efficiency of capsule-based methods.

The proposed methods explored the convolutional capsule network for HSI classification, representing a new methodology for better modeling and processing of HSIs. Compared with a fully connected capsule layer, the convolutional capsule layer dramatically reduces the number of trainable parameters, which is critical to avoid over-training. In our future work, deep capsule architectures built on the convolutional capsule, analogous to SSRN among CNNs, will be investigated to fully explore the potential of capsule networks.
(1) A modification of the capsule network named Conv-Capsule is proposed. The Conv-Capsule uses local connections and shared transform matrices in the network, which reduces the number of trainable parameters and mitigates the overfitting issue in classification. (2) Two frameworks based on the capsule network, called 1D-Capsule and 3D-Capsule, are proposed for HSI classification. (3) To further improve the HSI classification performance, two frameworks called 1D-Conv-Capsule and 3D-Conv-Capsule are proposed. (4) The proposed methods are tested on three well-known hyperspectral data sets under the condition of having a limited number of training samples.
u_i^{x+p} is the output of the i-th channel capsule at position (x + p) in the last capsule layer. W_ij^p is the transform matrix between u_i^{x+p} and v_j^x. c_ij^p represents the coupling coefficients determined by the dynamic routing algorithm. An illustration of the 1D Conv-Capsule layer is shown in Figure 1.
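The quantities above combine, in the standard capsule formulation, as v_j^x = squash(Σ_i Σ_p c_ij^p W_ij^p u_i^{x+p}). The sketch below computes one output capsule this way; the shapes, the uniform coupling initialization, and the squash nonlinearity are assumptions based on the usual capsule-network construction, not the authors' exact code.

```python
# Hedged sketch of one output position of a 1D Conv-Capsule layer:
# v_j^x = squash( sum_i sum_p c_ij^p * W_ij^p @ u_i^{x+p} ).
import numpy as np

def squash(s):
    """Shrinks a vector's norm into (0, 1) while keeping its direction."""
    norm2 = np.dot(s, s)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

rng = np.random.default_rng(0)
n_in, n_out, kernel, dim = 8, 16, 5, 8          # channels, kernel size, capsule dim
u = rng.random((n_in, kernel, dim))             # u_i^{x+p}: inputs under the kernel
W = rng.random((n_in, n_out, kernel, dim, dim)) # W_ij^p: shared transform matrices
c = np.full((n_in, n_out, kernel), 1.0 / n_out) # c_ij^p: coupling coeffs (uniform init)

j = 0                                           # one output channel at position x
s = sum(c[i, j, p] * (W[i, j, p] @ u[i, p])
        for i in range(n_in) for p in range(kernel))
v = squash(s)                                   # v_j^x: an 8-dimensional capsule
```

Dynamic routing would then update c_ij^p from the agreement between the prediction vectors and v_j^x, repeating for the three iterations used in the paper.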

Figure 2. The framework of the 1D-Conv-Capsule network for hyperspectral image (HSI) classification.

Figure 4. The framework of the 3D-Conv-Capsule network for HSI classification.

Figure 5. The Salinas data set. (a) False-color composite and (b) ground reference map.

Figure 6. The Kennedy Space Center (KSC) data set. (a) False-color composite and (b) ground reference map.

Figure 7. The Houston data set. (a) False-color composite and (b) ground reference map.

Figure 8. Classification results of the 1D-Conv-Capsule on three data sets with respect to different numbers of principal components.

Figure 9. The architecture of the reconstruction network.

Figure 10. Normalized spectral reflectance reconstructed by the perturbed representation vectors of three samples from the Salinas data set. The two pictures in each row are the results reconstructed by tuning different dimensions of the representation vector of the same sample. (a) Sample from the broccoli_green_weeds_1 class; (b) sample from the grapes_untrained class; (c) sample from the vinyard_untrained class.

Figure 11. Classification results of the 3D-Conv-Capsule on three data sets with respect to different numbers of principal components.

Figure 12. Visualization of the learnt features (i.e., the length of the vector output of each capsule in the ClassCaps layer) from the 3D-Conv-Capsule network on the KSC data set. The four pictures in each row are the results of three randomly selected samples of the same class and an example of the input images (i.e., a false-color image). (a) Scrub class; (b) Willow swamp class; and (c) Graminoid marsh class.
In this paper, an improved capsule network called the convolutional capsule (Conv-Capsule) was proposed. On the basis of the Conv-Capsule, new deep models called 1D-Conv-Capsule and 3D-Conv-Capsule were investigated for HSI classification. Furthermore, the 1D-Conv-Capsule and the 3D-Conv-Capsule were combined with PCA and EMAP, respectively, to further improve the classification performance. The proposed models, 1D-Conv-Capsule and 3D-Conv-Capsule, can effectively extract spectral and spectral-spatial features from HSI data. They were tested on three widely used hyperspectral data sets under the condition of having a limited number of training samples. The experimental results showed their superiority over classical SVM-based and CNN-based methods in terms of classification accuracy.

Table 3 .
Land cover classes and numbers of samples in the Salinas data set.

Table 4 .
Land cover classes and numbers of samples in the KSC data set.

Table 5 .
Land cover classes and numbers of samples in the Houston data set.

Table 6 .
The architectures of the 1D-Conv-Capsule network for the different data sets.

Table 7 .
Classification with spectral features on the Salinas data set with different training samples.

Table 8 .
Classification with spectral features on the KSC data set with different training samples.

Table 9 .
Classification with spectral features on the Houston data set with different training samples.

Table 10 .
The architectures of the 3D-Conv-Capsule network.

Table 11 .
Classification with spectral-spatial features on the Salinas data set with different training samples.

Table 12 .
Classification with spectral-spatial features on the KSC data set with different training samples.

Table 13 .
Classification with spectral-spatial features on the Houston data set with different training samples.

Table 14 .
Classification results of the 3D-Conv-Capsule with different numbers of convolutional layers on three data sets.

Table 15 .
Classification results of the 3D-Conv-Capsule with different convolutional kernel sizes on three data sets.

Table 16 .
Classification results of the 3D-Conv-Capsule with different neighborhood sizes on three data sets.