Patch-Wise Semantic Segmentation for Hyperspectral Images via a Cubic Capsule Network with EMAP Features

Abstract: Convolutional neural networks (CNNs) used in current hyperspectral image (HSI) classification/segmentation methods suffer from several disadvantages: they cannot recognize the rotation of spatial objects, they have difficulty capturing fine spatial features, and principal component analysis (PCA) ignores some important information when it retains only a few components. To overcome these problems, this paper proposes an HSI segmentation model based on extended multi-morphological attribute profile (EMAP) features and a cubic capsule network (EMAP–Cubic-Caps). EMAP features can effectively extract various attribute-profile features of entities in an HSI, and the cubic capsule neural network can effectively capture complex spatial features with more detail. Firstly, the EMAP algorithm is introduced to extract the morphological attribute profile features of the principal components obtained by PCA, and the EMAP feature map is used as the input of the network. Then, the low-level spectral and spatial information of the HSI is extracted by a cubic convolutional network, and the high-level information is extracted by the capsule module, which consists of an initial capsule layer and a digital capsule layer. Experimental comparisons on three well-known HSI datasets validate the superiority of the proposed algorithm in semantic segmentation.


Introduction
In recent years, hyperspectral remote sensing has become an important means of earth observation [1]. Hyperspectral images (HSIs) contain rich spectral and spatial information and have been widely used in agricultural production [2], geological prospecting [3], food safety [4], military target reconnaissance [5] and other important fields. The classification of hyperspectral images plays a very important role in the above fields. With the development of hyperspectral imaging instruments, researchers can obtain HSIs with high spatial resolutions [6,7], which makes HSIs contain more effective information, thus providing great convenience for the development of HSIs segmentation.

• In theory, a neural network can extract any feature, as long as the network architecture is good enough. However, designing a neural network that can extract a specific geometric structure is complicated and time-consuming. Therefore, in this paper, EMAP features are used as the input of the network, which has the advantage of extracting rich spatial geometric features well.

• The cubic convolutional network can extract the spatial-spectral features of the hyperspectral image from three dimensions, which helps make full use of the existing information and improves the classification accuracy.

• The capsule network can further extract more discriminative deep features, such as spectra with the properties of heterogeneity and homogeneity, to better distinguish pixels at class boundaries.

Deep Capsule Network
Due to their intrinsic structure, CNNs have some shortcomings. For example, the main functions of pooling are to retain the key features, reduce the time complexity of the network and make the network invariant to translational transformations. Because of this invariance, it is difficult for CNNs to distinguish the positional relationships of features in the spatial domain, which results in a poor ability to distinguish finely detailed objects. In 2011, inspired by neuroanatomy and cognitive neuroscience, Hinton proposed the concept of capsules to identify spatial location information. In 2017, Hinton and his colleagues published two papers on the classification of handwritten character sets using the capsule network, achieving the highest classification accuracy at that time.
The capsule network is composed not of scalar neurons but of capsules. A capsule is a group of neurons expressed as a vector, so that it can represent various features, such as the pose and edge of an entity [46]. Multiple capsules form a hidden layer; the modulus length of the vector in a capsule represents the probability of classifying the entity, so the squash function is used to restrict the modulus length to the interval [0, 1], as shown in Equation (1):

$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2} \cdot \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|} \quad (1)$$
where s_j represents the capsule input vector, and v_j represents the output vector of the capsule. The squash function does not change the direction of the vector s_j, only its length: the greater the value of ‖s_j‖², the closer ‖s_j‖²/(1 + ‖s_j‖²) is to 1; the smaller the value of ‖s_j‖², the closer ‖s_j‖²/(1 + ‖s_j‖²) is to 0. This ensures that the learning of features is more stable. Different from the connections between scalar neurons, two adjacent layers of capsules are connected in a fully connected way, as shown in Figure 1. The capsules of the (l − 1)th layer are fully connected with the capsules of the lth layer. s_i^l and v_i^l represent the input vector and output vector of the ith capsule of the lth layer, respectively. w_ij^l and c_ij^l respectively represent the weight and coupling coefficient of the connection between the ith capsule in the (l − 1)th layer and the jth capsule in the lth layer.
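To make the squash non-linearity concrete, it can be sketched in NumPy as follows (a minimal sketch; the function name and the numerical-stability epsilon are our additions, not from the paper):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity of Equation (1): shrinks the vector s_j to a
    length in [0, 1) while preserving its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # ||s||^2 / (1 + ||s||^2)
    return scale * s / np.sqrt(sq_norm + eps)  # unit vector times scale
```

A long input vector such as (3, 4), with norm 5, is mapped to a vector of norm 25/26 ≈ 0.96, while a short vector is pushed toward 0, so the output length can be read as a probability.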
In addition to the input capsules of the first layer, the input vector s_j^l of a capsule in a subsequent layer is obtained as the weighted sum of the prediction vectors û_{j|i}^l. The formula is defined as follows:

$$\mathbf{s}_j^l = \sum_i c_{ij}^l \hat{\mathbf{u}}_{j|i}^l, \qquad \hat{\mathbf{u}}_{j|i}^l = \mathbf{w}_{ij}^l \mathbf{v}_i^{l-1} \quad (2)$$

The value of the coupling coefficient c_ij^l is updated by the dynamic routing algorithm during the iterative process. The coupling coefficients between capsule i in the (l − 1)th layer and all capsules in the lth layer sum to 1, which is enforced by the softmax activation function [47]. The specific formula is expressed as follows:

$$c_{ij}^l = \frac{\exp(b_{ij}^l)}{\sum_k \exp(b_{ik}^l)} \quad (3)$$

where the initial value of the parameter b_ij^l is 0; it changes during the iterations of the dynamic routing algorithm and represents the prior probability of the coupling between capsule i in the (l − 1)th layer and capsule j in the lth layer. The update formula is as follows:

$$b_{ij}^l \leftarrow b_{ij}^l + \hat{\mathbf{u}}_{j|i}^l \cdot \mathbf{v}_j^l \quad (4)$$

The dynamic routing algorithm is an iterative algorithm whose main purpose is to update the coupling coefficients by repeatedly comparing the degree of agreement between the prediction vectors of the previous capsule layer and the output vectors of the next capsule layer. The agreement is re-allocated to the prediction vectors to coordinate the relationship between the capsule layers so that the output vectors of the next capsule layer converge to accurate predictions.
The pseudo-code of the dynamic routing algorithm is elaborated in Algorithm 1 [48].
Algorithm 1. The dynamic routing algorithm
1. Input: the prediction vectors û_{j|i}^l, the number of routing iterations r; k ← 0.
2. Initialize b_ij^l ← 0 for every capsule i in layer l − 1 and every capsule j in layer l.
3. While k < r:
4. Compute the coupling coefficients c_ij^l ← softmax(b_ij^l) and the input vector s_j^l ← Σ_i c_ij^l û_{j|i}^l.
5. Update the output vector v_j^l ← squash(s_j^l).
6. Update the parameter b_ij^l ← b_ij^l + û_{j|i}^l · v_j^l.
7. Update the number of iterations k ← k + 1.
8. End
9. Return the output vector v_j^l.

Figure 2 shows the architecture of a capsule network, which is composed of a convolutional layer, an initial capsule layer and a digital capsule layer. The convolutional layer uses convolution operations and rectified linear unit (ReLU) activation functions to perform feature extraction on the image, and the output features are used as the inputs of the capsule layer. The initial capsule layer continues to perform convolution operations on the obtained feature maps, converts the local convolved features into capsules and is fully connected with each capsule in the digital capsule layer. The digital capsule layer has a total of C capsules, where C represents the number of categories in the data set; the digital capsules are obtained through the dynamic routing algorithm, and the modulus of the output vector in a capsule represents the probability of being classified into that category.
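The routing loop of Algorithm 1 can be sketched in NumPy as follows (an illustrative sketch only; the array shapes and function names are our assumptions, and the trainable weight matrices are omitted, so `u_hat` is taken to already hold the prediction vectors):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors u_{j|i}, shape (num_in, num_out, dim_out).
    Returns the output vectors v_j, shape (num_out, dim_out)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                # prior logits b_ij, start at 0
    for _ in range(num_iters):
        c = softmax(b, axis=1)                     # coupling coefficients, sum to 1 over j
        s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted sum of predictions
        v = squash(s)                              # squashed output vectors v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # agreement update b_ij += u_hat . v_j
    return v
```

Because the output lengths pass through the squash function, every returned v_j has norm strictly below 1, consistent with reading the modulus as a class probability.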

Extended Morphological Attribute Profile
Hyperspectral images have hundreds of bands and high spectral resolution; therefore, analyzing and processing them remains a great challenge. Specifically, due to the Hughes phenomenon, the high dimensionality of the data is a key problem: given a fixed number of training samples, as the feature dimension increases beyond a threshold, the generalization performance of the classifier decreases rather than increases. The threshold mainly depends on the number of training samples available to the classifier. For these reasons, to alleviate the curse of dimensionality and reduce the amount of calculation, feature extraction is usually used as a preprocessing step. PCA is commonly used for hyperspectral image classification tasks. The principle of PCA is to project the data into an orthogonal space so that the eigenvector corresponding to the largest eigenvalue preserves the maximum variance of the data. However, PCA may ignore some important information, especially when few components are retained. Therefore, it is a better choice to use morphological analysis to exploit the spatial features of the image after PCA.
The morphological profile (MP) [49] is a spatial feature extraction operator based on mathematical morphology. The spatial information obtained by multi-scale analysis can characterize the multi-scale variability of structures in the image; the disadvantage is that it is difficult to model other geometric features of the image. The morphological attribute profile (AP) [50] uses morphological attribute operations under the constraints of specific attribute criteria to obtain a series of attribute thinning profile maps and attribute thickening profile maps, which are stacked together. The extended multi-morphological attribute profile (EMAP) [51,52] uses multiple morphological attributes on the basis of the AP algorithm and combines all the obtained profile feature maps by stacking. Compared to the MP algorithm, EMAP represents the spatial information of the image more accurately.
For a single-band image f, its AP is obtained by attribute thickening and attribute thinning operations. For a given attribute A and threshold set B, the AP algorithm calculates the value of attribute A for each connected component in the image, compares it with the elements of the set, and uses opening and closing operations to determine whether an attribute thickening (ϕ) operation or an attribute thinning (γ) operation is performed. After comparison with all elements in the set, a set of attribute profiles is obtained.
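As an illustration of the attribute-filtering idea, a binary area-attribute thinning (and its dual thickening) can be sketched with SciPy connected-component labelling; this is a simplified binary stand-in for the grayscale attribute filters used in the paper, and all function names and thresholds are our own:

```python
import numpy as np
from scipy import ndimage

def area_thinning(binary, area_threshold):
    """Binary attribute thinning: remove connected components whose
    area attribute is below the threshold."""
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= area_threshold
    return keep[labels]

def attribute_profile(binary, thresholds=(2, 4, 16)):
    """Stack attribute thickenings (duals of thinnings, applied to the
    complement), the original image and attribute thinnings into an AP."""
    thins = [area_thinning(binary, t) for t in thresholds]
    thicks = [~area_thinning(~binary, t) for t in thresholds]
    return np.stack(thicks[::-1] + [binary] + thins, axis=-1)
```

With a threshold set of size S, this yields 2S + 1 profile maps per band, which matches the stacking described above.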
In order to reduce the dimensionality of the hyperspectral image and extract its effective features, PCA is generally applied first. Suppose the number of channels after dimensionality reduction is C; then, the feature maps obtained by applying the AP operation to the single-band image of each channel and stacking them are called the extended morphological attribute profile (EAP):

$$\mathrm{EAP} = \{AP(PC_1), AP(PC_2), \ldots, AP(PC_C)\}$$

where PC_i represents the ith principal component. The EMAP feature uses multiple morphological attributes, obtains an EAP for each attribute and stacks them together:

$$\mathrm{EMAP} = \{\mathrm{EAP}_{a_1}, \mathrm{EAP}_{a_2}, \ldots, \mathrm{EAP}_{a_m}\}$$

where a_i represents the ith morphological attribute. Commonly used morphological attributes include the area, the diagonal length of the bounding box of the region, the moment of inertia and the standard deviation. EMAP has a stronger ability to extract spatial features and has more advantages in extracting the spatial structure of the image.
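Putting the pieces together, the PCA-then-stack construction of the EMAP cube can be sketched as follows (a schematic sketch: the PCA is a plain SVD projection, and `profile_fns` stands in for real attribute-profile operators, so every name here is our assumption):

```python
import numpy as np

def pca_components(X, k):
    """Project the (pixels, bands) matrix onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def emap(cube, profile_fns, n_components=3):
    """EMAP sketch: PCA on the (H, W, B) cube, then apply every attribute
    profile to every principal component and stack all maps channel-wise."""
    h, w, b = cube.shape
    pcs = pca_components(cube.reshape(-1, b), n_components)
    pcs = pcs.reshape(h, w, n_components)
    maps = [fn(pcs[..., i])                 # one EAP slice per attribute and PC
            for fn in profile_fns.values()
            for i in range(n_components)]
    return np.concatenate(maps, axis=-1)
```

With 3 attributes, 3 principal components and 12 profile maps per AP, this stacking yields the 3 × 3 × 12 = 108 feature maps used as the network input in the next section.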

Cubic Capsule Network with EMAP Features
Aiming at alleviating the difficulty neural networks have in obtaining specific spatial structure features from hyperspectral images, and inspired by [43], we combined the cubic convolutional network with the capsule network and proposed a cubic capsule network with EMAP features for hyperspectral image classification. At the beginning of the network, the original hyperspectral image was first pre-processed: the high-dimensional hyperspectral data was analyzed by PCA, and the first three principal components were extracted. In addition, three morphological attributes, i.e., the area, the diagonal length of the bounding box of the region and the standard deviation, were used to extract the EMAP features; finally, a stacked data cube with 108 feature maps was obtained and used as the input of the network. We chose an image patch with a size of 15 × 15 × 108 as each training sample, and the batch size was 100. Figure 3 shows the structure diagram of the proposed EMAP-Cubic-Caps network. The network is divided into three parts, namely the cubic convolutional network, the initial capsule layer and the digital capsule layer.
The cubic convolutional network can effectively extract spatial features and spatial-spectral features, is more flexible in its training parameters than three-dimensional convolution, and trains faster. The cubic convolutional network part is shown in Figure 4. In the input 15 × 15 × 108 image patch, the convolution operation is performed on the three planes of the data cube. The size of the convolution kernel of each branch is 3 × 3 × 1, the number of convolution kernels is 12, the convolution stride is (1, 1, 1), and zero padding is used. After each convolution layer, the feature map is batch normalized, and the ReLU activation function is applied. Three convolutions are performed on each of the three branches. After the convolutions, the three branches respectively generate feature maps with sizes (15 × 15 × 108, 12), (15 × 108 × 15, 12) and (108 × 15 × 15, 12). The three data cubes are stacked together to generate a (15 × 15 × 108, 36)-sized feature map, which is used as the input of the initial capsule layer.
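One stage of this three-branch scheme can be sketched with tf.keras as follows (a sketch only, written for TF 2.x rather than the paper's TF 1.14: the paper stacks three convolution stages per branch while this shows a single stage, and the function name and the inverse permutations are our own):

```python
import tensorflow as tf
from tensorflow.keras import layers

def branch(x, perm, inv_perm, filters=12):
    """Convolve one plane of the data cube: permute, apply a 3x3x1 Conv3D
    with batch normalization and ReLU, then permute back."""
    y = layers.Permute(perm)(x)  # Permute dims are 1-indexed, batch excluded
    y = layers.Conv3D(filters, (3, 3, 1), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    return layers.Permute(inv_perm)(y)

inp = layers.Input(shape=(15, 15, 108, 1))       # H x W x B patch
b1 = branch(inp, (1, 2, 3, 4), (1, 2, 3, 4))     # (H, W, B) plane
b2 = branch(inp, (1, 3, 2, 4), (1, 3, 2, 4))     # (H, B, W) plane
b3 = branch(inp, (3, 1, 2, 4), (2, 3, 1, 4))     # (B, H, W) plane
out = layers.Concatenate(axis=-1)([b1, b2, b3])  # -> (15, 15, 108, 36)
model = tf.keras.Model(inp, out)
```

Permuting back before concatenation is what lets the three per-plane feature maps share the common (15, 15, 108) axis order and stack into 36 channels.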
The initial capsule layer further combines the features extracted by the cubic convolutional network and encapsulates them into capsules. As shown in Figure 3, the tensor output by the initial capsule layer has dimensions (6 × 6 × 9, 32). The input tensor of the capsule layer is processed by a convolution operation, where the kernel size is 5 × 5 × 60, the number of kernels is 32, the stride is (2, 2, 8) and no zero padding is used. This performs feature integration and yields a feature map with a size of (6 × 6 × 9, 32). We then extract every 9 scalars along the third dimension as a vector and encapsulate them into a capsule.
In the digital capsule layer, we set the vector length of each capsule to 12; the number of capsules is C, which represents the number of sample categories in the data set. Each capsule represents one class of feature, and each dimension of the vector represents a category of the feature, for example, posture, texture or edge information. Each digital capsule is fully connected with all capsules in the initial capsule layer, and the vectors in the digital capsules are updated through the dynamic routing algorithm. The number of dynamic routing iterations is 3.
The network proposed in this paper uses the margin loss as its loss function. Since the vector modulus in a capsule of the digital capsule layer represents the probability of being classified into the corresponding class, if a sample belongs to category k, then the modulus of the kth capsule vector should be the largest. The loss function is defined as follows:

$$L_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2$$

where T_k is the indicator function, equal to 1 when the sample belongs to category k and 0 otherwise; m^+ represents the upper bound and m^- the lower bound, with values of 0.9 and 0.1, respectively; and λ is a down-weighting coefficient for the absent classes. The pseudo-code of the proposed EMAP-Cubic-Caps network for hyperspectral image classification is shown in Algorithm 2.

Algorithm 2. The pseudo-code of the EMAP-Cubic-Caps network
1. Input: hyperspectral data X and corresponding labels Y; iteration counters k1 ← 0, k2 ← 0; total numbers of iterations T1 = 100, T2 = 3; learning rate η = 0.0003.
2. Obtain the hyperspectral data X_EMAP after EMAP feature extraction.
3. Divide X_EMAP into training, verification and test sets, and input the training and verification sets into the cubic convolutional network. The sampling rates are shown in Tables 1–3.
4. While k1 < T1:
5. Perform the cubic convolutional network.
6. k1 ← k1 + 1.
7. End
8. Input the feature maps into the initial capsule layer.
9. Connect the initial capsules to the digital capsule layer and use dynamic routing to update the parameters; see Algorithm 1 for the specific steps.
10. Use the trained model to predict the test set.
11. Calculate OA, AA and Kappa.
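The margin loss above can be sketched in NumPy as follows (λ = 0.5 is the conventional down-weighting value from the capsule-network literature, an assumption here since the paper states only m+ and m−, and the function and variable names are ours):

```python
import numpy as np

def margin_loss(v_norms, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norms: (N, C) moduli of the digital-capsule vectors;
    labels: (N,) integer class indices; T is the one-hot indicator T_k."""
    n, c = v_norms.shape
    T = np.eye(c)[labels]
    pos = T * np.maximum(0.0, m_pos - v_norms) ** 2              # present-class term
    neg = lam * (1 - T) * np.maximum(0.0, v_norms - m_neg) ** 2  # absent-class term
    return (pos + neg).sum(axis=1).mean()
```

When the correct capsule's modulus exceeds 0.9 and all others fall below 0.1, both hinge terms vanish and the loss is zero.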

Experimental Data Set
In order to validate the effectiveness and generalization of the proposed method, three current well-known hyperspectral data sets, namely the Indian Pines, University of Pavia and Salinas, were employed.  Figure 5a,b are the pseudo-color image and the real label map of the Indian Pines dataset, respectively. Table 1 shows the labels of each class and the number of training, verification and test samples of Indian Pines dataset in the experiment.


University of Pavia
The University of Pavia data set is a hyperspectral image taken by the airborne Reflective Optics System Imaging Spectrometer (ROSIS-03) at the University of Pavia in Pavia, Italy, in 2003. Its size is 610 × 340 pixels, and the spatial resolution is 1.3 m per pixel; it covers 9 different land-cover classes and has 115 bands with wavelengths from 0.43 to 0.86 microns. Twelve bands were removed due to the influence of noise, and the experiments were conducted on the remaining 103 bands. Figure 5c,d are the pseudo-color image and label map of the University of Pavia data set, respectively. Table 2 shows the labels of each class of the University of Pavia data set in the experiment, as well as the numbers of training, verification and test samples.


Salinas
The Salinas dataset is a hyperspectral image of the Salinas Valley, California, USA, taken by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). It has a size of 512 × 217 pixels and a spatial resolution of 3.7 m per pixel. It contains 16 different land-cover classes and 224 bands; twenty bands were excluded because of the influence of water vapor, and the experiments were conducted on the remaining 204 bands. Figure 5e,f are the pseudo-color image and label map of the Salinas dataset, respectively. Table 3 shows the labels of each class of the Salinas dataset in the experiment and the numbers of training, verification and test samples.

Experimental Setup
To verify the superiority of the EMAP-Cubic-Caps network designed in this paper, we tested it on the Indian Pines, University of Pavia and Salinas datasets. The numbers of training, verification and test samples are shown in Tables 1–3. For the Indian Pines data set, a patch with a size of 15 × 15 × 108 centered on the pixel to be classified is extracted for each training sample; for the University of Pavia and Salinas data sets, a patch with a size of 23 × 23 × 108 centered on the pixel to be classified is extracted. The experiments were run on Windows 10, and the deep learning platform was Python 3.5 + TensorFlow 1.14.0 + Keras 2.1.5. The CPU is an Intel i7-4790K with 24 GB of memory, and the graphics processor is an NVIDIA GeForce GTX 1080Ti. Moreover, the overall accuracy (OA), average accuracy (AA) and Kappa coefficient (Kappa) are employed as quantitative indicators to assess the classification performance.
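For reference, OA, AA and the Kappa coefficient can all be computed from a confusion matrix as follows (a minimal sketch with our own function name):

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                 # fraction of correct pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class recalls
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Unlike OA, which can be inflated by large classes, AA weights every class equally and Kappa discounts the agreement expected by chance, which is why all three are reported together.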

Experiment and Analysis
This section presents the comparison results between the proposed EMAP-Cubic-Caps network and several representative hyperspectral image classification methods to validate its effectiveness. The four selected comparison methods are the support vector machine with EMAP features (EMAP-SVM) [21], the diverse region-based CNN (DR-CNN) [34], the spectral-spatial residual network (SSRN) [40] and the three-dimensional convolutional capsule network based on EMAP preprocessing (3D-Caps) [43]. For the EMAP-SVM classifier, the hyperspectral data is first preprocessed by PCA, and the first three components are used to extract the EMAP features via three morphological attributes for classification. The DR-CNN method uses a convolutional neural network to extract local features in the upper, lower, left and right directions of the training samples and merges them with global features for classification. The SSRN method constructs a residual network using 3D convolution kernels with a size of 3 × 3 × 128 to extract the spatial and spectral information of hyperspectral images. For the 3D-Caps method, the EMAP features extracted from the first three principal components are used as the input to the network; 3D-Caps contains two three-dimensional convolutional layers and three capsule layers, where the first is the initial capsule layer, the last is the digital capsule layer and the vectors in the latter two capsule layers are updated using the dynamic routing algorithm. For our proposed method, we name the variant without EMAP feature extraction on the original hyperspectral image Cubic-Caps. It first performs PCA and one-dimensional convolution on the original hyperspectral image to reduce the dimensionality, then uses the cubic convolutional network to extract the spectral and spatial-spectral features, and finally sends them to the capsule network for classification.
Table 4 lists the classification accuracies of the six algorithms on the Indian Pines dataset. Among them, the accuracy of EMAP-SVM is lower than that of the neural network-based methods, and its OA is more than 20% lower than that of the EMAP-Cubic-Caps proposed in this paper. Compared with the DR-CNN and SSRN algorithms, the OA of the proposed EMAP-Cubic-Caps increased to 98.20%, which is 2.55% and 8% higher than DR-CNN and SSRN, respectively. Due to the relatively simple convolutional part of its network, the 3D-Caps method has insufficient ability to extract features from hyperspectral images, resulting in a classification accuracy lower than that of the DR-CNN and SSRN methods; however, its class accuracies on the 5th, 12th and 13th categories were higher than those of the other methods, which shows that it has certain advantages. In terms of average accuracy, the proposed EMAP-Cubic-Caps method obtained the highest AA value, which shows that the per-class classification results obtained by EMAP-Cubic-Caps are satisfactory. Specifically, in the Indian Pines data set, the samples of the 1st, 7th, 9th and 16th categories are extremely unbalanced, which is a big challenge for the classifiers. In the 1st and 7th unbalanced categories with small sample sizes (for both categories, only one sample is selected for training), the proposed EMAP-Cubic-Caps achieved the best class accuracies. In the 9th and 16th categories, the performance of the EMAP-Cubic-Caps classifier was relatively mediocre, and its class accuracies for those two categories were only higher than those of the EMAP-SVM method. The reason may be that, in the process of EMAP feature extraction, for very few unbalanced and unevenly distributed samples, the patch-wise extraction may weaken the distinguishability of these samples, thus decreasing the class accuracies of those categories. In addition, the Kappa coefficient of the proposed EMAP-Cubic-Caps method was also the highest, at 0.9765.
The Cubic-Caps algorithm without EMAP feature extraction also obtained good classification results, and its OA exceeded that of the CNN-based methods by more than 1%. It is slightly inferior to the proposed EMAP-Cubic-Caps method, which indicates that the capsule network based on the cubic convolutional network still has good performance, but the EMAP features can prompt the network to extract richer discriminative features, thereby making the classification accuracy higher. Moreover, among the single-class results, the proposed EMAP-Cubic-Caps method achieved the highest classification accuracy in the 1st, 4th, 7th, 8th and 13th categories. To summarize, the proposed EMAP-Cubic-Caps has the best performance among all competitors.
Figure 6 shows the classification maps of the six algorithms mentioned in this article. It is obvious that the classification map of EMAP-SVM is the most unsatisfactory and has the most noise, because it only combines EMAP and SVM, so that only shallow features are fused for classification. The two algorithms based on convolutional neural networks, i.e., DR-CNN and SSRN, achieved relatively satisfactory results, but there is still some noise in their classification maps. Due to the relatively simple convolutional part of 3D-Caps, its classification map is poorer than those of the DR-CNN and SSRN methods. The proposed Cubic-Caps and EMAP-Cubic-Caps have the least noise over the whole maps and achieve the highest degree of consistency with the distribution of land covers, especially the proposed EMAP-Cubic-Caps method. Benefitting from the spatial geometric features of EMAP, the results of EMAP-Cubic-Caps on the Alfalfa and Buildings-grass-trees-drive classes are significantly better than those of the other methods, which proves that the features of the vectors encapsulated into capsules better represent the information of the land cover, so the classification results are better than those of the other competitors. It can be concluded that the observations drawn from the classification maps are highly consistent with the results of the quantitative evaluation. Table 5 shows the classification accuracies of the six algorithms on the University of Pavia dataset.
Among them, the results of EMAP-SVM are not satisfactory: its OA is about 20% lower than the methods based on convolutional neural networks and 22.36% lower than the proposed EMAP-Cubic-Caps method. Compared with the two convolutional neural network methods, i.e., the DR-CNN and SSRN methods, the OA of our EMAP-Cubic-Caps method increased from 94.40% and 95.15% to 98.81%, an increase of more than 3%. The 3D-Caps algorithm based on the three-dimensional convolutional capsule network has only two convolutional layers in its convolutional network part; its ability to extract low-level features of the HSI is poor, so its OA was only 88.30%. In terms of average accuracy (AA), the proposed EMAP-Cubic-Caps method achieved the highest value of 98.49%, which validates that the algorithm in this paper has great discrimination in classifying most of the instances. In addition, for class accuracies, the EMAP-Cubic-Caps method essentially achieved the best performance; in particular, for the 1st, 2nd, 3rd, 5th, 6th and 8th categories of the University of Pavia dataset, the proposed EMAP-Cubic-Caps method achieved the highest accuracy. Moreover, the Kappa coefficient of the EMAP-Cubic-Caps method increased from the second-highest value of 0.9364 (SSRN method) to 0.9842, indicating that the proposed classifier has better intra-class consistency. Even though the proposed Cubic-Caps method does not perform EMAP feature extraction, it also obtained good results, only slightly inferior to those of the EMAP-Cubic-Caps classifier. Again, this further illustrates the advantages of the EMAP-Cubic-Caps method in hyperspectral image classification tasks.
Table 5. Classification accuracies of the six competing methods on the University of Pavia dataset (the optimal results are shown in bold).
Figure 7 illustrates the classification maps of the six algorithms on the University of Pavia dataset.
Among them, the classification result of the EMAP-SVM algorithm still contains a lot of noise. The other five algorithms, all based on neural networks, achieve better results in classifying the corresponding land covers. Among those five network-based classifiers, the performance of 3D-Caps is slightly worse: many pixels belonging to the Meadows category are classified into the Bare Soil category. The classification map of the proposed EMAP-Cubic-Caps method has the least noise and best restores the distribution of the corresponding land covers. In addition, the results of EMAP-Cubic-Caps on the Self-Blocking Bricks and Bare Soil classes are significantly better than those of the other methods. This proves that vector-encapsulated capsules represent land-cover information more strongly than scalar neurons, which further illustrates the advantages of the capsule network. Table 6 shows the quantitative classification accuracies of the six algorithms on the Salinas dataset.
Similarly, the results of EMAP-SVM fall far behind those of the neural-network-based algorithms, especially in the Grapes_untrained and Vineyard_untrained categories; its performance was poor, and its overall accuracy (OA) was the lowest. Compared with the two CNN-based algorithms, i.e., the DR-CNN and SSRN methods, the proposed EMAP-Cubic-Caps classifier improved the OA by 5.4% and 3.11%, respectively, and the AA by 2.6% and 2.1%, respectively. Compared with the 3D-Caps method based on the 3D convolutional capsule network, the OA of EMAP-Cubic-Caps increased by nearly 10%, and the AA by nearly 5%; the Kappa coefficient was also improved. This verifies that the proposed EMAP-Cubic-Caps method can achieve stable classification results with more categories and has better generalization performance in the classification of multi-category land cover. The experimental results of the proposed Cubic-Caps method are slightly worse than those of the proposed EMAP-Cubic-Caps classifier, which again validates that, with the help of EMAP features, the classification accuracy can be effectively improved. This comparison makes clear that the proposed Cubic-Caps and EMAP-Cubic-Caps methods have superior advantages. Figure 8 shows the classification maps of the above competing methods on the Salinas dataset.
Among them, the classification map of the EMAP-SVM algorithm has the worst performance. There is also some noise in the DR-CNN and SSRN results: the Fallow and Grapes_untrained classes contain more noise, while the other classes perform better. 3D-Caps performs slightly worse than the DR-CNN and SSRN methods and is noisier than the other neural-network-based algorithms. The classification map of the proposed EMAP-Cubic-Caps method has the least noise and best restores the real distribution of land covers; its results on the Fallow and Grapes_untrained classes are significantly better than those of the other methods. This again validates that vector-encapsulated capsules represent land-cover information more strongly than scalar neurons, and it once again proves the advantages of the proposed algorithm. Table 7 compares the training and testing time complexity of the different algorithms on the three datasets. Among them, DR-CNN, based on a two-dimensional convolutional network, has the most complex network structure and consumes the most training and testing time. SSRN, built on a residual network, performs well in terms of time complexity.
The 3D-Caps method, based on the 3D convolutional capsule network, is relatively simple in its convolutional part, so it consumes the shortest training and testing time. Although the time complexity of Cubic-Caps is higher than that of 3D-Caps, it performs better than the latter in classification accuracy. The ablation experiment (that is, the comparison between the Cubic-Caps and EMAP-Cubic-Caps methods) shows that Cubic-Caps has high time complexity because it is trained on raw data. In contrast, the proposed EMAP-Cubic-Caps method obtains satisfactory accuracy with good time complexity, which fully proves the superiority of the EMAP-Cubic-Caps method. Figure 9a shows the classification overall accuracy versus the number of convolutional kernels in each layer of the Cubic Network part of the EMAP-Cubic-Caps method. Clearly, when the number of convolution kernels is 3, the result is not satisfactory, and the network's ability to extract image information is insufficient. When the number of convolution kernels is 6 or 12, the classification accuracy improves continuously; when it is 24, the accuracy tends to converge. Therefore, we used 12 kernels in each convolution layer to learn the characteristics of the image, which yields optimal results. Figure 9b shows the OA versus the length of the vector of each digital capsule in the digital capsule layer.
It can be observed that, when the length of the vector is set to 6 and 9, the classification OA rises continuously. When the length is set to 12 and 15, the classification accuracy changes from rising to converging. To ensure the efficiency of the proposed algorithm, we chose 12 as the length of the vector in the digital capsule layer.
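The capsule length tuned here is the dimensionality of each digital capsule's output vector, whose norm encodes class presence after the standard capsule "squash" nonlinearity. A hedged sketch of that nonlinearity follows; it uses the generic capsule-network formulation, not necessarily the exact implementation of the proposed model:

```python
import numpy as np

def squash(v, eps=1e-9):
    """Capsule squash: shrink a vector's norm into [0, 1) while keeping
    its direction, so the norm can act as a class-presence probability.

    squash(v) = (|v|^2 / (1 + |v|^2)) * v / |v|
    """
    v = np.asarray(v, dtype=float)
    sq_norm = np.sum(v * v, axis=-1, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * v / np.sqrt(sq_norm + eps)

# A 12-dimensional digital capsule vector, the length chosen above
cap = np.ones(12)
out = squash(cap)
```

A longer capsule vector can encode more instantiation detail per class, at the cost of more routing parameters, which is why accuracy saturates beyond length 12.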

Conclusions
In this paper, a cubic capsule network with EMAP features (EMAP-Cubic-Caps) is proposed to classify hyperspectral images, which can effectively alleviate the insufficient spatial-spectral feature extraction of most convolutional neural networks on hyperspectral images. The EMAP-Cubic-Caps network is composed of EMAP feature extraction, a cubic convolutional network, an initial capsule layer and a digital capsule layer. EMAP first extracts three geometric structural features from the three principal components of the original hyperspectral image. The function of the cubic convolutional network is to extract the spatial-spectral features of the image from three planes of the cube. The two capsule layers further use vector-encapsulated capsules to extract richer and more accurate deep features, thereby improving the classification accuracy of HSIs. Through experimental comparison, it is verified that the performance of the proposed EMAP-Cubic-Caps method is better than that of several state-of-the-art CNN-based methods. In addition, compared with 3D-Caps, the proposed EMAP-Cubic-Caps method improves significantly in all accuracy terms. Specifically, the advantage of the proposed EMAP-Cubic-Caps method is that it can fully extract geometric morphological features and integrate them into the capsule network to better express the features of the ground cover. It performs well in classifying scenes with few samples and rich geometric detail (for example, a local area with rich detailed information and diverse shapes). In the ablation experiment, that is, in comparison with the 3D-Caps method, the proposed EMAP-Cubic-Caps network fully extracted the low-level features of the hyperspectral image before the capsule layer, which verifies that the performance of the model trained using EMAP features is better than that of the model trained on the original data.
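The "three planes of the cube" idea summarized above can be illustrated by slicing a patch along its three orthogonal axes. The function name, patch size, and feature count below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def three_plane_views(patch):
    """Given an H x W x B patch of a (hyperspectral) feature cube, return its
    three central orthogonal slices: one spatial plane (H x W) and two
    spatial-spectral planes (H x B and W x B). Illustrative sketch only."""
    h, w, b = patch.shape
    spatial = patch[:, :, b // 2]    # H x W plane at the middle band
    vertical = patch[:, w // 2, :]   # H x B spatial-spectral plane
    horizontal = patch[h // 2, :, :] # W x B spatial-spectral plane
    return spatial, vertical, horizontal

# e.g., a 9 x 9 spatial patch with 30 stacked EMAP feature bands (assumed sizes)
patch = np.random.rand(9, 9, 30)
spa, ver, hor = three_plane_views(patch)
```

Convolving over such orthogonal views exposes both within-band spatial structure and across-band spectral structure to the network, which is the motivation for the cubic design.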