An Advanced Spectral–Spatial Classification Framework for Hyperspectral Imagery Based on DeepLab v3+

Abstract: The DeepLab v3+ neural network shows excellent performance in semantic segmentation. In this paper, we propose a segmentation framework based on the DeepLab v3+ neural network and apply it to the problem of hyperspectral imagery classification (HSIC). The dimensionality of the hyperspectral image is first reduced using principal component analysis (PCA). DeepLab v3+ is then used to extract spatial features, which are fused with spectral features. A support vector machine (SVM) classifier is used for fitting and classification. Experimental results show that the proposed framework outperforms most traditional machine learning algorithms and deep-learning algorithms on hyperspectral imagery classification tasks.


Introduction
With the development of remote sensing technology in the late 20th century, the emergence of hyperspectral imagery (HSI) technology has attracted widespread attention in various fields. Unlike natural images and multispectral images, hyperspectral images have very high spectral resolution, generally on the order of $10^{-2}\lambda$. A hyperspectral imager can record hundreds of channels of spectral band information from the ultraviolet to the mid-infrared. This rich spectral information has become the focus of many fields such as land monitoring, crop growth, petroleum exploration, and military target recognition.
Hyperspectral imagery classification (HSIC) is more difficult than natural image classification for two primary reasons [1]: (a) the spectral bands of hyperspectral images are high-dimensional, which increases the computational complexity; moreover, according to the Hughes phenomenon, as the number of dimensions increases, the classification accuracy first increases and then decreases; (b) the spatial resolution of hyperspectral images is very low, with one pixel typically representing a ground distance of tens of meters; therefore, the number of images in each dataset is very small, and extracting spatial features is more difficult.
At present, there are several different ideas for solving the problem of the excessive dimensionality of hyperspectral datasets. First, the dimensionality of the spectral bands can be reduced using one of two broad strategies: projection and manifold learning. Principal component analysis [2] is a typical linear dimensionality reduction method based on projection; its purpose is to map high-dimensional data to a low-dimensional space through a certain linear projection.
The main contributions of this paper are as follows. (1) We utilize DeepLab v3+ as the neural network structure to extract spatial-domain features and merge them with spectral-domain features. DeepLab v3+ is the fourth generation of the DeepLab series of semantic segmentation networks developed by Google and has the best comprehensive performance so far. We are the first to apply the latest version of the DeepLab network to the hyperspectral image classification task. (2) We use PCA to reduce the dimensionality of the original hyperspectral image. In the spatial feature extraction stage, we select the first three principal components and the first principal component as training data and labels [16], respectively, to solve the dimensionality problem of the hyperspectral imagery. (3) We select different classifiers, such as SVM and KNN, to complete the hyperspectral image classification task, and we test and compare the classification accuracy under different conditions.

DeepLab v3+
DeepLab v3+ [17] is the latest work in the DeepLab series of semantic segmentation networks; its predecessors are DeepLab v1, v2, and v3. The DeepLab series uses deep convolutional neural networks (DCNNs), which have excellent translation invariance and therefore good image-level classification capabilities. However, it is difficult for DCNNs to deal with pixel-level classification, which is precisely the problem semantic segmentation must solve. Therefore, in light of the following main problems, the authors [17] proposed solutions in the different versions of DeepLab.
Continuous convolution and pooling in DCNNs inevitably reduce the resolution of the feature maps, which affects the final segmentation accuracy. Inspired by [18], the authors of DeepLab adopted atrous convolution in the v2 version. The mathematical derivation of atrous (hole) convolution is described in detail elsewhere [18]; we discuss it briefly here. Figure 1 compares the two convolution schemes.

Let $F : \mathbb{Z}^2 \to \mathbb{R}$ be a discrete function, and let $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$ with $r > 0$. Let the convolution kernel be $k : \Omega_r \to \mathbb{R}$; the size of the kernel is then $(2r+1)^2$. The discrete convolution $*$ can be defined from the above conditions:
$$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}).$$
When considering the atrous convolution, let $l$ be the hole (dilation) coefficient. The atrous convolution $*_l$ with coefficient $l$ can be defined as:
$$(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}).$$
Secondly, another major problem with DCNNs is that, due to the fixed parameters of the fully connected layer at the end of the network, the input images need to have a fixed size. The usual solution is to crop or scale the input image; however, cropping and scaling deform the target object in the picture, reducing recognition accuracy. To address this problem, atrous spatial pyramid pooling (ASPP), which combines the design ideas of atrous convolution and spatial pyramid pooling (SPP) [19], is added to the DeepLab series of networks, as represented in Figure 2.
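The effect of the hole coefficient is easy to see in code. Below is a minimal sketch (not from the paper) using PyTorch's built-in dilation parameter: a 3 × 3 kernel with $l = 2$ covers a 5 × 5 receptive field while the output resolution is unchanged.

```python
# Minimal sketch: atrous (dilated) convolution enlarges the receptive
# field without reducing feature-map resolution.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # dummy input: batch=1, 3 channels, 64x64

# Standard 3x3 convolution: 3x3 receptive field.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Atrous 3x3 convolution with hole coefficient l=2: 5x5 receptive field,
# same output resolution (padding = dilation keeps the size for k=3).
atrous = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape)    # torch.Size([1, 8, 64, 64])
print(atrous(x).shape)  # torch.Size([1, 8, 64, 64])
```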
The spatial pyramid pooling structure is generally placed before the last fully connected layer of the network. When a feature map of size $(w, h)$ output by the previous layer is passed into the SPP, it is divided multiple times: the same image is divided into $m \times n$ image blocks of size $(w/m, h/n)$ by the division scale $(m, n)$. Each image block of each division is pooled separately, and the results are joined and linked to a fully connected layer. The biggest advantage of this structure is that it solves the problem of the input picture's size: because the input size is flexible, the network can extract aggregated features under different pooling window sizes, improving classification accuracy. Analogously, in the atrous spatial pyramid pooling structure, each feature map is convolved separately by kernels with different dilation rates, and the results are concatenated.
In the v3+ version, in order to integrate multiscale information, the v3 network is used as an Encoder, and the spatial resolution is restored by a Decoder structure; this Encoder-Decoder model is commonly used in semantic segmentation. In testing, DeepLab v3+ showed good segmentation performance: on the PASCAL VOC 2012 dataset it achieved an mIoU of 89.0%, the best performance so far. Therefore, we choose DeepLab v3+ as the main network for feature extraction in our research. Figure 3a shows the network structure of DeepLab v3+.
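To make the ASPP idea concrete, the following is a simplified sketch of such a module; the dilation rates (6, 12, 18), the 1 × 1 branch, and the image-level pooling branch follow the general DeepLab v3+ design, but channel counts and other details here are assumptions for illustration, not the exact implementation used in this paper.

```python
# Simplified ASPP sketch: parallel atrous convolutions with different
# dilation rates, plus a 1x1 branch and an image-level pooling branch,
# concatenated and projected back down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # Image-level branch: global average pooling + 1x1 convolution.
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.pool(x), size=(h, w), mode='bilinear',
                          align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))

y = ASPP(256, 64)(torch.randn(1, 256, 33, 33))
print(y.shape)  # torch.Size([1, 64, 33, 33])
```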
Figure 3b shows our proposed framework for hyperspectral image classification. The framework can be roughly divided into three parts: PCA dimensionality reduction, feature extraction and fusion, and the use of classifiers to achieve pixel-level classification. The specific details and mathematical principles of the framework are as follows.


Proposed Framework
The unprocessed hyperspectral image is regarded as a series of samples. Assume the hyperspectral image is represented by a matrix of size $h \times w \times n$, where $h \times w$ is the resolution of the image and $n$ is the number of spectral channels. If each pixel $x_i \in \mathbb{R}^n$ is represented by a column vector, the goal of dimensionality reduction is to reduce $X \in \mathbb{R}^{n \times hw}$, composed of the $h \times w$ samples, to $\hat{X} \in \mathbb{R}^{k \times hw}$ with $k \le n$. The process of principal component analysis is as follows.
Let $u_i$ be the $i$-th principal component direction. The optimization objective is
$$\max_{u_i} \sum_{j=1}^{hw} \left(x_j^{\top} u_i\right)^2,$$
where the projection length of $x_j$ on $u_i$ is $x_j^{\top} u_i$. Let $\|u_i\| = 1$, and recall that $X = [x_1, x_2, \cdots, x_{hw}]$. Hence, it can be shown that
$$\sum_{j=1}^{hw} \left(x_j^{\top} u_i\right)^2 = u_i^{\top} X X^{\top} u_i.$$
The matrix $XX^{\top} \in \mathbb{R}^{n \times n}$ is positive definite. Therefore, $u_i^{\top} X X^{\top} u_i$ is a positive definite quadratic form and has a maximum value. Under the constraint $u_i^{\top} u_i = 1$, the Lagrange multiplier method gives $XX^{\top} u_i = \alpha u_i$: $\alpha$ is an eigenvalue of the matrix $XX^{\top}$, and the objective function attains its maximum when $u_i$ is the eigenvector corresponding to the largest $\alpha$. We thus obtain $u_i$ as the eigenvector corresponding to the $i$-th largest eigenvalue, where $u_i$ is the $i$-th principal component. Taking the eigenvectors of the first $k$ eigenvalues, we form the matrix $U = [u_1, u_2, \cdots, u_k] \in \mathbb{R}^{n \times k}$. Finally, a dataset reduced from $n$ dimensions to $k$ dimensions is obtained by
$$\hat{X} = U^{\top} X.$$
Using an established method [16], the dimensionality of the original hyperspectral dataset is reduced by PCA. As a dimensionality reduction method, PCA reduces the dimensionality of the data with very little information loss. PCA is introduced to avoid computationally intensive three-dimensional convolution in the feature extraction process; without PCA, three-dimensional convolution would generally have to be applied to all data, including both the training and test sets. Therefore, the introduction of PCA theoretically does not bias the results.
We take the first three principal components (k = 3) as the dataset and the first principal component (k = 1) as the label. All data processed by PCA are converted into uint8 format; therefore, in theory, they are all integers in the range [0, 255]. The main reason for using the first principal component as the label is that, after dimensionality reduction, the redundant spectral-domain information is removed and only a single channel is retained. This process preserves the spatial feature relationships of the objects represented by each pixel, which is an essential element for extracting spatial features.
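The PCA step described above can be sketched as follows. This is a minimal illustration using scikit-learn; the function name pca_reduce and the min-max rescaling to uint8 are our assumptions for demonstration, not the paper's exact code.

```python
# Hedged sketch of the PCA step: reduce an (h, w, n) hyperspectral cube
# to k principal components, then rescale to uint8 as described above.
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube, k):
    """cube: (h, w, n) hyperspectral image; returns (h, w, k) uint8 array."""
    h, w, n = cube.shape
    flat = cube.reshape(-1, n)                      # h*w samples, n bands
    comps = PCA(n_components=k).fit_transform(flat)
    # Min-max rescale each component to [0, 255] and cast to uint8.
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    comps = (comps - lo) / (hi - lo) * 255.0
    return comps.reshape(h, w, k).astype(np.uint8)

cube = np.random.rand(610, 340, 103)    # e.g., Pavia University dimensions
data = pca_reduce(cube, k=3)            # first three PCs -> training data
label = pca_reduce(cube, k=1)[..., 0]   # first PC -> label map
```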
To expand the number of samples and thereby extract spatial feature information more fully, we use an established method [20] to crop the reduced-dimensional images and labels. The size of the cropping window is 45 × 45 pixels and the stride is 5 pixels; the pictures and labels are cropped from left to right and top to bottom. The cropped pictures and corresponding labels are then used as the training set.
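A minimal sketch of this sliding-window cropping is shown below; it reuses the data and label arrays from the PCA sketch above and assumes the window simply stops at the image border.

```python
# Sliding-window cropping: 45x45 windows, stride 5, scanned
# left-to-right and top-to-bottom over the image and its label map.
import numpy as np

def crop_patches(img, lab, size=45, stride=5):
    patches, labels = [], []
    h, w = img.shape[:2]
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(img[top:top+size, left:left+size])
            labels.append(lab[top:top+size, left:left+size])
    return np.stack(patches), np.stack(labels)

imgs, labs = crop_patches(data, label)  # arrays from the PCA step above
print(imgs.shape, labs.shape)
```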
In the spatial feature extraction stage, we choose ResNet-50 as the backbone of DeepLab v3+ and set the number of labels to 256, matching the 256 possible uint8 values of the first-principal-component label. The cross-entropy loss function is the optimization target:
$$H(p, q) = -\sum_{x} p(x) \log q(x),$$
where $p$ is the expected output and $q$ is the actual result. Cross-entropy evaluates the difference between the current training probability distribution and the true distribution; minimizing the loss therefore drives the probability distribution of the actual output as close as possible to that of the expected output. Other related training parameters are given in Table 1.
After the DeepLab v3+ network is trained, the dimensionality-reduced, uncropped picture of size W × H × 3 is input into the network, and spatial features of size W × H × 256 are obtained without resolution loss. The pixel matrix of size W × H × K without dimensionality reduction is taken as the spectral-domain feature and fused with the spatial features. Since the ultimate goal is to classify each pixel, the spatial and spectral features of each pixel must be fused. The number of pixels in the picture is W × H. After normalizing the spatial and spectral features separately, the two features are stitched together along the third dimension to complete the feature fusion; the size of the fused feature is W × H × (256 + K). Finally, we select different classifiers, such as SVM and KNN, and input the fused features into each classifier for fitting and comparison.
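The fusion step amounts to per-pixel concatenation of the two normalized feature blocks. A minimal sketch is given below; the choice of StandardScaler for normalization and the RBF kernel for the SVM are assumptions, since the paper only states that the features are normalized before stitching.

```python
# Fusion sketch: normalize spatial (W x H x 256) and spectral (W x H x K)
# features separately, then concatenate per pixel along the feature axis.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse(spatial, spectral):
    """spatial: (W, H, 256); spectral: (W, H, K) -> (W*H, 256+K)."""
    W, H = spatial.shape[:2]
    a = StandardScaler().fit_transform(spatial.reshape(W * H, -1))
    b = StandardScaler().fit_transform(spectral.reshape(W * H, -1))
    return np.concatenate([a, b], axis=1)  # stitch in the feature dimension

# Usage (train/test indices and labels y assumed given):
# fused = fuse(spatial_features, spectral_features)
# svm = SVC(kernel='rbf').fit(fused[train_idx], y[train_idx])
# pred = svm.predict(fused[test_idx])
```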
In terms of evaluation, hyperspectral imagery classification generally uses overall accuracy (OA), average accuracy (AA), and the κ coefficient, all of which are calculated from the confusion matrix. OA is the ratio of correctly classified pixels to the total number of labeled pixels, AA is the mean per-class recall, and the κ coefficient measures the agreement between the classification result and the ground truth beyond chance.
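Since all three metrics derive from the confusion matrix, they can be computed in a few lines; the sketch below follows the standard definitions (scikit-learn provides equivalents such as cohen_kappa_score).

```python
# OA, AA, and kappa from a confusion matrix, per the standard definitions.
import numpy as np

def oa_aa_kappa(cm):
    """cm: square confusion matrix, rows = true classes, cols = predicted."""
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    recall = np.diag(cm) / cm.sum(axis=1)        # per-class recall
    aa = recall.mean()                           # average accuracy
    # Expected chance agreement from the row/column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```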

Experimental Results
We selected three hyperspectral datasets: Pavia University [21], Kennedy Space Center [22], and HyRANK [23]. These datasets are public and open source. Pavia University is part of the hyperspectral data imaged over the Italian city of Pavia in 2003. The image is 610 × 340 pixels with 115 bands in total; after removing water-vapor noise, 103 bands remain, covering the wavelength range 0.43-0.86 µm. The data contain nine label classes, totaling 42,776 labeled pixels. The Kennedy Space Center dataset was captured by the AVIRIS sensor on 23 March 1996; it contains 176 spectral bands after removing water-vapor noise and has 13 label classes. The HyRANK dataset was acquired by the Hyperion sensor, and each image has 176 spectral bands. To test the generalizability of the feature extraction framework, the two images with ground-truth labels in this dataset, Loukia and Dioni, were selected; the former includes 14 label classes and the latter 12. We randomly selected 5% of the pixels of each class in Pavia University and Kennedy Space Center, and 10% in HyRANK, as the training set, and used the rest as the test set.
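The per-class random sampling can be sketched as follows; the function name and the fixed random seed are our assumptions for reproducibility of the illustration.

```python
# Per-class random split: draw a fixed fraction of the labeled pixels of
# every class for training; the remainder forms the test set.
import numpy as np

def per_class_split(labels, frac, rng=np.random.default_rng(0)):
    """labels: 1-D array of class ids (unlabeled pixels already excluded)."""
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        train_idx.extend(idx[:max(1, int(frac * len(idx)))])
    train_idx = np.array(train_idx)
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# 5% for Pavia University / Kennedy Space Center, 10% for HyRANK:
# tr, te = per_class_split(y, frac=0.05)
```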
Each pixel in a hyperspectral remote sensing image represents a ground distance of roughly tens to hundreds of meters. In other words, each pixel can be regarded as a sample, and a single image contains tens of thousands of such samples. Each sample carries spectral band information, and the samples also have spatial relationships with one another. Therefore, the data provided by a hyperspectral image are sufficient.
The influence of different classifiers and feature types on the classification results is examined separately by controlling variables. In this process, we use the Pavia University and Kennedy Space Center datasets and compare the experimental results with those reported in the literature. To verify whether the proposed framework is equally effective on a new dataset in the feature extraction stage, we use the image Loukia in the HyRANK dataset as the training set of DeepLab v3+ and the image Dioni as the test set.

Comparison Between Different Classifiers
In this study, five machine learning classifiers, KNN, logistic regression, decision tree, naive Bayes, and SVM, were selected, and established classification algorithms [14,16], such as SRC-T (SRC classifier with diagonal weight matrix T), ELM, SVM-RBF (SVM classifier with radial basis function), SAE-LR (stacked autoencoders with logistic regression), and CNN, were compared with our framework. Tables 2 and 3, respectively, show the per-class classification accuracy of the different classifiers on the Pavia University and Kennedy Space Center datasets. The results show that, compared with the other four classifiers, the SVM classifier performs better on both single-class and overall indicators. Different machine learning classifiers have different theoretical bases, and their classification accuracy is also constrained by many external factors. For example, the random forest classifier does not perform well when fitting data with many feature categories. The naive Bayes algorithm, often used in text classification, is very sensitive to the representation of the input data and therefore performs poorly in the HSIC task. The SVM classifier differs from the other classification algorithms in that it does not rely on probability measures or the law of large numbers (LLN); the support vectors play the decisive role in classification. Its optimization goal is to minimize structural risk rather than empirical risk, avoiding the traditional process from induction to deduction. By exploiting key samples to complete its inference process, SVM shows superior classification performance and robustness in hyperspectral imagery classification tasks. SVM is therefore selected as the classification algorithm of the framework proposed in this paper, and the classifier is fixed in the following comparative experiments.
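The classifier comparison itself is a simple controlled loop over the five models; the sketch below shows one way to set it up with scikit-learn, where all hyperparameters are illustrative assumptions rather than the settings used in the paper.

```python
# Classifier comparison sketch: fit each model on the same fused features
# and report test accuracy.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

classifiers = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'LogReg': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(),
    'NaiveBayes': GaussianNB(),
    'SVM': SVC(kernel='rbf'),
}
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))
```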
To show the superiority of the framework, a horizontal comparison experiment is needed. Tables 4 and 5, respectively, show the differences in classification accuracy between this framework and previous algorithms [14,16] on the Pavia University and Kennedy Space Center datasets. To the best of our knowledge, the accuracy of the algorithm in this paper reaches the state of the art.

Comparison of Spectral Features, Spatial Features and Fusion Features
One of the important reasons the framework achieves high classification accuracy is the idea of feature fusion. To verify the impact of different features on hyperspectral imagery classification, we use spectral features, spatial features, and fusion features as feature vectors, respectively, to train the SVM classifier and compare the classification accuracy. Figure 4 shows the classification results obtained with the different feature types on the Pavia University and Kennedy Space Center datasets. When the fusion feature is used as the feature vector, the best classification performance is achieved; with only spectral features or only spatial features, the classification accuracy decreases to varying degrees. Generally, it is reasonable to believe that when the feature information of a category is mined more fully, the resulting higher-dimensional feature vector has a more positive impact on classification and therefore improves accuracy. Across label classes, the accuracy of the fusion features also fluctuates less, which improves the robustness of the framework.

The spatial features extracted by the DeepLab v3+ network have a positive impact on subsequent classification. The classification framework based on DeepLab v3+ surpassed CNN, ELM, and the other algorithms in this experiment. It extracted the explicit and implicit spatial relationships between the pixels of each category and other pixels. The extracted spatial features satisfy the requirements of the classification task, which also indirectly reflects the powerful spatial feature extraction capability of DeepLab v3+ and its wide applicability.

Generalization Verification
In an ideal situation, the image after principal component extraction would be fed to the DeepLab v3+ neural network, and the spatial feature information would be obtained after training. In real scenes, however, much of the data to be classified is unlabeled, which requires that the trained network also be able to extract spatial features from unseen images. Therefore, the two labeled images in the HyRANK dataset are used to verify the generalization of the feature extraction network: the image Loukia is used as the known dataset, and the image Dioni as the unknown dataset. After feature extraction and fusion, the classification results are obtained. Table 6 shows the classification results on the known dataset Loukia and the unknown dataset Dioni with different classifiers. The results show that the classification accuracy on Loukia is relatively mediocre. The reason is that the regions belonging to each class are scattered and each region is small, and different classes are mixed together, which makes classification very difficult. In addition, the spectral differences between the ground-cover classes are not distinct enough, which also contributes to the poorer classification.
However, although Dioni is a new dataset for the feature extraction network, the framework can still perform the classification task very well. Under the same environmental conditions, the framework shows strong classification ability on hyperspectral data collected by the same sensor. Therefore, the framework has strong generalization ability and good application prospects.

Visualization
To understand intuitively the impact of different factors on the classification, we also performed a series of data visualizations. First, we use t-SNE to reduce the feature vector of each pixel to two dimensions and project it onto a 2-D coordinate system; by observing the degree of dispersion of each category, we can judge the difficulty of classification. Figure 5 shows the distributions of pixels with spectral, spatial, and fusion features for the Pavia University, Kennedy Space Center, and HyRANK datasets. As the figure shows, the pixels with fusion features have the best within-class aggregation and the least overlap between categories, making them the easiest to classify. This demonstrates the importance and efficiency of the feature fusion method in hyperspectral imagery classification. Similarly, the pixel distribution of Dioni is better aggregated than that of Loukia.
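A minimal sketch of this projection is given below; the t-SNE settings (PCA initialization, fixed seed) and the plotting details are illustrative assumptions, and in practice a subsample of pixels is typically embedded since t-SNE scales poorly with sample count.

```python
# t-SNE projection sketch for Figure-5-style plots: embed per-pixel
# feature vectors in 2-D and color them by class.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """features: (n_pixels, d); labels: (n_pixels,) class ids."""
    emb = TSNE(n_components=2, init='pca',
               random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap='tab20')
    plt.title(title)
    plt.show()

# plot_tsne(fused_subset, y_subset, 'Fusion features')
```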
Figures 6 and 7, respectively, show the prediction classification maps for the Pavia University and Kennedy Space Center datasets under different factors. Spatial feature extraction with the DeepLab v3+ network, fusion of the spatial and spectral features, and fitting and prediction with the SVM classifier together produce the best classification map. Figure 8 shows the prediction classification maps for Loukia and Dioni. Compared with Dioni, the category distribution in Loukia is more scattered and complex; therefore, it is more difficult to classify.


Conclusions
In this paper, a hyperspectral imagery classification framework based on DeepLab v3+ is proposed. In the framework, the DeepLab v3+ neural network is used for spatial feature extraction, and the spatial and spectral features are fused. Finally, the SVM classifier is selected from among several classifiers as the classification method. Compared with traditional machine learning algorithms and convolutional neural network algorithms on the same datasets, our proposed framework significantly improves both classification accuracy and classification efficiency. Experimental results show that DeepLab v3+ has excellent spatial feature extraction capabilities and applicability in hyperspectral imagery classification, and that the feature fusion method effectively improves classification accuracy. Experiments also show that the proposed framework generalizes well, with classification accuracy better than that of other traditional machine-learning and deep-learning algorithms.