Multiscale Deep Spatial Feature Extraction Using Virtual RGB Image for Hyperspectral Imagery Classification

In recent years, deep learning has been widely used in hyperspectral image classification and has achieved good performance. However, deep networks need a large number of training samples, which conflicts with the limited labeled samples available for hyperspectral images. Traditional deep networks usually treat each pixel as a separate subject, ignoring the integrity of the hyperspectral data, and methods based on feature extraction are likely to lose the edge information that plays a crucial role in pixel-level classification. To overcome the shortage of annotated samples, we propose a new method for building a three-channel image (the virtual RGB image), so that networks trained on natural images can be used to extract spatial features. Through the trained network, the hyperspectral data are processed as a whole. Meanwhile, we propose a multiscale feature fusion method that combines both detailed and semantic characteristics, thus improving classification accuracy. Experiments show that the proposed method achieves better results than the state-of-the-art methods. In addition, the virtual RGB image can be extended to other hyperspectral processing methods that need three-channel images.


Introduction
The rapid development of remote sensing technology in recent years has opened a door for people to more profoundly understand the earth. With the development of imaging technology, hyperspectral remote sensing has become one of the most important directions in the field of remote sensing. Because of their rich spectral information, hyperspectral images have been widely used in environmental monitoring, precision agriculture, smart city, information defense, resource management and other fields [1][2][3]. Hyperspectral classification is an important research branch of hyperspectral image processing, which assigns each pixel its corresponding ground category label [4].
Since the sample labeling of hyperspectral images is very difficult, how to use finite samples to obtain higher classification accuracy becomes the main problem in the field of hyperspectral classification [5]. Researchers have conducted in-depth research on this issue. At present there are mainly two directions, one is to extract more expressive features from hyperspectral images [6], groups of images for network training and realizing the classification of hyperspectral images. In this method, the connection between the bands of each spectrum is ignored, which may cause loss of spectral information.
To address the above problems, this paper proposes a multiscale deep spatial feature extraction method using a virtual RGB image (MDSFV). Hyperspectral images are converted into virtual RGB images, which makes the distribution of their color features more similar to that of the natural images used to train the network. The image is fed into the trained FCN model and multiscale features are extracted. By combining these features in a skip-layer way, the semantic information in deep features and the edge and detail information in shallow features are taken into account at the same time, which is more conducive to the pixel-level classification of hyperspectral images. In this process, we combine the layers with layer-by-layer normalization in order to balance the information from different layers. Finally, the fused spatial-spectral features are sent to the classifier to obtain the final distribution of ground objects.
The main contributions of this paper are summarized as follows:

1. A new three-channel image is constructed to overcome the input limitation of FCN models. The hyperspectral bands corresponding to the RGB wavelengths are selected. By simulating the Gaussian response of photographic sensing in the RGB bands, we introduce Gaussian weights to combine the corresponding bands. Compared with simply extracting the first three principal components, the virtual RGB image is better suited to the trained network and is expected to yield more useful features. In addition to benefiting the deep feature extraction in this paper, this method can also be applied to any hyperspectral processing algorithm that needs to construct three-channel images.

2. Based on fully convolutional networks, a multi-layer feature fusion method is proposed. Compared with previous methods, in which the spatial feature extraction relies directly on the feature fusion of the network itself, the proposed method operates directly on the features of different layers, thus reducing the loss of classification-related information. After the multi-layer spatial features are extracted by the FCN, features of different scales are combined via upsampling, cropping, combining, etc. Deep features provide more semantic information, which is more beneficial to the discrimination of categories, while shallow features provide more edge and detail information, which facilitates the expression of the contours of ground objects. The cross-layer joint preserves both semantic and detail information, making the feature more expressive.

3. A new joint of features is applied to handle layers that differ in dimension and semantic scale. After the scales of the features are unified, instead of simply concatenating the features, the method unifies their dimensions by extracting the principal components. The principal component analysis (PCA) based transformation retains as much information as possible. The features of the different layers are then normalized and added layer by layer. Adding corresponding layers directly avoids the increase in feature dimension caused by concatenation and is more conducive to accurate and rapid classification.
The rest of this paper is organized as follows. In Section 2, we give a detailed description about the proposed method. In Section 3, experiments on three popular datasets are provided. In Section 4, we analyze the parameters involved in the algorithm. We conclude this paper in Section 5.

Materials and Methods
Labeled hyperspectral image data are very limited. In addition, the imaging conditions, the number of spectral bands and the ground objects differ significantly between hyperspectral images, so different hyperspectral datasets cannot be trained together in the way natural images and other remote sensing images can. The fully convolutional network can be used to classify hyperspectral images because its task is to perform pixel-level class determination over the entire image, which is consistent with the goal of hyperspectral image classification. The FCN has many parameters, so a single hyperspectral image is not enough to update all of them. By constructing a three-channel virtual RGB image, this paper reuses the network model trained on natural images for pixel-level segmentation and better matches the characteristics of existing models. In this way, multi-layer and multiscale spatial features are extracted, and the multiscale feature joint is realized through various feature processing techniques, which enhances the feature expression ability. Finally, the spatial and spectral features are fused together to realize the classification of hyperspectral images. The procedure of the method is shown in Figure 1. It mainly includes three-channel image construction, multi-layer multiscale feature extraction and jointing, and feature fusion and classification.

Figure 1. The procedure of the multiscale deep spatial feature extraction using a virtual RGB image method (MDSFV). The bands corresponding to the RGB wavelengths are selected and combined into a virtual RGB image, then the image is fed into the trained fully convolutional network (FCN) model to extract the multiscale features. The blue box shows the structure of the FCN convolution section and the orange box shows the skip-layer feature fusion section, which is detailed in the next figure. The multiscale features are joined to obtain the spatial feature, and the spectral feature is fused at the end for classification.

Virtual RGB Image Construction
Hyperspectral imaging spectrometers can form approximately continuous spectral curves over tens or even hundreds of bands, including the red, green and blue bands of visible light and some near-infrared bands. In recent years, many scholars have used networks well trained on natural images, such as CNN [50] and FCN [47], to extract features from hyperspectral images. The common way is to perform PCA on the whole spectrum of the hyperspectral image and then select the first three principal components to form a three-channel image that is fed into the networks [47,49]; the detailed process can be found in [47]. In this way, the difference between the bands of hyperspectral images vanishes, and the advantages of a wide spectral range and narrow imaging bands may be lost. Here, considering that the existing model is trained on natural RGB (three-band) images, it may be more suitable for spatial feature extraction from the bands at the corresponding RGB wavelengths. So we construct a virtual RGB image. Then we can get the weight of the band s_k is The resulting synthesized band reflection value is

c. To keep the gray-value range consistent with that of natural images, the R, G and B gray values of all pixels are adjusted to the range 0-255.
Similarly, we can get the equivalent gray values of the G band and the B band. So far, we have obtained a virtual RGB image that simulates a natural RGB image, which will be used as the basis for spatial feature extraction.
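The construction can be illustrated with a minimal sketch. The wavelength limits of the three channels, the placement of the Gaussian at the center of each range, its width, and the per-channel rescaling are all assumptions made for illustration; the names `RGB_RANGES_NM`, `gaussian_band_weights` and `virtual_rgb` are hypothetical and do not come from the method's implementation.

```python
import numpy as np

# Assumed visible-light ranges in nanometres (illustrative values only).
RGB_RANGES_NM = {"R": (620.0, 750.0), "G": (495.0, 570.0), "B": (450.0, 495.0)}


def gaussian_band_weights(band_nm, lo, hi):
    """Normalized Gaussian weights for band centers lying inside [lo, hi]."""
    center = 0.5 * (lo + hi)
    sigma = (hi - lo) / 4.0  # assumed width: the range spans about +/- 2 sigma
    w = np.exp(-0.5 * ((band_nm - center) / sigma) ** 2)
    return w / w.sum()


def virtual_rgb(cube, wavelengths_nm):
    """Build a three-channel uint8 image from an (H, W, B) hyperspectral cube.

    wavelengths_nm holds the center wavelength of each of the B bands.
    """
    channels = []
    for name in ("R", "G", "B"):
        lo, hi = RGB_RANGES_NM[name]
        idx = np.where((wavelengths_nm >= lo) & (wavelengths_nm <= hi))[0]
        if idx.size == 0:
            raise ValueError(f"no bands fall in the {name} range")
        w = gaussian_band_weights(wavelengths_nm[idx], lo, hi)
        # Weighted sum over the selected bands gives one synthesized channel.
        channels.append(cube[:, :, idx] @ w)
    rgb = np.stack(channels, axis=-1)
    # Rescale every channel to 0-255 (per-channel scaling is an assumption).
    mn = rgb.min(axis=(0, 1), keepdims=True)
    mx = rgb.max(axis=(0, 1), keepdims=True)
    rgb = (rgb - mn) / np.maximum(mx - mn, 1e-12) * 255.0
    return np.rint(rgb).astype(np.uint8)
```

Replacing `gaussian_band_weights` with uniform weights would give the plain band average, which is a convenient baseline for comparison.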

Spatial Feature Extraction and Skip-Layer Jointing
The method acquires pixel-wise spatial features by fusing deep and shallow features. Here we propose a method for extracting spatial features for hyperspectral classification from models trained on natural images. We select FCN for feature extraction. The advantage of FCN is that it has the same target as hyperspectral image classification, namely pixel-wise classification. We therefore expect features from the FCN to be more useful than those from a CNN. We applied a network well trained on natural images to extract multi-layer, multiscale features. Shallow features contain more edge and detail information of the image, which is especially important for distinguishing pixel categories where different objects meet in hyperspectral images, while deep features contain more abstract semantic information, which is important for determining pixel categories. Therefore, we extracted both shallow edge texture information and deep semantic structure information, and combined them to obtain more expressive features. We selected VGG16 to extract spatial features from the virtual RGB images. The parameters were transferred from the FCN trained on ImageNet. In Figure 1, the blue box shows the structure of the convolution part.
During the pooling operations of the fully convolutional network (FCN), the down-sampling factor of the spatial features increases gradually and the semantic properties of the features become more and more abstract. The fc7 layer provides semantic information and the shallower layers provide more detailed information. We therefore chose both deep and shallow features: we extracted the features of the pool3, pool4 and fc7 layers and combined them. Their down-sampling factors relative to the original image are 8, 16 and 32, respectively. We used a layer-by-layer upsampling skip-layer joint to combine the extracted features of the three layers and obtained the final spatial features.
The joint is shown in Figure 2. It works mainly through upsampling, cropping and the joint of two layers of feature maps, applied layer by layer, with the aim of improving the ability to express spatial characteristics. It preserves the deep semantic information in the fc7 layer and simultaneously combines the edge and texture information of the shallow features to improve the ability of feature expression. Since the FCN adds surrounding zero padding to ensure full use of the edge information during convolution, the feature maps no longer align with the edges of the original image. This mismatch seriously affects the correspondence between pixels of the different feature maps, which is very important for the pixel-level hyperspectral classification task. Therefore, when combining skip-layer features we focused on the number of pixels by which the feature maps of different layers differ. The edge pixels of a feature map that fall outside the reference map are usually defined as the offset. When passing through a pooling layer, the offset halves, and when passing through a convolution layer, the offset caused is given by Equation (5). The process is mainly divided into three operations: upsampling, cropping and the skip-layer joint of the deep features. The upsampling is mainly based on bilinear interpolation, and in order not to lose edge information, a surrounding padding of 1 is added to the map. The shallow feature maps are produced by layer-to-layer convolution from the zero-padded original image, so the original image corresponds only to their central part, and the pixels in the other areas form the offset. So in the cropping, we crop the maps by half of the multiple surrounding them. Since FCN's upsampling and alignment operations are well known, we will not cover them further here. We mainly introduce the skip-layer feature map jointing operation.
By cropping, we can guarantee that the two feature maps are equal in size and position, but the dimensions of the two layers are different, so they cannot be added directly. If the two layers of features are directly concatenated, the feature dimension is multiplied, which greatly affects the computational efficiency and classification performance. Here we used principal component analysis (PCA) to reduce the dimension of the higher-dimensional feature map so that the two maps have the same dimension, and then added them layer by layer. To ensure that the two layers contribute equally to the spatial information, each dimension of the two feature maps was normalized before the addition. In the dimensionality reduction, a property of PCA itself limits the number of available principal components to at most w × h − 1, where w × h is the size of the feature map fed to PCA. So if the dimension of the shallower map is m, that of the deeper map is n (m ≤ n) and the size of the deeper map is w × h, the common dimension of the two maps is min(w × h − 1, m). Combined with the convolution process of FCN, we conducted the following analyses for the feature joint.
Offset during the convolution process.
The series of operations performed by the FCN during convolution was analyzed to obtain the scale relationship and positional association between the feature maps. According to (5), the parameters of the convolution between the layers of the FCN are shown in Table 1. In Table 1, we can see that the offset between fc7 and the original image is 0, so we used the fc7 layer as the benchmark when combining skip-layer features, and the deeper feature map was selected as the baseline when cropping.
Offset calculation between feature maps. We discuss the calculation of the offset between two layers of feature maps before and after the pooling and other convolution operations. We assume that the offset of the deep feature relative to the original image is O_d, that the offset generated by the convolution layer is O_c, and that the upsampling with padding = 1 contributes an offset of 1. The offset of the former feature layer relative to the latter layer, O_ds, is then calculated as follows: where k represents the downsampling factor between the two layers. There is one pooling layer between them, so k = 2. The detailed offsets between the layers at different feature levels are shown in Table 2.

Unified dimension of deep and shallow feature maps.
We discuss the number of principal components retained by the deeper feature map after PCA and the dimension of the shallower feature map. We apply PCA to the upsampled deeper feature to reduce its dimension, and take the final dimension as the minimum of the dimension of the shallower feature and the dimension of the deeper feature after PCA. If the dimension of the shallower feature is the larger one, PCA is also applied to it to change its dimension to that of the deeper feature after PCA.
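The dimension rule and the normalized addition can be sketched as follows for two feature maps that have already been upsampled and cropped to the same spatial size. This is an illustrative sketch under assumptions: PCA is computed through an SVD of the centered features, and each feature dimension is scaled to zero mean and unit standard deviation. The function names are hypothetical.

```python
import numpy as np


def pca_reduce(feat, d):
    """Project an (h, w, c) feature map onto its first d principal components."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c).astype(np.float64)
    x = x - x.mean(axis=0)
    # Rows of vt are the principal directions of the centered features.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return (x @ vt[:d].T).reshape(h, w, d)


def normalize(feat, eps=1e-12):
    """Scale every feature dimension to zero mean and unit standard deviation."""
    mu = feat.mean(axis=(0, 1), keepdims=True)
    sd = feat.std(axis=(0, 1), keepdims=True)
    return (feat - mu) / (sd + eps)


def skip_layer_joint(deep, shallow):
    """Unify the dimensions of two aligned maps, normalize them, and add them."""
    if deep.shape[:2] != shallow.shape[:2]:
        raise ValueError("feature maps must be cropped to the same size first")
    h, w, n = deep.shape
    m = shallow.shape[2]
    # At most h*w - 1 principal components are available after centering.
    d = min(h * w - 1, m, n)
    if n > d:
        deep = pca_reduce(deep, d)
    if m > d:
        shallow = pca_reduce(shallow, d)
    return normalize(deep) + normalize(shallow)
```

Because the two maps are added rather than concatenated, the output keeps d dimensions instead of m + n.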
The detailed process of skip-layer feature joint is as follows:

Joint fc7 and pool4
First we upsampled the fc7 layer. The fc7 and pool4 feature maps differ in size, and the padding operation produces a relative offset between the two layers. From Table 1, we know that the offset between fc7 and the original image is 0. From Table 2, the relative offset between the two layers is 5, so the pool4 layer was cropped according to this offset. PCA was performed on the upsampled fc7 layer so that the upsampled fc7 and pool4 had the same dimension. Then, we added their features layer by layer to obtain the fused feature map. We named it the fuse-pool4 layer, which was offset from the original image by 5.

Joint fuse-pool4 and pool3
We combined the fuse-pool4 layer and the pool3 layer in the same way. We upsampled the fuse-pool4 layer and unified the dimensions of the upsampled map and pool3 according to the rules for deep and shallow feature dimensions. The relative offset between the two layers was calculated to be 9, as shown in Table 2. The pool3 layer was cropped according to this offset to obtain two layers of the same size. Then they were added layer by layer to get the fused feature map fuse-pool3.

Upsample to image size
The fuse-pool3 layer was 8 times downsampled relative to the original image. We applied the upsampling process to directly upsample the layer by 8 times, and then calculated the offset between it and the original image. The offset between the upsampled feature map and the original image was 31 and we cropped the feature map after upsampling according to offset = 31, and the spatial features corresponding to the pixel level of the original image were obtained.
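The offsets used in the three steps above can be cross-checked with a small coordinate-mapping calculation. The sketch below assumes the layer settings of the publicly available Caffe FCN-8s definition: a padding of 100 on the first convolution, 3 × 3 convolutions with padding 1, 2 × 2 pooling with stride 2, a 7 × 7 fc6 convolution without padding, and upsampling kernels of 4, 4 and 16 with strides 2, 2 and 8. These values are assumptions for illustration rather than the entries of Table 1, and the helper names are hypothetical.

```python
from fractions import Fraction


def conv_or_pool(scale, shift, kernel, pad, stride=1):
    """Propagate the image-to-feature coordinate map through a conv/pool layer."""
    return scale / stride, (shift + pad - Fraction(kernel - 1, 2)) / stride


def upsample(scale, shift, kernel, stride, pad=0):
    """Propagate the coordinate map through a transposed-convolution layer."""
    return scale * stride, shift * stride + Fraction(kernel - 1, 2) - pad


def fcn8s_crop_offsets():
    """Return the crop offsets for pool4, pool3 and the final image-size map."""
    # An image pixel x is located at scale * x + shift in the current map.
    scale, shift = Fraction(1), Fraction(0)
    pool_shift = {}
    for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1):
        for i in range(n_convs):
            pad = 100 if (block == 1 and i == 0) else 1  # assumed paddings
            scale, shift = conv_or_pool(scale, shift, kernel=3, pad=pad)
        scale, shift = conv_or_pool(scale, shift, kernel=2, pad=0, stride=2)
        pool_shift[block] = shift
    scale, shift = conv_or_pool(scale, shift, kernel=7, pad=0)  # fc6
    scale, shift = conv_or_pool(scale, shift, kernel=1, pad=0)  # fc7
    # Upsample fc7 by 2 and compare its grid with pool4.
    scale, shift = upsample(scale, shift, kernel=4, stride=2)
    off_pool4 = pool_shift[4] - shift
    # fuse-pool4 keeps the grid of the upsampled fc7; upsample by 2 again.
    scale, shift = upsample(scale, shift, kernel=4, stride=2)
    off_pool3 = pool_shift[3] - shift
    # Upsample fuse-pool3 by 8 and compare with the image itself (shift 0).
    scale, shift = upsample(scale, shift, kernel=16, stride=8)
    off_image = shift
    return int(off_pool4), int(off_pool3), int(off_image)
```

Under these assumed settings the function returns 5, 9 and 31, the same crop offsets as quoted in the steps above.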
So far, we have obtained the spatial features corresponding to the original image, which combine the features of the deep neural network at 8, 16 and 32 times downsampling. They include not only the edge and detail information required for hyperspectral pixel-level classification, but also the semantic information needed to distinguish pixel categories. The feature map is aligned with the original image as closely as possible, which effectively reduces the likelihood of classification errors at the boundary between two classes. In the process of deep and shallow feature fusion, we unify the dimensions of the two layers of feature maps and then add them layer by layer. Compared with directly concatenating the features, the feature dimension is effectively reduced while the expressive ability of the layers is maintained.

Spatial-Spectral Feature Fusion and Classification
We combined the hyperspectral bands at the RGB-corresponding wavelengths to construct a virtual RGB image, and then used the FCN to extract multi-layer, multiscale features of the image. Through the skip-layer joint of these features, we obtained the spatial features that characterize each pixel and the distribution of its surroundings. However, in extracting these features, we ignored the other hyperspectral bands and the close relationship between the bands. Therefore, we extracted the spectral features associated with each pixel's spectral curve and fused them with the spatial features for classification.
Because hyperspectral bands are narrow and the sensing range is wide, the number of hyperspectral bands is huge. To keep the feature dimension from becoming too high during classification while affecting the expressive ability of the feature as little as possible, we applied PCA to the spectral curves. After the PCA, we selected the first principal components as the spectral feature of each pixel.
We combined the spatial and spectral features of the corresponding pixels. For features from different sources, the common joint is to directly concatenate the different features. Here, we considered that the range of values and the distribution of the data differ between the spatial features extracted by the network and the spectral features represented by the spectral reflection values. We therefore normalized the features and combined them according to Equations (8) and (9) to fuse the spatial and spectral features. Suppose X_spe is the spectral feature, obtained by applying PCA to the original spectrum and taking the first s_e principal components, and X_spa is the deep spatial feature with a dimension of s_a, where w × h is the size of the features to fuse. First, we normalize X_spe and X_spa as in Equation (8): we normalize each layer of the spectral and spatial features, that is, we subtract the average and then divide by the variance for the corresponding features of each pixel, so as to make the feature dimensions uniform. Then we combine the features by concatenating them; the size of the fused feature is given in Equation (9).
The values of s_e and s_a are discussed in Section 4. The resulting X_f is the fused feature, which we fed into the classifier to perform the classification.
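The fusion step can be sketched as follows. This is a minimal illustration under assumptions: PCA is computed through an SVD, and the scaling divides by the standard deviation (the usual z-score), whereas the text above describes dividing by the variance, which would be a one-line change. The function names are hypothetical.

```python
import numpy as np


def first_principal_components(x, k):
    """Scores of an (n_samples, n_features) matrix on its first k components."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T


def standardize(x, eps=1e-12):
    """Zero-mean scaling per feature; std is used here (an assumption)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def fuse_spatial_spectral(cube, x_spa, s_e):
    """Concatenate normalized spectral (s_e dims) and spatial (s_a dims) features.

    cube:  (h, w, bands) hyperspectral data.
    x_spa: (h, w, s_a) deep spatial features aligned with the image.
    """
    h, w, bands = cube.shape
    s_a = x_spa.shape[2]
    spectra = cube.reshape(h * w, bands).astype(np.float64)
    x_spe = first_principal_components(spectra, s_e)
    spatial = x_spa.reshape(h * w, s_a).astype(np.float64)
    x_f = np.concatenate([standardize(x_spe), standardize(spatial)], axis=1)
    return x_f.reshape(h, w, s_e + s_a)
```

Each pixel of the result is a vector of length s_e + s_a that can be passed to a classifier such as a linear SVM.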

Results
In this section, we evaluate the performance of the proposed MDSFV by comparing it with some state-of-the-art methods. The experiments were performed on three datasets acquired by two hyperspectral imaging sensors, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) and the Reflective Optics System Imaging Spectrometer (ROSIS-3). We also carried out comparative experiments on different feature joint strategies, to verify the effectiveness of the multiscale skip-layer features, and on different three-channel images, to verify the effectiveness of the virtual RGB image.

Data Description and Experiment Setup
We selected the Indian Pines dataset, the Pavia University dataset and the Kennedy Space Center dataset, which we will cover separately.
The Indian Pines dataset was collected in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) at a test site in northwestern Indiana. The dataset has a spatial dimension of 145×145 pixels with a spatial resolution of 20 m/pixel and a total of 220 reflectance bands covering a wavelength range of 0.4-2.5 µm with a nominal spectral resolution of 10 nm. After removing the bands affected by noise and water absorption ([104-108], [150-163], 220), the remaining 200 bands were used for the experiments. The scene is mainly agriculture and forest, with a small number of buildings. The dataset contains a total of 16 classes. The sample size varies greatly among classes: the smallest class has only 20 samples and the largest has 1428. Training samples were selected in proportion to the number of samples in each class. In this experiment, we selected 10% of each class for training and used the other samples for testing. The image of the first three PCA components, the virtual RGB image and the corresponding label map are shown in Figure 3.
The Pavia University dataset was taken in Pavia, Italy by the Reflective Optics System Imaging Spectrometer (ROSIS-3). The sensor has a total of 115 spectral channels covering a range of 0.43-0.86 µm. After removing noise and water absorption bands, 103 hyperspectral bands remained. The spatial dimension is 610×340, the spatial resolution is 1.3 m/pixel, and a total of 42,776 samples are included, covering nine types of ground objects such as grass, trees and asphalt. Since each class has a large number of labeled samples, we chose 50 samples from each class for training and all remaining samples for testing. In Figure 4, the image of the first three PCA components, the virtual RGB image and the label map of Pavia University are displayed.
The Kennedy Space Center (KSC) dataset was obtained in 1996 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) at the Kennedy Space Center. It has a total of 224 bands covering the wavelength range of 0.4-2.5 µm, and was acquired from a height of approximately 10 km with a spatial resolution of 18 m/pixel. After removing the water absorption and low SNR (signal-to-noise ratio) bands, 176 bands remain for analysis, covering 13 land cover categories. The categories of the dataset are relatively balanced, each with several hundred samples, so we used 20 samples of each class for training. Figure 5 shows the three-channel image obtained by PCA, the virtual RGB image and the label map of KSC.
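The two sampling schemes described above (a fixed ratio per class for Indian Pines, a fixed count per class for Pavia University and KSC) can be sketched as below. The rounding rule and the choice to keep at least one training sample and, where possible, at least one test sample per class are assumptions; the function name is hypothetical.

```python
import numpy as np


def split_per_class(labels, rng, ratio=None, count=None):
    """Randomly pick training indices class by class.

    labels: 1-D array with the class id of every labeled pixel.
    ratio:  fraction of each class used for training, or
    count:  number of samples of each class used for training.
    """
    if (ratio is None) == (count is None):
        raise ValueError("give exactly one of ratio or count")
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = int(round(ratio * idx.size)) if ratio is not None else int(count)
        # Assumed safeguard: at least 1 training sample, and leave 1 for testing.
        k = max(1, min(k, idx.size - 1))
        train.append(rng.choice(idx, size=k, replace=False))
    train = np.sort(np.concatenate(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test
```

For example, `split_per_class(labels, np.random.default_rng(0), ratio=0.1)` corresponds to the 10% setting, and `count=50` or `count=20` to the fixed-count settings; repeating the call with different seeds gives the repeated random runs.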

Our experiments used the fully convolutional network implemented in the Caffe framework to extract spatial features [51], and used LIBSVM (A Library for Support Vector Machines) by Chih-Chung Chang and Chih-Jen Lin [52] and LIBLINEAR (A Library for Large Linear Classification) [53], configured in Matlab, for feature fusion and classification. Under the same conditions, we compared the proposed MDSFV method with a variant that uses only the deepest features, upsampled 32 times and without changing their dimension (DSFV). In addition, four CNN- or FCN-based methods were chosen for comparison: DMS³FE [47] and FEFCN [49], based on FCN, and Two-CNN [50] and CNN-PPF [54], based on CNN. All the above methods were run 10 times with randomly selected training and test samples, and the average accuracies and the corresponding standard deviations are reported.
We selected overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ) to evaluate the performance of these methods. The accuracy of each class shown in the tables is the proportion of correctly classified samples among the test samples of that class.
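For reference, the three measures can be computed from a confusion matrix as in the following sketch, which uses the standard definitions of OA, AA and Cohen's kappa. The function name is hypothetical, and every class is assumed to have at least one test sample.

```python
import numpy as np


def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, kappa, per-class accuracy) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # cm[i, j] counts test samples of true class i predicted as class j.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)     # accuracy of each class
    aa = per_class.mean()                        # average accuracy
    # Expected agreement by chance, from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa, per_class
```

OA weights every test sample equally, whereas AA weights every class equally, so the two can differ noticeably on datasets with unbalanced classes such as Indian Pines.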

Feature Jointing and Fusion Strategies
In this section, we examine the effectiveness of our strategies for feature jointing and fusion, including the depth of the spatial features acquired from the FCN, the way the deeper and shallower features are combined, and the benefit of fusing spectral and spatial features. We selected the Pavia University dataset to compare the results of each strategy.

The Depth of the Spatial Feature
In this section, we show the results of the joint between the different feature layers extracted by the FCN. From the convolution process of the FCN, we know that the downsampling factors of the pool2, pool3, pool4 and fc7 layers relative to the original image are 4, 8, 16 and 32, respectively. Here we compare three levels of feature sets; the features used by each are shown in Table 2. For example, the Two-Layer joint combines the fc7 layer feature with the pool4 layer feature and upsamples the result 16 times to obtain the spatial feature map corresponding to the original image. Shallow features make the classification more detailed and deep features carry more semantic information, so the two must be balanced to obtain accurate and precise classification results. Figure 6 shows the classification results of the three levels. We can see that the Four-Layer method has some subtle misclassifications caused by including the shallow pool2 features. The relative reduction of shallow information in the Three-Layer method removes these misclassifications. When the shallow features are reduced further, as in the Two-Layer method, the feature map is directly upsampled 16 times and is therefore inaccurate within a range of 16 pixels, which causes some other misclassifications. Table 3 lists the accuracy of each class for the three levels of joint, where the advantages of the Three-Layer method can be further observed. Based on these results, and to balance the deep and shallow layers, we use the Three-Layer joint of deep spatial features.

The Combination of Deeper and Shallower Features
In the method section of this paper, we mentioned that the dimension-unified features of different layers are normalized and added. In this part, we compare the results of three combination strategies: directly concatenating the features of different layers (Concat); applying PCA to bring the features to the same dimension and then adding them layer by layer (No Normalization); and additionally normalizing the features after PCA before adding them (Normalization). Table 4 lists the accuracy and time consumption of the three strategies. When the two layers of features are merged by concatenation, the final feature dimension is very large, reaching 4896, which makes the bilinear-interpolation upsampling very time-consuming and requires more than 32 GB of memory. Since its classification result is only equivalent, we decided not to use this combination mode. The main comparison is therefore whether normalization affects the accuracy when the training samples are identical. We can see that the normalized addition is slightly slower than the direct addition without normalization, but gives better classification results. As analyzed in Section 2.2, normalization makes the two layers contribute equally, so we chose normalization followed by addition. Figure 7 shows the results with and without normalization.

Fusion of Spectral and Spatial Feature
We know that spatial and spectral features are both important for the accurate classification of pixels. Spatial features provide information about the neighborhood around a pixel, whereas spectral features provide the discriminative spectral curve that is unique to the pixel. Spatial information is conducive to the continuity of pixel classification, and spectral information is important to the accuracy of classifying a specific pixel, so both are indispensable for accurate classification. This section compares the classification results obtained using only spectral features, using only spatial features, and using the fused spatial and spectral features. Figure 8 shows the results of using the spectral feature or the spatial feature alone, and of the fused feature. When only the spectral feature is used, the continuity of the spatial distribution of ground objects is affected: pixels in the shadow class are classified in a very scattered way. When only the spatial feature is used, the classification result is better within large homogeneous areas, but in slender areas, especially roads, the result is not ideal (the objects are broken into sections). This can be interpreted as the result of the spectral information being missing when only spatial information is used. When spatial and spectral information are applied at the same time, the classification result is better, avoiding both the scattered distribution of pixel classes and the errors on slender objects, and the segmentation is basically close to the ground truth. Table 5 gives the accuracies of each class for the three approaches. The highest accuracy is shown in bold, and it can be seen that the effectiveness of the spatial-spectral fusion is very obvious.

Effectiveness of Virtual RGB Image
In this part, we compare the performance of different three-channel images: applying PCA directly to the hyperspectral data, averaging the bands corresponding to RGB, and combining those bands with Gaussian weights. For PCA, we select the first three principal components as the RGB channel intensities. The other two methods select the bands that cover the same wavelength ranges as RGB and combine them in different ways: one averages the bands (average bands) and the other synthesizes them with Gaussian weights (virtual RGB).
In Table 6 and Figure 9, we can see that the RGB-corresponding bands achieve a better classification result than the PCA method, because images built from these bands are more similar to the natural images on which the model was trained. Since the Gaussian weighting performs better of the two, we selected the virtual RGB images for feature extraction.
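The difference between the average-bands image and the virtual RGB image can be illustrated with a short sketch. The channel centers (650, 550 and 450 nm), the half-width of the wavelength window, the Gaussian spread and the synthetic sensor below are hypothetical values chosen only for the example; they are not the parameters used in this paper.

```python
import numpy as np


def band_weights(wavelengths, center, half_width, kind):
    """Weights over the bands whose centers lie within center +/- half_width (nm)."""
    in_range = np.abs(wavelengths - center) <= half_width
    if kind == "average":
        w = in_range.astype(float)
    elif kind == "gaussian":
        sigma = half_width / 2.0  # assumed spread, not taken from the paper
        w = np.exp(-0.5 * ((wavelengths - center) / sigma) ** 2) * in_range
    else:
        raise ValueError(f"unknown kind: {kind}")
    # Assumes at least one band falls inside the window.
    return w / w.sum()


def three_channel_image(cube, wavelengths, kind="gaussian",
                        centers=(650.0, 550.0, 450.0), half_width=50.0):
    """Collapse an (H, W, B) cube to (H, W, 3) using per-channel band weights."""
    channels = [cube @ band_weights(wavelengths, c, half_width, kind)
                for c in centers]
    return np.stack(channels, axis=-1)


# Hypothetical sensor: 100 bands evenly spaced from 400 nm to 1000 nm.
wavelengths = np.linspace(400.0, 1000.0, 100)
cube = np.random.default_rng(0).random((8, 8, 100))

virtual_rgb = three_channel_image(cube, wavelengths, kind="gaussian")
average_rgb = three_channel_image(cube, wavelengths, kind="average")
print(virtual_rgb.shape, average_rgb.shape)
```

Both variants use the same bands; the Gaussian version only changes how much each band contributes, giving the bands near the channel center the largest weight.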


Classification Performance
In this section, we compare the proposed MDSFV method with the DSFV baseline (spatial feature extraction without skip-layer fusion of multiscale features) and with four other state-of-the-art methods. The classification maps and the detailed average accuracy and variance are shown in Figures 10-12 and Tables 7-9. Figure 10 and Table 7 show the classification performance of the six methods on Indian Pines; the experiments were repeated ten times. Figure 10 shows the classification result for each pixel: the MDSFV result is closest to the ground truth, while the other methods mis-segment some regional edges to varying degrees. Table 7 shows the average accuracy and standard deviation per class. Our method is clearly better on most per-class accuracies. Because CNN PPF selects three different spectral curves in one class and pairs them with other classes, and the ninth class should otherwise have only 2 training samples, we set the number of training samples of the ninth class to 3. When comparing with the Two CNN method, we did not train and migrate the model on similar datasets. Therefore, the accuracy of this method is significantly lower than that of the other methods, which also confirms that training deep networks on hyperspectral data requires a data foundation: the impact of a reduced sample size on the accuracy is enormous. Overall, our method achieved an overall accuracy of 98.78%; DSFV refers to the classification result without skip-layer feature fusion. It can be seen that using multiscale features improves the recognition accuracy, and the comparison shows that our method is superior to the other deep learning methods.
Figure 11 and Table 8 show the classification results of the various methods on the Pavia University dataset. From Figure 11, it can be seen that, except for the MDSFV method proposed in this paper, the other methods have serious segmentation errors in the meadows.
It should be noted that the result of upsampling only the deepest feature (DSFV) is inferior to that of DMS 3 FE proposed in [47]. This may be because the upsampling factor of the spatial feature is 32, which means that the feature is inaccurate within a range of 32 pixels. In general, however, our method compares well in accuracy with the existing deep learning methods. Table 8 shows the per-class accuracies. In the classes with many samples, MDSFV performs significantly better than the other methods. In terms of overall accuracy, it exceeds DMS 3 FE.
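The effect of a 32-fold resolution gap can be made concrete with a toy example. The sketch below uses non-overlapping block averaging as a stand-in for the network's downsampling, which is an assumption made purely for illustration, and applies it to a hypothetical two-pixel-wide structure such as a road.

```python
import numpy as np


def block_mean(image, stride):
    """Average an (H, W) image over non-overlapping stride x stride blocks."""
    h, w = image.shape
    return image.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))


# Hypothetical 64 x 64 scene containing a two-pixel-wide vertical structure.
mask = np.zeros((64, 64))
mask[:, 30:32] = 1.0

coarse = block_mean(mask, 32)  # one cell per 32 x 32 pixels
fine = block_mean(mask, 4)     # one cell per 4 x 4 pixels

print(coarse.shape, coarse.max())  # structure fills 2/32 of a coarse cell
print(fine.shape, fine.max())      # structure fills 2/4 of a fine cell
```

At a factor of 32, each cell summarizes 1024 pixels and the thin structure accounts for only a small fraction of it, so upsampling cannot recover its exact position; at a factor of 4, it still occupies half of the cells it crosses. This is consistent with the motivation for adding shallower features through skip-layer fusion.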
Figure 12 and Table 9 show the results of the compared algorithms on the Kennedy Space Center dataset. The CNN PPF introduced in [54] did not converge on this dataset, even though we tried many remedies, including changing the number of training samples. This may be caused by the strong correlation between the spectral curves of the different object types in the dataset, which results in poor separability, so that accurate classification is not possible after the pairing algorithm. In Figure 12, before the skip-layer multiscale spatial features were introduced, many regions were mis-segmented. After they were introduced, these misclassifications vanished and scattered errors were reduced. The per-class accuracies are shown in Table 9; the overall accuracy is 0.8% higher than that of the other state-of-the-art methods.
Based on the above results, the proposed skip-layer fusion of multiscale spatial features is very effective for accurate classification. Unifying the dimensions by PCA and normalizing before combining the features effectively reduces the dimension of the combined features and improves their separability. The virtual RGB image better fits the imaging conditions of natural images and is more conducive to extracting spatial features with a model trained on such images. Compared with the simple and crude use of three principal components, it can effectively express the pixel-spatial relationship in hyperspectral images.
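For reference, the quantities reported in Tables 7-9 (per-class accuracy, overall accuracy, and their mean and spread over repeated runs) can be computed from predicted and true labels as sketched below. The function names are illustrative, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts test pixels of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_accuracy(cm):
    """Fraction of each true class labeled correctly (needs non-empty classes)."""
    return np.diag(cm) / cm.sum(axis=1)


def overall_accuracy(cm):
    """Fraction of all test pixels labeled correctly."""
    return np.trace(cm) / cm.sum()


def summarize_runs(run_accuracies):
    """Mean and population standard deviation over repeated runs (rows)."""
    a = np.asarray(run_accuracies, dtype=float)
    return a.mean(axis=0), a.std(axis=0)


# Tiny made-up example with three classes and six test pixels.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
cm = confusion_matrix(y_true, y_pred, 3)
print(per_class_accuracy(cm), overall_accuracy(cm))

# Two hypothetical runs, two classes each.
mean_acc, std_acc = summarize_runs([[0.9, 1.0], [0.7, 1.0]])
print(mean_acc, std_acc)
```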

Discussion
In this section, we carry out a parameter analysis of the sample size and of the dimension of the fused spatial-spectral features. For the sample size, we conducted experiments on the three datasets and selected the number of samples according to the actual situation of each dataset. For the feature dimension, we experimented on the three datasets and found that the optimal spatial and spectral feature dimensions are basically the same for each dataset, so the results are integrated and the choice of spatial and spectral feature dimensions is unified across datasets.

Sample numbers
The number of training samples affects the accuracy of classification: the more training samples, the higher the accuracy. However, because labeled samples are limited and the time cost of training the classifier must be considered, the number of samples should be kept moderate.
Since the Indian Pines dataset is very unbalanced between classes and the smallest class has only 20 samples, we selected the training samples proportionally. The blue line in Figure 13 shows how the accuracy varies with the proportion of samples on Indian Pines. The overall accuracy rises as the number of samples increases. However, once the sample ratio reaches 10%, the overall accuracy reaches about 98% and no longer increases with more samples. We therefore set the final sample size to 10% of each class. The Pavia University dataset has nine classes with a relatively large number of samples per class, so we select a fixed number of samples per class. The red line in Figure 13 shows how the overall accuracy varies with the number of training samples per class on the Pavia University dataset. When the number of samples per class is 50, the classification accuracy reaches a level close to 98% and does not change significantly as the number of samples increases, so the number of samples is finally set to 50 per class.
For the KSC dataset, the number of samples per class is small and relatively balanced, so we again select a fixed number of samples per class; the green line in Figure 13 shows how the overall accuracy varies with it. The classification accuracy initially improves markedly with the number of training samples. Beyond 20 samples per class, the accuracy increases only slowly, so the final sample size is 20 per class.
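The two selection schemes described above, a proportion of each class (Indian Pines) and a fixed number per class (Pavia University and KSC), can be sketched as follows. The function name, the `min_per_class` safeguard and the synthetic label vector are assumptions made for the example, and `labels` is assumed to contain labeled pixels only.

```python
import numpy as np


def sample_training_indices(labels, rng, per_class=None, fraction=None,
                            min_per_class=1):
    """Pick training indices class by class, by fixed count or by proportion."""
    if (per_class is None) == (fraction is None):
        raise ValueError("give exactly one of per_class or fraction")
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if per_class is not None:
            n = per_class
        else:
            n = max(min_per_class, int(round(fraction * idx.size)))
        n = min(n, idx.size)  # never request more samples than the class has
        chosen.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(chosen)


rng = np.random.default_rng(0)
# Hypothetical unbalanced dataset: class sizes 200, 20 and 1000.
labels = np.repeat([0, 1, 2], [200, 20, 1000])

prop = sample_training_indices(labels, rng, fraction=0.10)
fixed = sample_training_indices(labels, rng, per_class=50)
print(np.bincount(labels[prop]), np.bincount(labels[fixed]))
```

With a 10% proportion, the 20-sample class contributes only 2 training samples, which shows why very small classes may need a manually raised count; with a fixed count of 50, the same class is capped at its 20 available samples.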

Dimension of features
In the process of spectral-spatial feature fusion, the spectral features are obtained by applying PCA to the original spectra, and the spatial features are obtained by applying PCA to the spatial features and keeping the first few principal components. The dimensions of both can be changed, i.e., s_a and s_e in Equation (7) are adjustable. We performed the experiments on the three datasets and combined their results to obtain the final spatial and spectral feature dimensions. The results on the datasets are shown in Figure 14. ... range, the classification accuracy does not change significantly. This shows that parameters within this range are acceptable and that our spatial and spectral feature dimensions are robust to some extent.
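The role of the two dimensions can be illustrated with a minimal sketch. It assumes that the fusion is a concatenation of the leading principal components of each feature type; Equation (7) may define the fusion differently. The names `n_spectral` and `n_spatial` are placeholders for the two adjustable dimensions, and all array sizes are hypothetical.

```python
import numpy as np


def pca_reduce(x, n_components):
    """Keep the first n_components principal components of (n_pixels, n_dims) data."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def fuse(spectra, spatial, n_spectral, n_spatial):
    """Assumed fusion: concatenate the reduced spectral and spatial features."""
    return np.concatenate(
        [pca_reduce(spectra, n_spectral), pca_reduce(spatial, n_spatial)],
        axis=1,
    )


rng = np.random.default_rng(0)
spectra = rng.normal(size=(300, 200))  # 300 pixels, 200 spectral bands
spatial = rng.normal(size=(300, 64))   # 300 pixels, 64 spatial feature channels

# Hypothetical dimension choice; a sweep over both values would repeat this call.
fused = fuse(spectra, spatial, n_spectral=10, n_spatial=15)
print(fused.shape)
```

A parameter study of the kind summarized in Figure 14 would repeat this step over a grid of the two dimensions and train the classifier on each fused feature set.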
In this section, we discussed how the accuracy varies with the number of training samples, and selected the relevant feature-dimension parameters through the designed experiments.

Conclusions
We propose a hyperspectral classification method based on multiscale spatial feature fusion. We introduce a new three-channel image combination method to obtain virtual RGB images, in which the corresponding hyperspectral bands are synthesized by simulating the RGB imaging mechanism of natural images. The image is fed to a network trained on natural images to extract multiscale and multi-level spatial features; because it better fits the model parameters trained on natural images, more effective spatial features are obtained. By combining the multiscale spatial features, the semantic information of the deep features ensures the accuracy of the classification of ground objects, while the edge detail of the shallow features ensures the regularity and continuity of the classification along object edges. The proposed skip-layer feature combination method avoids the problems caused by traditional concatenation: the growth of the feature dimension, the long classification time and the reduced separability. Experiments showed that our method performs well compared with previous deep learning methods and achieves a higher classification accuracy. In future work, we will further study the time performance and complexity of the algorithm. In addition, the virtual RGB image provides a new solution for all algorithms that involve synthesizing three-channel images. It avoids PCA's simple and crude reduction of the data and adapts better to deep networks trained on natural images, bridging the gap between hyperspectral data and natural images.