1. Introduction
Hyperspectral imaging technology can simultaneously obtain 3D spatial and spectral information of land features. Thus, it has a prominent advantage in the fine-grained land cover classification of remote sensing images and has been widely used in agriculture, forestry, military, mineral recognition, and marine research [1,2,3,4,5]. The semantic segmentation of hyperspectral remote sensing images (HSRSI) faces several technical challenges, such as a complex data structure, massive computation, and high information redundancy [6,7]. Traditional machine learning classification methods, which rely on manually designed features, can no longer meet the needs of hyperspectral data [8]. Therefore, there is an urgent need for an efficient and intelligent classification technique for HSRSI.
With the rapid development of deep learning, convolutional neural networks (CNN) have been widely used in many fields, including image classification, semantic segmentation, and video understanding [9,10,11,12,13,14], and have become a research hotspot in the land cover classification of HSRSI. In 2015, Hu et al. constructed a 1D-CNN model with a “convolutional-pooling-fully connected” structure to extract the spectral information of HSRSI and obtained higher classification accuracy than the support vector machine (SVM) and deep neural networks (DNN) [15,16,17]. However, due to the phenomena of “different objects with the same spectrum” and “different spectra for the same object”, extracting only spectral information limits the performance of the CNN classifier. Meanwhile, several studies have adopted spatial feature extraction methods from computer vision, constructing a 2D-CNN based on 2D convolution to extract the spatial information of HSRSI [18]. However, the curse of dimensionality caused by the small sample sizes and high dimensions of HSRSI limits the performance of the 2D-CNN classifier [6]. To solve this problem, principal component analysis (PCA) is usually used to reduce the dimension and thereby improve the classification accuracy [19,20,21,22,23,24]. However, neither the 1D-CNN nor the 2D-CNN makes full use of the 3D information of HSRSI. Therefore, using a CNN classifier to extract joint spatial and spectral features simultaneously has become the mainstream research direction. Currently, two methods are often used to extract these joint features: one is to use a 3D-CNN based on 3D convolution to directly extract the spatial and spectral features of the hyperspectral images [25,26,27,28,29,30]; the other is to combine the 1D-CNN, 2D-CNN, and 3D-CNN in different ways to develop models for this purpose [31,32,33]. The CNN models constructed with these two methods perform better in the classification of HSRSI than CNN models that extract features of only a single dimension. CNN models based on a “convolutional-pooling-fully connected” structure have made positive progress in the classification of HSRSI, but there are still issues that need to be further explored.
Firstly, to make full use of the annotation information in small-sample hyperspectral datasets, most researchers use a sliding window with a stride of 1 to segment the images into patches and feed them into the model. However, Nalepa et al. experimentally verified that partitioning the dataset in this way leads to information leakage between the training set and the test set, resulting in overly optimistic classification results. Therefore, Nalepa et al. proposed a dataset partition method based on random patches. In this method, multiple patches of size m × n are randomly extracted from the image five times to be used as training data, and the rest of the image is used as test data, which effectively avoids information leakage [34]. Zou et al. used an n × n sliding window with a stride of n for non-overlapping dataset partitioning and divided the dataset into a training set, a test set, and unlabeled patches, which is simple to implement and also avoids information leakage [35]. Qu et al. proposed a dataset partition method that divides the dataset into non-overlapping training, leakage, validation, and test areas. The model performance is evaluated through the training and test areas, and the severity of information leakage is evaluated through the leakage and test areas [36]. Although the above-mentioned studies solved the problem of information leakage, some problems remain unresolved, such as training sets that do not include all land cover classes, a lack of randomness in the data distribution, and data redundancy. In addition, these studies ensure the labeling quality of the data by discarding the unlabeled background pixels. However, the interference of the background cannot be avoided in practical applications.
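The difference between a stride of 1 and a stride of n can be made concrete by counting how many windows cover each pixel. The following is a minimal sketch, not code from any of the cited studies; the function names and the 8 × 8 image size are illustrative assumptions. With a stride of 1, the same pixel ends up in many patches, so patches assigned to the training set and the test set can share pixels; with a stride equal to the window size, every pixel belongs to at most one patch.

```python
import numpy as np

def window_origins(height, width, n, stride):
    """Top-left corners of all n x n windows that fit fully inside the image."""
    rows = range(0, height - n + 1, stride)
    cols = range(0, width - n + 1, stride)
    return [(r, c) for r in rows for c in cols]

def pixel_coverage(height, width, n, stride):
    """Count, for every pixel, how many windows contain it."""
    counts = np.zeros((height, width), dtype=int)
    for r, c in window_origins(height, width, n, stride):
        counts[r:r + n, c:c + n] += 1
    return counts

# Hypothetical 8 x 8 image with 4 x 4 windows.
overlapping = pixel_coverage(8, 8, n=4, stride=1)  # stride of 1
disjoint = pixel_coverage(8, 8, n=4, stride=4)     # stride equal to the window size

print(overlapping.max())  # 16: a central pixel is shared by 16 patches
print(disjoint.max())     # 1: every pixel belongs to at most one patch
```

A coverage count greater than 1 is exactly the condition under which a training patch and a test patch can contain the same pixel.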
Secondly, the sample sizes of HSRSI are small, and it is difficult for CNN classification models based on a “convolutional-pooling-fully connected” structure to fully utilize the annotation information [35]. To improve the utilization of annotation information, Long et al. proposed fully convolutional networks (FCN) [10] for semantic segmentation by replacing the fully connected layers in the VGG-16 [9] network with convolution layers and using transposed convolution to restore the image resolution, which successfully extended CNN classification from image-wise to pixel-wise. Zou et al. proposed the SS3FCN network and applied the FCN to the classification of HSRSI for the first time [35]. Qu et al. proposed the TAP-Net network, which uses three attention mechanisms and four parallel subnetworks to enhance the feature extraction capacity for HSRSI [36]. Although the above-mentioned models achieved good classification accuracy, their algorithm structures do not give due consideration to the small sample sizes and high dimensions of HSRSI. The UNet model proposed by Ronneberger et al. has achieved excellent results in the semantic segmentation of medical images, which also have small sample sizes, and of high-resolution remote sensing images [12,37,38,39,40]. The 3D-UNet network proposed by Çiçek et al. has been successfully applied to the semantic segmentation of high-dimensional 3D medical images [41]. However, UNet-based approaches are rarely used in the semantic segmentation of HSRSI. Moreover, the algorithm structure of UNet-based approaches still has room for improvement.
Finally, the small sample sizes and high dimensions of HSRSI lead to the Hughes phenomenon [42]. Most researchers use PCA to reduce the dimensions of HSRSI to avoid the curse of dimensionality. However, there is no principled method for choosing the number of principal components to retain after dimensionality reduction. Some researchers selected three principal components by analogy with RGB images [20,21], while others chose the number of principal components by experience [19,22,32]. The above-mentioned studies have all avoided overfitting caused by small sample sizes and high dimensions, but the choice of dimension is subjective and offers little guidance for future research. Xu et al. analyzed the classification accuracy of HSRSI with its dimensionality reduced to one to eight principal components [24]. However, considering only the first few principal components is not comprehensive enough for HSRSI with hundreds of bands. Therefore, it is necessary to further analyze how the land cover classification accuracy of HSRSI changes from a low dimension to a higher dimension.
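One way to replace a hand-picked number of components with a reproducible criterion is to keep the smallest number of principal components whose cumulative share of the total variance reaches a chosen threshold. The sketch below illustrates that idea with NumPy only; the function name `components_for_cvcr`, the synthetic data, and the three thresholds are assumptions made for illustration and are not taken from the experiments of this paper.

```python
import numpy as np

def components_for_cvcr(pixels, target):
    """Smallest number of principal components whose cumulative variance
    share reaches `target`, plus the full cumulative curve.

    pixels: array of shape (num_pixels, num_bands), one spectrum per row.
    """
    centered = pixels - pixels.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, eigvals.size), cumulative

# Synthetic stand-in for a flattened hyperspectral cube (an assumption):
# 500 pixels, 40 bands, driven by 3 latent factors plus weak noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 40))
cube_pixels = latent @ mixing + 0.01 * rng.normal(size=(500, 40))

for target in (0.90, 0.99, 0.9999):
    k, _ = components_for_cvcr(cube_pixels, target)
    print(f"cumulative variance share >= {target:.4f}: keep {k} components")
```

Sweeping the threshold instead of the component count gives a dimension axis that is comparable across datasets with different numbers of bands.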
In summary, the current HSRSI semantic segmentation faces the following three challenges:
- Although existing dataset partition methods avoid problems of information leakage, they still suffer from two inadequacies: not including all land cover classes in the training set and discarding the unlabeled background pixels. 
- The UNet-based approaches for semantic segmentation of HSRSI, most of which directly employ the standard UNet [43,44], are not optimized for the characteristics of the HSRSI and still have room for improvement. 
- The PCA can overcome the impact of the curse of dimensionality on segmentation accuracy, but researchers tend to subjectively choose the number of dimensions and cannot provide a reference for future research. 
To overcome the above challenges, firstly, this paper introduces the patch allocation scheme based on the non-overlapping sliding window strategy commonly used in computer vision into the semantic segmentation of HSRSI, and combines it with a judgment mechanism to make up for the disadvantage that not all classes may be included in the training set after the patches are randomly allocated. Secondly, this paper proposes a new PSE-UNet model for semantic segmentation of HSRSI. Compared with the method of directly using the standard UNet [43,44], PSE-UNet considers the characteristics of HSRSI, combines UNet with PCA and the attention mechanism, reduces the performance loss caused by the curse of dimensionality, and enhances the expression of spectral information. In addition, considering the small number of HSRSI samples, the influence of the number of downsampling steps, different downsampling and upsampling methods, and different activation functions on segmentation performance is discussed, and the most appropriate PSE-UNet variant is determined. Finally, the cumulative variance contribution rate (CVCR) is introduced as the dimensionality reduction index to study the Hughes phenomenon and comprehensively analyze how the land cover classification accuracy of HSRSI changes from a low dimension to a higher dimension. The main contributions of this paper can be summarized as follows:
- The non-overlapping sliding window method combined with the judgment mechanism can effectively avoid information leakage, overcome the shortcomings of existing dataset partition methods, and provide a fair comparison between models. 
- The proposed PSE-UNet is based on the “encoder-decoder” structure, considers the small sample sizes and high dimensions of the HSRSI, and improves the HSRSI semantic segmentation accuracy. 
- The Hughes phenomenon in HSRSI semantic segmentation is comprehensively analyzed, which can provide a reference for determining the dimension of an HSRSI dataset. 
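The first contribution, random allocation of non-overlapping patches followed by a check that the training set covers every class, can be sketched as follows. This section does not specify the exact form of the judgment mechanism, so the version below simply redraws the random allocation until the check passes; that redraw rule, the function name `allocate_patches`, the convention that label 0 marks background, and the toy label map are all assumptions for illustration.

```python
import numpy as np

def allocate_patches(patch_labels, train_fraction, num_classes, rng, max_tries=1000):
    """Randomly split patch indices into train/test, redrawing until the
    training patches contain every class 1..num_classes (0 = background)."""
    required = set(range(1, num_classes + 1))
    n_train = max(1, int(round(train_fraction * len(patch_labels))))
    for _ in range(max_tries):
        order = rng.permutation(len(patch_labels))
        train_idx = order[:n_train]
        present = set()
        for i in train_idx:
            present.update(np.unique(patch_labels[i]).tolist())
        if required <= present:  # the check: all classes are covered
            return sorted(train_idx.tolist()), sorted(order[n_train:].tolist())
    raise RuntimeError("no allocation covering all classes was found")

# Toy 12 x 12 label map with three horizontal class bands (hypothetical data).
label_map = np.zeros((12, 12), dtype=int)
label_map[0:4], label_map[4:8], label_map[8:12] = 1, 2, 3

# Non-overlapping 4 x 4 patches: window size and stride are both 4.
patches = [label_map[r:r + 4, c:c + 4]
           for r in range(0, 12, 4) for c in range(0, 12, 4)]

train_ids, test_ids = allocate_patches(patches, 0.5, 3, np.random.default_rng(0))
print("train patches:", train_ids)
print("test patches:", test_ids)
```

Because the patches never overlap, the two index lists share no pixels, and the coverage check ensures that no class is seen only at test time.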
4. Conclusions
Currently, finding efficient and intelligent methods for the classification of HSRSI is one of the research focuses in remote sensing. The research on semantic segmentation of HSRSI is not deep enough and there is still much room for improvement in the algorithm structure. Therefore, considering the successful application of the UNet algorithm in the classification of 3D medical images, this paper improves the dataset partitioning strategy in the classification of HSRSI based on the non-overlapping sliding window strategy. This paper introduces the CVCR as the standard for PCA dimensionality reduction and discusses how classification accuracy of HSRSI changes with different dimensions. The symmetrical structure of “encoder-decoder” is introduced into the classification of the HSRSI, based on which a new semantic segmentation algorithm PSE-UNet is proposed for classification. In addition, the effects of downsampling times, different downsampling and upsampling methods, and different activation functions on the performance of the proposed PSE-UNet model are discussed. Experiments are carried out based on the Salinas dataset, and the results show that:
- Based on the non-overlapping sliding window strategy, the judgment mechanism is introduced to improve the patch allocation scheme, which can overcome the disadvantage that not all classes can be included in the training set after the patches are randomly allocated, effectively avoiding information leakage; 
- When different cumulative contribution rates are selected for dimensionality reduction with PCA, the segmentation accuracy first increases and then decreases as the dimension of the dataset used in the experiments increases. The segmentation results are the best when the CVCR is 99.99%, indicating that choosing an appropriate dimension can effectively weaken the influence of the Hughes phenomenon on the classification accuracy of HSRSI; 
- The segmentation performance of the PSE-UNet algorithm is better than that of the four other popular semantic segmentation algorithms, showing better segmentation accuracy and visualization effect, and less misclassification of land cover classes. Downsampling twice, using convolution and transposed convolution for downsampling and upsampling, respectively, and using PReLU as the activation function can effectively improve the segmentation accuracy of the PSE-UNet algorithm in semantic segmentation of the Salinas dataset. 
In the semantic segmentation experiments with the Salinas dataset, the approach proposed in this paper shows excellent segmentation performance and can be applied to other semantic segmentation tasks of HSRSI. Different from some existing studies, the dataset partitioning strategy used in this paper retains the background pixels, which is more in line with the actual application scenarios. The comprehensive study of the Hughes phenomenon in this paper can provide a reference for the determination of the dimension of the dataset. The proposed PSE-UNet model considers the characteristics of small sample sizes and multiple dimensions of the HSRSI. The symmetrical structure of “encoder-decoder” and the channel attention mechanism adopted in the proposed model have significant application potential in the semantic segmentation of HSRSI. However, the proposed model still has some problems which need to be further studied in the future, such as low segmentation accuracy of low-frequency land cover features, parameter redundancy, and unvalidated generalization ability.