Semantic Segmentation of Hyperspectral Remote Sensing Images Based on PSE-UNet Model

With the development of deep learning, the use of convolutional neural networks (CNN) to improve the land cover classification accuracy of hyperspectral remote sensing images (HSRSI) has become a research hotspot. In HSRSI semantic segmentation, the traditional dataset partition method may cause information leakage, which poses challenges for a fair comparison between models. The performance of models based on the "convolutional-pooling-fully connected" structure is limited by the small sample sizes and high dimensions of HSRSI. Moreover, most current studies do not address how to choose the number of principal components when applying principal component analysis (PCA) to reduce dimensionality. To overcome the above challenges, firstly, the non-overlapping sliding window strategy combined with a judgment mechanism is introduced to split the hyperspectral dataset. Then, a PSE-UNet model for HSRSI semantic segmentation is designed by combining PCA, the attention mechanism, and UNet, and the factors affecting the performance of PSE-UNet are analyzed. Finally, the cumulative variance contribution rate (CVCR) is introduced as a dimensionality reduction metric of PCA to study the Hughes phenomenon. The experimental results on the Salinas dataset show that PSE-UNet is superior to other semantic segmentation algorithms, and the results can provide a reference for HSRSI semantic segmentation.


Introduction
Hyperspectral imaging technology can simultaneously obtain 3D spatial and spectral information of land features. Thus, it has a prominent advantage in the fine-grained land cover classification of remote sensing images and has been widely used in agriculture, forestry, military, mineral recognition, and marine research [1][2][3][4][5]. The semantic segmentation of hyperspectral remote sensing images (HSRSI) faces several technical challenges, such as a complex data structure, massive computation, and high information redundancy [6,7]. Traditional machine learning classification methods, which rely on manually designed features, can no longer meet the needs of hyperspectral data [8]. Therefore, there is an urgent need for an efficient and intelligent classification technique for HSRSI.
With the rapid development of deep learning technology, convolutional neural networks (CNN) have been widely used in many fields, including image classification, semantic segmentation, and video understanding [9][10][11][12][13][14], and have become a research hotspot in the land cover classification of HSRSI. In 2015, Hu et al. constructed a 1D-CNN model with the "convolutional-pooling-fully connected" structure to extract the spectral information of HSRSI and obtained higher classification accuracy than the support vector machine (SVM) and deep neural networks (DNN) [15][16][17]. However, due to the phenomena of "different objects with the same spectrum" and "different spectra for the same object", extracting only spectral information limits the performance of such models. Moreover, UNet-based approaches are rarely used in the semantic segmentation of HSRSI, and their algorithm structure still has room for improvement.
Finally, the small sample sizes and high dimensions of HSRSI lead to the Hughes phenomenon [42]. Most researchers use PCA to reduce the dimensions of HSRSI to avoid the curse of dimensionality. However, there is no scientific method to define the number of principal components after dimensionality reduction. Some researchers selected three principal components by analogy with RGB images [20,21], while others defined the number of principal components by experience [19,22,32]. The above studies all avoided the overfitting caused by small sample sizes and high dimensions, but such dimensionality reduction is subjective and cannot provide a reference for future research. Xu et al. analyzed the classification accuracy of HSRSI with its dimensionality reduced to one to eight principal components [24]. However, only the first few principal components are not comprehensive enough for HSRSI with hundreds of bands. Therefore, it is necessary to further analyze how the land cover classification accuracy of HSRSI changes from low to higher dimensions.
In summary, current HSRSI semantic segmentation faces the following three challenges:

• Although existing dataset partition methods avoid information leakage, they still suffer from two inadequacies: not including all land cover classes in the training set and discarding the unlabeled background pixels.

• The UNet-based approaches for semantic segmentation of HSRSI, mostly directly employing the standard UNet [43,44], are not optimized for the characteristics of HSRSI and still have room for improvement.

• PCA can overcome the impact of the curse of dimensionality on segmentation accuracy, but researchers tend to choose the number of dimensions subjectively, which cannot provide a reference for future research.
In order to overcome the above challenges, firstly, this paper introduces the patch allocation scheme based on the non-overlapping sliding window strategy commonly used in computer vision into the semantic segmentation of HSRSI, combined with a judgment mechanism to compensate for the disadvantage that not all classes may be included in the training set after the patches are randomly allocated. Secondly, this paper proposes a new PSE-UNet model for semantic segmentation of HSRSI. Compared with methods that directly use the standard UNet [43,44], PSE-UNet considers the characteristics of HSRSI, combines UNet with PCA and the attention mechanism, reduces the performance loss caused by the curse of dimensionality, and enhances the expression of spectral information. In addition, considering the small sample sizes of HSRSI, the influence of the number of downsampling operations, different downsampling and upsampling methods, and different activation functions on segmentation performance is discussed, and the most appropriate PSE-UNet variant is determined. Finally, the cumulative variance contribution rate (CVCR) is introduced as the dimensionality reduction index to study the Hughes phenomenon and comprehensively analyze how the land cover classification accuracy of HSRSI changes from low to higher dimensions. The main contributions of this paper can be summarized as follows:

• The non-overlapping sliding window method combined with the judgment mechanism can effectively avoid information leakage, overcome the shortcomings of existing dataset partition methods, and provide a fair comparison between models.

• The proposed PSE-UNet is based on the "encoder-decoder" structure, accounts for the small sample sizes and high dimensions of HSRSI, and improves HSRSI semantic segmentation accuracy.

• The Hughes phenomenon in HSRSI semantic segmentation is comprehensively analyzed, which can provide a reference for determining the dimension of HSRSI datasets.

Research Methodology
The overall framework contains four steps: dataset partitioning, model training, model performance evaluation, and segmentation map prediction, as shown in Figure 1. First, the dataset is randomly divided into a training set, a validation set, and a testing set using the non-overlapping sliding window strategy combined with a judgment mechanism. Then, the PSE-UNet model is trained and evaluated. Finally, the best model is used to predict the segmentation map.

Dataset Partition Method
In order to solve the problems of existing dataset partition methods, this paper introduces the patch allocation scheme based on the non-overlapping sliding window strategy commonly used in computer vision into the semantic segmentation of HSRSI, and combines it with a judgment mechanism to compensate for the disadvantage that not all classes may be included in the training set after the patches are randomly allocated. Considering actual applications, the background pixels are retained while information leakage is effectively avoided. The method can be used to fairly compare the segmentation performance of different models. The basic idea for dataset partitioning is shown in Figure 2, and the specific steps are as follows, with an illustrative sketch after the steps:
Step 1: The non-overlapping sliding window strategy is used to cut the hyperspectral remote sensing dataset into patches of n × n in size.
Step 2: The patches are randomly assigned to the training, validation, and test sets according to the common allocation ratio of 6:2:2 for small-scale datasets.
Step 3: Repeat Step 2 until each set contains all land cover classes.
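A minimal NumPy sketch of the three steps is given below. It assumes the patches tile the labeled scene exactly and that label 0 marks the unlabeled background; the function and variable names are illustrative, not the authors' released code.

```python
import numpy as np

def partition(image, labels, n=32, seed=None):
    """Steps 1-3: tile, randomly split 6:2:2, re-draw until all classes are covered."""
    rng = np.random.default_rng(seed)
    h, w = labels.shape
    # Step 1: non-overlapping n x n sliding window over the scene
    tiles = [(image[i:i + n, j:j + n], labels[i:i + n, j:j + n])
             for i in range(0, h - n + 1, n)
             for j in range(0, w - n + 1, n)]
    # label 0 is assumed to be the unlabeled background: it is retained in the
    # patches but not required by the judgment mechanism
    required = set(np.unique(labels)) - {0}
    while True:
        # Step 2: random assignment at the common 6:2:2 ratio
        idx = rng.permutation(len(tiles))
        train, val, test = np.split(idx, [int(0.6 * len(tiles)), int(0.8 * len(tiles))])
        # Step 3 (judgment mechanism): accept only if every set covers all classes
        covers = lambda s: required <= set(np.unique([tiles[k][1] for k in s]))
        if covers(train) and covers(val) and covers(test):
            return ([tiles[k] for k in train],
                    [tiles[k] for k in val],
                    [tiles[k] for k in test])
```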



PSE-UNet Model
This paper proposes a PSE-UNet model based on the "encoder-decoder" structure for semantic segmentation of HSRSI, as shown in Figure 3. The PSE-UNet model is composed of a PCA module, C-SE modules, skip connections, and Softmax. The C-SE module is the basic unit of the PSE-UNet model for extracting features. Considering the high dimensions of HSRSI, PCA and the channel attention mechanism are adopted in PSE-UNet: PCA is used to avoid the curse of dimensionality, and the channel attention mechanism is used to learn the interdependence between feature channels. Considering the small sample sizes of HSRSI, the standard UNet can easily cause overfitting. Therefore, in PSE-UNet, the number of channels in each stage is reduced, and appropriate downsampling times, downsampling and upsampling methods, and activation functions are selected.

The PCA is used for dimensionality reduction of the HSRSI before it is input into the encoder, and the CVCR is selected as the dimensionality reduction standard to analyze the impact of different CVCRs on segmentation accuracy. The encoder consists of two C-SE modules and two downsampling units; the decoder consists of two C-SE modules and two upsampling units. The skip connections combine shallow and deep features to avoid the loss of spatial information caused by downsampling. Each convolution operation in the model is followed by a batch normalization (BN) [45] module, and the parametric rectified linear unit (PReLU) [46] is used as the activation function. Finally, Softmax is used for pixel-wise classification, and its output is the final segmentation result of the land cover classes.
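A minimal Keras sketch of this topology is shown below. The filter counts, the presence of a bottleneck C-SE module between encoder and decoder, and the SE reduction ratio are assumptions made for illustration; the paper specifies only the module types, the two down/upsampling stages, and reduced per-stage channel counts (about 4.5 M parameters in total).

```python
from tensorflow import keras
from tensorflow.keras import layers

def cse_block(x, filters):
    """C-SE module: Conv -> BN -> PReLU -> SE channel attention (see next subsection)."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    s = layers.GlobalAveragePooling2D()(x)                  # squeeze
    s = layers.Dense(filters // 4, activation="relu")(s)    # excitation (ratio assumed)
    s = layers.Dense(filters, activation="sigmoid")(s)
    return layers.Multiply()([x, layers.Reshape((1, 1, filters))(s)])

def pse_unet(input_shape=(32, 32, 31), num_classes=17, f=64):
    """PCA is applied offline, so the input is the PCA-reduced cube (31D for Salinas)."""
    inp = keras.Input(input_shape)
    e1 = cse_block(inp, f)                                  # encoder stage 1
    d1 = layers.Conv2D(f, 2, strides=2)(e1)                 # downsample 32 -> 16
    e2 = cse_block(d1, 2 * f)                               # encoder stage 2
    d2 = layers.Conv2D(2 * f, 2, strides=2)(e2)             # downsample 16 -> 8
    b = cse_block(d2, 4 * f)                                # bottleneck (assumed)
    u2 = layers.Conv2DTranspose(2 * f, 2, strides=2)(b)     # upsample 8 -> 16
    c2 = cse_block(layers.Concatenate()([u2, e2]), 2 * f)   # skip connection
    u1 = layers.Conv2DTranspose(f, 2, strides=2)(c2)        # upsample 16 -> 32
    c1 = cse_block(layers.Concatenate()([u1, e1]), f)       # skip connection
    out = layers.Conv2D(num_classes, 1, activation="softmax")(c1)  # pixel-wise Softmax
    return keras.Model(inp, out)
```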

PCA
In PSE-UNet, we use PCA to reduce the dimensions of HSRSI to avoid the curse of dimensionality. The input HSRSI data X is converted into the corresponding principal component matrix Y by PCA, and the number of principal components to be retained is then selected by the CVCR to obtain the reduced-dimension data. PCA is one of the most widely used dimensionality reduction methods; it transforms the input data into linearly independent variables while retaining most of the information in the original data. The specific steps are as follows:

Firstly, the input data is standardized to obtain matrix X, in which the mean value of each row is zero, and a new matrix X' is constructed using the following equation:

X' = \frac{1}{\sqrt{n-1}} X^{T}

In the above equation, the mean value of each column of matrix X' is zero, and n is the sample size.

Secondly, the truncated singular value decomposition (SVD) of matrix X' is performed to obtain three matrices, U, Σ, and V:

X' = U \Sigma V^{T}

Finally, the first k columns of matrix V, denoted V_k, constitute the k sample principal components, and the principal component matrix Y is obtained with the following equation:

Y = V_{k}^{T} X

In addition, the CVCR is used to select the number of retained principal components, and is calculated as:

\mathrm{CVCR} = \sum_{i=1}^{k} \eta_i = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i}

where η_i is the variance contribution rate of the i-th principal component, λ_i is the eigenvalue of the i-th principal component, k is the number of selected principal components, and m is the total number of principal components.
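In practice, the CVCR-driven choice of k can be sketched with scikit-learn as follows; the function name and the flattened pixel layout (N pixels × B bands) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_by_cvcr(X_flat, cvcr=0.9999):
    """X_flat: (num_pixels, num_bands) HSRSI pixels; returns (reduced data, k)."""
    pca = PCA().fit(X_flat)                          # full decomposition
    cum = np.cumsum(pca.explained_variance_ratio_)   # cumulative variance contribution rate
    k = int(np.searchsorted(cum, cvcr) + 1)          # smallest k with CVCR >= threshold
    return pca.transform(X_flat)[:, :k], k

# On Salinas (204 bands), the paper reports k = 31 at a CVCR of 99.99%.
```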

C-SE Module
The C-SE module consists of convolutions, BN, PReLU, and an SE (Squeeze and Excitation) module [47], as shown in Figure 4. Since the HSRSI is multi-dimensional, an SE module, a lightweight channel attention mechanism, is introduced after the convolution module in the C-SE module to learn the interdependence between feature channels through the "Squeeze-and-Excitation" structure. During the squeeze, a global average pooling (GAP) layer compresses each channel of the input 2D feature map into a single real number, and during the excitation, two fully connected layers and a rectified linear unit (ReLU) [48] model the nonlinear relationship between the feature channels. Finally, the channel weights, normalized by the sigmoid function, are multiplied by the input feature map to enhance the related features and suppress the unrelated ones. Hence, the C-SE module can extract more discriminative semantic features and obtain better segmentation results.
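The squeeze-excitation-scale computation can be written out numerically as follows. This is a toy NumPy walk-through with random weights and an assumed reduction ratio of 4, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8, 32))           # H x W x C feature map from the convolution
# Squeeze: GAP compresses each channel's 2D map into one real number
z = fmap.mean(axis=(0, 1))                   # shape (C,)
# Excitation: FC -> ReLU -> FC -> sigmoid fits the channel interdependence
W1 = rng.normal(size=(32, 8))                # reduction ratio r = 4 (assumed)
W2 = rng.normal(size=(8, 32))
s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0.0) @ W2)))   # channel weights in (0, 1)
# Scale: multiply the weights back onto the feature map channel-wise,
# enhancing related features and suppressing unrelated ones
out = fmap * s
```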

Downsampling and Upsampling
Downsampling (or subsampling) is a way to reduce the resolution of images. The purpose of downsampling is to reduce the amount of computation and increase the receptive field. The most commonly used downsampling method is max pooling, in which a window of size n × n slides over the input image with a stride of n, and the maximum value within each window is taken as the pixel value at the corresponding position of the output image. In this paper, a convolution layer with a stride of 2 and a convolution kernel of size 2 × 2 is used instead of a pooling layer to downsample the input images. The image size is halved after each downsampling.
Contrary to downsampling, upsampling is a way to restore image resolution. Bilinear interpolation and transposed convolution are commonly used for upsampling. In bilinear interpolation, the coordinate values of the points to be interpolated are linearly interpolated along the X- and Y-axes to restore the image resolution. Transposed convolution is the reverse process of convolution: it decodes the features extracted by convolution to restore the image resolution. In this paper, a transposed convolution layer with a stride of 2 and a convolution kernel of size 2 × 2 is used for upsampling. The image size is doubled after each upsampling.
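The four options compared later in the experiments can be expressed in Keras as a short sketch; PSE-UNet adopts the two learnable variants.

```python
from tensorflow.keras import layers

x = layers.Input((32, 32, 64))
down_pool = layers.MaxPooling2D(pool_size=2)(x)        # 32 -> 16, parameter-free
down_conv = layers.Conv2D(64, 2, strides=2)(x)         # 32 -> 16, learnable (used here)

y = layers.Input((16, 16, 64))
up_bilinear = layers.UpSampling2D(2, interpolation="bilinear")(y)  # 16 -> 32, fixed
up_tconv = layers.Conv2DTranspose(64, 2, strides=2)(y)             # 16 -> 32, learnable (used here)
```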

Activation Function
An activation function is an important part of a CNN model, used to increase the model's nonlinear expression capacity. In this paper, PReLU [46], an improved version of ReLU [48] that can adaptively learn its parameters from the data, is used as the activation function of the new model. PReLU has the characteristics of fast convergence and a low error rate. Its calculation formula is as follows:

f(x_i) = \begin{cases} x_i, & x_i > 0 \\ a_i x_i, & x_i \le 0 \end{cases}

where a_i represents the parameter of a learnable rectified unit and i stands for different channels.
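In NumPy terms the formula reads as below; in Keras, the per-channel learnable slope is obtained by sharing the parameter over the spatial axes (a usage sketch).

```python
import numpy as np
from tensorflow.keras import layers

def prelu(x, a):
    # f(x_i) = x_i if x_i > 0, else a_i * x_i, with a_i learned per channel
    return np.where(x > 0, x, a * x)

act = layers.PReLU(shared_axes=[1, 2])   # one learnable a_i per channel (shared over H, W)
```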

Loss Function
The weighted cross-entropy loss function is used in this paper to reduce the impact of class imbalance in the hyperspectral dataset on model accuracy. First, the overall sample size is divided by the sample size of a single class, giving the reciprocal of that class's proportion of the overall sample size. Then, the base-10 logarithm of this ratio is taken as the weight of each land cover class:

w_k = \log_{10}\left(\frac{T}{t_k}\right)

Finally, the loss function based on the cross-entropy is obtained as:

\mathrm{Loss} = -\sum_{k=1}^{K} w_k \, y_k \log(p_k)

where K is the number of classes, y_k and p_k are the real and predicted values for class k, respectively, T is the overall sample size, and t_k is the sample size of a single class.
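A sketch of these two steps in the paper's Keras setting is shown below; the function names and the clipping constant are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

def class_weights(label_counts):
    """w_k = log10(T / t_k) from the per-class sample counts t_k."""
    t = np.asarray(label_counts, dtype=np.float64)
    return np.log10(t.sum() / t)

def weighted_cce(weights):
    """Weighted cross-entropy over one-hot targets and softmax outputs."""
    w = tf.constant(weights, dtype=tf.float32)       # shape (K,)
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)  # numerical stability
        per_pixel = -tf.reduce_sum(w * y_true * tf.math.log(y_pred), axis=-1)
        return tf.reduce_mean(per_pixel)
    return loss
```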

Parameter Setting of the Network
All experiments in this paper are completed under the open-source Keras deep learning framework. The experimental hardware is an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of video memory. Before training, the training data is augmented by rotating and flipping, and the network weight parameters are initialized with a He-normal initializer [46]. The network is trained with the weighted cross-entropy loss function and the Adam optimizer [49]. The batch size, initial learning rate, and weight decay rate are set to 256, 0.001, and 0.00001, respectively. When the validation loss does not decrease for 10 epochs, the learning rate is halved; training ends when the validation loss stabilizes.
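Assembled as a Keras sketch, the reported settings look roughly as follows. Here `pse_unet`, `weighted_cce`, and `class_weights` come from the earlier sketches; the data placeholders (`train_x`, `train_y`, `val_x`, `val_y`, `train_counts`), the callback arguments, and the exact weight decay keyword (which varies across Keras versions) are all assumptions.

```python
from tensorflow import keras

model = pse_unet()  # from the earlier sketch
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-5),
    loss=weighted_cce(class_weights(train_counts)),  # train_counts: per-class sample sizes
)
callbacks = [
    # halve the learning rate when the validation loss stalls for 10 epochs
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),
    # stop once the validation loss stabilizes (patience value assumed)
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=30, restore_best_weights=True),
]
model.fit(train_x, train_y, batch_size=256, epochs=300,
          validation_data=(val_x, val_y), callbacks=callbacks)
```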

Evaluation Metrics
Five metrics, namely the Kappa coefficient (Kappa), mean Intersection over Union (mIoU), weighted average precision (WAP), weighted average recall (WAR), and weighted average F1-score (WAF), are used to validate the performance of the proposed model. Kappa measures the consistency between the predicted and real values of multi-class classification models. The mIoU is the standard measure in the field of semantic segmentation. These two common metrics are not listed here. The WAP, WAR, and WAF measure the performance of multi-class classification models with serious class imbalance, and are calculated as:

\mathrm{WAP} = \sum_{k=1}^{K} \frac{t_k}{T} \cdot \frac{TP_k}{TP_k + FP_k}

\mathrm{WAR} = \sum_{k=1}^{K} \frac{t_k}{T} \cdot \frac{TP_k}{TP_k + FN_k}

\mathrm{WAF} = \sum_{k=1}^{K} \frac{t_k}{T} \cdot \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}

where K is the number of classes, T is the overall sample size, t_k is the sample size of class k, and TP_k, FP_k, and FN_k are the true positives, false positives, and false negatives of class k, respectively.
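Given flattened per-pixel label vectors, the five metrics can be computed with scikit-learn as a sketch, treating WAP, WAR, and WAF as the support-weighted averages defined above.

```python
from sklearn.metrics import (precision_recall_fscore_support,
                             cohen_kappa_score, jaccard_score)

def evaluate(y_true, y_pred):
    """y_true, y_pred: flattened integer class labels per pixel."""
    wap, war, waf, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {
        "Kappa": cohen_kappa_score(y_true, y_pred),
        "mIoU": jaccard_score(y_true, y_pred, average="macro"),  # mean IoU over classes
        "WAP": wap, "WAR": war, "WAF": waf,
    }
```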

Selecting the Dimension of the Salinas Dataset
In order to select the appropriate dimension, it is necessary to study the influence of the hyperspectral Hughes phenomenon on model accuracy in semantic segmentation. In this paper, the CVCR is used as the dimensionality reduction standard of PCA. CVCRs of 99%, 99.9%, 99.99%, 99.999%, and 100% are adopted to reduce the dimension of the Salinas dataset, yielding five groups of data. The five data groups are used as the input of the proposed algorithm to compare segmentation performance under different dimensions. The experimental results (Table 2) show that, with the increase of dimension, the accuracy evaluation metrics of the five groups all first increase and then decrease. The optimal evaluation metrics are obtained when the dimension is reduced to 31. When the CVCR is low, dimensionality reduction loses too much information, resulting in poor segmentation performance. When the CVCR is high, the model learns too many nonlinear features from the small number of training samples, and the curse of dimensionality causes overfitting, which degrades classifier performance. Compared with 31D, which shows the best performance, the five accuracy evaluation metrics at 3D and 204D differ considerably. The results demonstrate that selecting the appropriate dimension can effectively reduce the impact of the Hughes phenomenon on the land cover classification accuracy of HSRSI.
Figure 6 shows the visualized segmentation results with different CVCRs. In the segmentation map with a CVCR of 99%, objects with similar features, including Broccoli_green_weeds_1 and Broccoli_green_weeds_2, Fallow and Fallow_smooth, and Grapes_untrained and Vineyard_untrained, are seriously misclassified (marked with red circles in Figure 6); when the CVCR increases to 99.9% and 99.99%, the misclassification decreases. However, when the CVCR reaches 99.999%, obvious misclassification of Grapes_untrained and Vineyard_untrained appears again. The segmentation results without dimensionality reduction also show that objects with similar features are misclassified (red circles in Figure 6). In general, misclassification mainly occurs among classes with similar features and, with increasing dimension, first decreases and then increases. It can be observed from the visualized maps of segmentation performance and the segmentation results of the model with different CVCRs that the PSE-UNet segmentation results are the best when the CVCR used for dimensionality reduction of the Salinas dataset is 99.99%. Therefore, 31D is selected for dimensionality reduction with PCA in subsequent experiments on the Salinas dataset.


Comparison of experimental results of different semantic segmentation models
Taking the Salinas dataset as the basic data source, the PSE-UNet network proposed in this paper is compared with FCN-8S [10], SegNet [11], UNet [12], 3D-UNet [38], and SS3FCN [35]. Among them, FCN-8S, SegNet, and UNet are three classical semantic segmentation networks, 3D-UNet performs well in semantic segmentation of medical hyperspectral images, and SS3FCN is an advanced method for HSRSI semantic segmentation. In order to ensure the objectivity of the experimental results, all models were trained with the same network parameter settings, the input data were patches of 32 × 32, and the CVCR was set to 99.99% (31D) for dimensionality reduction with PCA. Each model was tested five times independently, and the final results are the averages of the five runs. The five evaluation metrics described above were used to assess accuracy. As shown in Table 3, compared with the other five segmentation models, the proposed model achieves the best scores on all five metrics, with a Kappa coefficient, WAP, WAR, WAF, and mIoU of 93.359%, 95.348%, 95.218%, 95.238%, and 88.508%, respectively. Compared with the suboptimal 3D-UNet algorithm, the Kappa coefficient, WAP, WAR, WAF, and mIoU of the proposed algorithm are increased by 1.943%, 1.352%, 1.402%, 1.434%, and 2.846%, respectively. In addition, among the networks that also use the "encoder-decoder" structure (UNet, 3D-UNet, and PSE-UNet), the number of parameters of the 3D-UNet network using 3D convolution is three times that of the UNet network, while the number of parameters of the network in this paper is only 4.5 M, less than two-thirds of that of the UNet network. Among the six algorithms, the algorithm proposed in this paper achieves the best accuracy and outstanding segmentation performance.

Analysis of land cover classification results
In order to comprehensively analyze the recognition accuracy of the proposed algorithm for different land features, the segmentation results of FCN-8S, SegNet, UNet, 3D-UNet, SS3FCN, and the proposed algorithm on 17 land cover classes are compared. The experimental results are shown in Table 4. The proposed algorithm has the highest F1-score in the recognition of 11 classes; it is slightly inferior to 3D-UNet in the classification accuracy of 3 classes, with a difference of no more than 1%, and has lower classification accuracy than SS3FCN on 3 classes. For easily distinguishable land features, such as Soil_vineyard_develop and Corn_senesced_green_weeds, the segmentation accuracy is slightly improved. The improvement is more pronounced for easily misclassified classes, such as Lettuce_romaine in different growth periods: for Lettuce_romaine_5wk and Lettuce_romaine_6wk, the segmentation accuracy is increased by 6.424% and 3.812%, respectively, compared with the suboptimal algorithm. The results show that the C-SE module used in the proposed algorithm integrates spatial and spectral features more effectively and can extract more discriminative features. In addition, the confusion matrix in Figure 7 shows that the proposed algorithm achieves good segmentation performance for most classes with little misclassification between different classes. For some patches affected by the interaction between the background and land features, there is relatively more misclassification, and the classification accuracy is slightly reduced. For example, the features of Fallow are similar to unlabeled features in the background, and the contour of Lettuce_romaine is a long strip with a small sample size; therefore, there is relatively more misclassification between these two classes and the background.

Comparison of visualized semantic segmentation results with different models
Figure 8 shows the visualized semantic segmentation results of the different algorithms. Compared with the other five algorithms, the results of the PSE-UNet algorithm show clear contours of land cover features, less misclassification, and the best segmentation performance. The FCN-8S algorithm does not adopt the encoder-decoder structure, and the contours and textures of land cover objects are not as clear as those of the other algorithms. The SegNet algorithm lacks skip connections to integrate deep features, resulting in the loss of detailed features after several rounds of downsampling and a poor visual segmentation effect. Compared with FCN-8S and SegNet, the SS3FCN algorithm generates more accurate contours in the segmentation maps but with more salt-and-pepper noise, while the UNet algorithm, which uses the symmetrical "encoder-decoder" structure, achieves excellent performance. Further comparing the experimental results of the three algorithms with the same structure, the UNet algorithm performs poorly in extracting the contour gaps of Lettuce_romaine in different periods (marked with the ellipse in Figure 8), and misclassification easily occurs at the intersection of fine boundaries with the 3D-UNet network (marked with white circles in Figure 8). The PSE-UNet algorithm has less misclassification, and the contours of land cover features match the real labels well, mainly because PSE-UNet adopts the C-SE module, which can extract more discriminative features and leads to better segmentation results.

Effects of downsampling times on model performance
In semantic segmentation, an overly small receptive field may lead to the loss of global information, and an overly large receptive field may degrade the segmentation performance of the model for small targets. Therefore, it is necessary to select appropriate downsampling times to obtain better segmentation accuracy. Downsampling schemes of 0, 1, 2, 3, and 4 times are set in the experiment to obtain the corresponding segmentation accuracies of the PSE-UNet model. As shown in Table 5, the performance of the model first increases and then decreases with the increase of downsampling times. The optimal segmentation accuracy is obtained with two rounds of downsampling.

Effects of different downsampling and upsampling methods on model performance
To study the effect of different downsampling and upsampling methods on the performance of the model, the segmentation performance using max pooling and convolution as the downsampling methods, and bilinear interpolation and transposed convolution as the upsampling methods, is compared and analyzed. The experimental results are shown in Table 6. Owing to the learning ability of the convolution operation during downsampling, better segmentation performance is obtained using a convolution layer instead of a pooling layer. The model using transposed convolution for upsampling learns more nonlinear features than the model using bilinear interpolation, which improves the accuracy metrics. The model using convolution and transposed convolution for downsampling and upsampling, respectively, outperforms the other three combinations.

Effects of different activation functions on model performance
To verify the impact of different activation functions on model performance, PReLU [46] and ReLU [48] are selected as the activation functions for comparative experiments. The results in Table 7 show that every segmentation accuracy metric of the model is higher when using PReLU as the activation function than when using ReLU. By adding a small number of parameters, the PReLU function overcomes the problem that the gradient of the ReLU function is 0 for negative inputs and improves the performance of the model.

Conclusions
Currently, finding efficient and intelligent methods for the classification of HSRSI is one of the research focuses in remote sensing. Research on semantic segmentation of HSRSI remains insufficient, and there is still much room for improvement in algorithm structure. Therefore, considering the successful application of the UNet algorithm in the classification of 3D medical images, this paper improves the dataset partitioning strategy for the classification of HSRSI based on the non-overlapping sliding window strategy. This paper introduces the CVCR as the standard for PCA dimensionality reduction and discusses how the classification accuracy of HSRSI changes with different dimensions. The symmetrical "encoder-decoder" structure is introduced into the classification of HSRSI, based on which a new semantic segmentation algorithm, PSE-UNet, is proposed. In addition, the effects of downsampling times, different downsampling and upsampling methods, and different activation functions on the performance of the proposed PSE-UNet model are discussed. Experiments are carried out on the Salinas dataset, and the results show that:

1. Based on the non-overlapping sliding window strategy, the judgment mechanism is introduced to improve the patch allocation scheme, which overcomes the disadvantage that not all classes may be included in the training set after the patches are randomly allocated, effectively avoiding information leakage;
2. When selecting different cumulative variance contribution rates for dimensionality reduction with PCA, the segmentation accuracy first increases and then decreases with the increase of the dimension of the dataset. The segmentation results are the best when the CVCR is 99.99%, indicating that choosing the appropriate dimension can effectively weaken the influence of the Hughes phenomenon on the classification accuracy of HSRSI;
3. The segmentation performance of the PSE-UNet algorithm is better than that of the other five popular semantic segmentation algorithms, showing better segmentation accuracy and visualization effect, and less misclassification of land cover classes. Two rounds of downsampling, convolution and transposed convolution for downsampling and upsampling, respectively, and PReLU as the activation function effectively improve the segmentation accuracy of the PSE-UNet algorithm on the Salinas dataset.
In the semantic segmentation experiments on the Salinas dataset, the approach proposed in this paper shows excellent segmentation performance and can be applied to other semantic segmentation tasks of HSRSI. Different from some existing studies, the dataset partitioning strategy used in this paper retains the background pixels, which is more in line with actual application scenarios. The comprehensive study of the Hughes phenomenon in this paper can provide a reference for determining the dimension of the dataset. The proposed PSE-UNet model considers the small sample sizes and high dimensionality of HSRSI. The symmetrical "encoder-decoder" structure and the channel attention mechanism adopted in the proposed model have significant application potential in the semantic segmentation of HSRSI. However, the proposed model still has some problems that need to be studied further in the future, such as low segmentation accuracy for low-frequency land cover features, parameter redundancy, and unvalidated generalization ability.