Modifications of the Multi-Layer Perceptron for Hyperspectral Image Classification

Abstract: Recently, many convolutional neural network (CNN)-based methods have been proposed to tackle the classification of hyperspectral images (HSI). In fact, the CNN has become the de facto standard for HSI classification, and traditional neural networks such as the multi-layer perceptron (MLP) seem not to be competitive for this task. However, in this study, we try to show that the MLP can achieve good HSI classification performance if it is properly designed and improved. The proposed Modified-MLP for HSI classification contains two special parts: spectral-spatial feature mapping and spectral-spatial information mixing. Specifically, for spectral-spatial feature mapping, each input HSI sample is divided into a fixed-length sequence of 3D patches, and a linear layer is then used to map the 3D patches to spectral-spatial features. For spectral-spatial information mixing, all the spectral-spatial features within a single sample are fed into a purely MLP-based architecture to model the spectral-spatial information across patches for the subsequent HSI classification. Furthermore, to obtain abundant spectral-spatial information at different scales, Multiscale-MLP is proposed to aggregate neighboring patches with multiscale shapes. In addition, Soft-MLP is proposed to further enhance the classification performance by applying a soft split operation, which flexibly captures the global relations of patches at different positions in the input HSI sample. Finally, label smoothing is introduced to mitigate the overfitting problem in the Soft-MLP (Soft-MLP-L), which greatly improves the classification performance of the MLP-based method. The proposed Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L are tested on three widely used hyperspectral datasets. The proposed Soft-MLP-L achieves the highest overall accuracy (OA), outperforming the CNN by 5.76%, 2.55%, and 2.5% on the Salinas, Pavia, and Indian Pines datasets, respectively. The obtained results reveal that the proposed models provide competitive results compared to the state-of-the-art methods, which shows that MLP-based methods are still competitive for HSI classification.


Introduction
Hyperspectral sensors are able to capture hyperspectral images (HSI) with abundant spectral and spatial information, which can accurately characterize and identify different land covers. This rich information makes HSI useful in a wide range of applications, including agriculture (e.g., crop classification [1] and detection of water quality conditions [2]), the food industry (e.g., characterizing product quality [3]), water and maritime resources management (e.g., water quality analysis [4] and sea ice detection [5]), forestry and environmental management (e.g., health of forests [6] and infestations in plantation forestry [7]), security and defense applications (e.g., identification of man-made materials [8]), and vision technology (e.g., 3D reconstruction [9] and image detection [10]).
HSI classification refers to the task of assigning a category to each pixel in the scene. Because it is a basic procedure in many applications, HSI classification is a fundamental and hot topic in the remote sensing community [11].
A large number of methods have been proposed for HSI classification, and most of them are supervised learning-based methods [12]. Accurate HSI classification has two important parts: discriminative feature extraction and a robust classifier. For feature extraction, many morphological operations, including morphological profiles (MPs) [13], extended MPs (EMPs) [14], extended multi-attribute profiles (EMAPs) [15], and extinction profiles (EPs) [16], have been developed to extract the HSI features. As a robust classifier, the support vector machine (SVM) has been widely used because of its low sensitivity to high dimensionality [17].
Recently, various studies have demonstrated the success of deep learning-based methods for HSI classification, such as the stacked auto-encoder [18] and the deep belief network [19]. Moreover, the recurrent neural network has shown promising performance in learning from hyperspectral images treated as sequential data [20]. Among the deep learning-based methods for HSI classification, the convolutional neural network (CNN), with its local-connection and shared-weight architecture, has become the mainstream approach for classifying HSI [21].
Depending on the input information of the models, CNN-based HSI classification methods can be divided into three types: the spectral CNN, the spatial CNN, and the spectral-spatial CNN [22]. The spectral CNN-based approaches apply the CNN to the spectra of the HSI to extract discriminative spectral features. For instance, in [23], the spectral information of each pixel is extracted by a CNN with only five convolutional layers. In addition, Li et al. presented a novel pixel-pair strategy for HSI classification using a deep CNN, which provides excellent performance [24].
The second type of CNN-based approach for HSI classification is the spatial CNN. Since the HSI contains abundant spatial information, many studies use 2D convolution layers to extract the spatial features of the HSI from a local cube in the spatial domain [25][26][27]. For instance, in [27], principal component analysis was first applied to reduce the dimension of the HSI; then, a 2D convolution layer was utilized to extract the spatial features of a fixed neighborhood of each pixel.
The last type of CNN-based approach for classifying HSI is the spectral-spatial CNN, which extracts the spectral and spatial HSI features in a unified framework [28,29]. Here, 3D convolutions have been used to extract the spectral and spatial features of the HSI simultaneously [30][31][32][33][34]. In [30], a 3D CNN was utilized in a range of effective spectral-spatial representative band groups to extract spectral-spatial features. In addition, a spectral-spatial residual network was designed with identity mapping to improve accuracy [31]. Because such models contain many parameters, a light 3D CNN has been proposed to reduce the computational cost [34]. Recent works have demonstrated that CNNs have enabled many breakthroughs in HSI classification tasks, yielding great successes [35,36].
Although CNN-based methods have achieved good performance for HSI classification, questions still remain. Firstly, the CNN is a kind of network inspired by the biological visual system; therefore, it is well suited to image feature extraction. However, HSI, which contains spectral-spatial information, is quite different from an "ordinary" image. Specifically, the CNN uses local connections and shared weights to efficiently extract the features of images [37], and the effectiveness of this mechanism for HSI processing should be investigated. Secondly, the CNN is able to capture local structure through its inductive bias, but it has no advantage in handling long-range interactions between arbitrary positions in a single input, since its receptive fields are limited [38]. Thus, in this study, the MLP [39], which is a kind of neural network with fewer constraints, is investigated for HSI classification.
The MLP is one of the basic types of neural networks. In recent years, CNNs have usually obtained better classification performance than other types of deep learning models. However, the MLP has been proven to be a promising machine learning technique [40][41][42]. For example, in [41], an artificial neural network MLP architecture was presented with time optimization, which demonstrates the best time results. Furthermore, Kalaiarasi et al. proposed a frost-filtered scale-invariant feature transformation-based MLP classification technique that applies the frost filtering technique and the Euclidean distance between feature vectors, which improved the classification accuracy with minimum time [42]. However, the aforementioned studies still use the traditional MLP architecture with only a few hidden layers.
In this paper, we show that while convolutions are sufficient for good performance, they are not necessary. We show that the CNN has an inductive bias toward localized processing, while the proposed MLP can learn long-range interactions between different patches by allowing the patches to communicate with each other. Instead of simply reusing the traditional MLP, the proposed MLP-based methods adopt a new architecture and achieve better results than the CNN for HSI classification.
As a summary, the following are the main contributions of this study.
The rest of this paper consists of three sections. Section 2 introduces the proposed Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L. Afterward, the experimental settings, results, and analyses are presented in Section 3. Finally, Section 4 concludes this study.

The Proposed Modified-MLP for HSI Classification
Figure 1 shows the overall architecture of the proposed Modified-MLP for HSI classification. There are three core parts in the Modified-MLP: spectral-spatial feature mapping, spectral-spatial information mixing, and the subsequent classification. We split each input HSI sample into fixed-size 3D patches. Then, each patch is linearly projected by the spectral-spatial feature mapping layer, resulting in a series of spectral-spatial features. Next, these features are fed into the spectral-spatial information mixing part to capture the spectral-spatial feature interactions between different patches. After that, the classification results are obtained by the fully connected layer. The details are explained as follows.
Suppose the HSI dataset is of size H × W × nBand, where H and W indicate the spatial height and width, respectively, and nBand is the number of bands. Firstly, a single sample (i.e., I) is generated for each pixel in the HSI by taking a neighborhood window of fixed size, whose shape is h × w × nBand. Secondly, a sample of the HSI is split into l non-overlapping patches, $P = \{P_1, P_2, \ldots, P_l\}$, where l is calculated as in Equation (1):
$$l = \frac{h \times w}{p^{2}} \qquad (1)$$
Each patch has a shape of p × p × nBand, and the patch size p is determined empirically. For instance, if the size of a single HSI sample I is 32 × 32 × nBand (i.e., h = 32, w = 32) and the patch size p is set to 4, then l = 64.
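The non-overlapping split and the patch count l = (h × w) / p² can be illustrated with a minimal NumPy sketch. The function name `split_into_patches`, the random sample, and the band count of 103 are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def split_into_patches(sample: np.ndarray, p: int) -> np.ndarray:
    """Split an (h, w, n_band) sample into non-overlapping (p, p, n_band) patches."""
    h, w, n_band = sample.shape
    if h % p or w % p:
        raise ValueError("h and w must be divisible by the patch size p")
    # (h/p, p, w/p, p, n_band) -> (h/p, w/p, p, p, n_band) -> (l, p, p, n_band)
    grid = sample.reshape(h // p, p, w // p, p, n_band).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p, p, n_band)

# Illustrative sample: 32 x 32 window, n_band = 103 chosen only for the example.
sample = np.random.default_rng(0).random((32, 32, 103))
patches = split_into_patches(sample, p=4)
print(patches.shape)  # (64, 4, 4, 103), i.e. l = 32 * 32 / 4**2 = 64
```

Patches are ordered row by row, so `patches[1]` is the block to the right of `patches[0]`.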
Thirdly, all the patches are fed into a spectral-spatial mapping layer independently; each patch is linearly projected into a hidden d-dimensional spectral-spatial feature space. Let $X \in \mathbb{R}^{l \times d}$ represent the output features obtained by the spectral-spatial mapping layer of the Modified-MLP, where l is the number of patches and d indicates the hidden dimension of the proposed Modified-MLP.
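As a rough sketch of this mapping step, each patch can be flattened and multiplied by one shared weight matrix. The names `W_map` and `b_map`, the random initialization, and the sizes below are assumptions made for illustration; in practice the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
l, p, n_band, d = 64, 4, 103, 128        # illustrative sizes only

patches = rng.random((l, p, p, n_band))  # stand-in for the l patches of one sample
W_map = rng.normal(scale=0.02, size=(p * p * n_band, d))  # hypothetical weights
b_map = np.zeros(d)                                        # hypothetical bias

# Every patch is flattened and projected independently by the same linear layer.
X = patches.reshape(l, -1) @ W_map + b_map
print(X.shape)  # (64, 128): one d-dimensional feature per patch
```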
Then, the obtained features X are passed through the well-designed Modified-MLP for spectral-spatial information mixing. There are B blocks with the same size and structure, and each block is composed of two different MLPs (i.e., MLP1 and MLP2). The value of B determines the depth of the model. Figure 1 shows two blocks (i.e., B = 2) as an example. Each block begins with a layer normalization and is followed by a residual connection. Every block can be defined as follows:
$$Z = \sigma(XU), \qquad \tilde{Z} = s(Z), \qquad Y = \tilde{Z}V$$
where σ indicates the Gaussian error linear unit (GELU) [43] activation function, U indicates the weight matrix of MLP1 with $d'$-dimensional output, and V refers to the weight matrix of MLP2. Different from the classical MLP architecture, skip connections and a normalization layer [44] are used in this study to ensure stable training. In particular, s(·) is the key element for capturing the spectral-spatial interactions between different patches in the proposed Modified-MLP, which is built out of basic MLP layers with gating. To capture the crossed spectral-spatial information, s(·) must contain an operation over the spectral-spatial dimension. The simplest option is a linear projection:
$$f(Z) = WZ + b$$
where $W \in \mathbb{R}^{l \times l}$, l indicates the number of patches, and b is the bias. In this study, the crossed information is defined as follows:
$$s(Z) = Z_1 \odot f(Z_2)$$
where ⊙ indicates element-wise multiplication. This operation is inspired by the Gated Linear Units (GLUs) [45], which define s(·) as a spatial depth-wise convolution. Instead of using the convolution operation, we redefine the operation as an element-wise multiplicative interaction to capture the crossed spectral-spatial information. Similar to the LSTMs, these gates multiply each element of the weight matrix W and control the information passed on. Here, Z is split into two independent parts $(Z_1, Z_2)$ along the channel dimension to capture the spectral-spatial relationships effectively. $Z_1$ and $Z_2$ have the same shape, each taking half of the channel dimension of Z. Thus, we set $d' = 2d$ to maintain the value of the input dimension. $f(Z_2)$ can be viewed as the gating function, because each element of $Z_1$ is rescaled by the corresponding element of $f(Z_2)$ through the multiplicative gating. Therefore, the features have strong correlations with each other. Finally, the proposed Modified-MLP applies a normalization layer to alleviate the vanishing gradient problem in the training procedure. At last, an average pooling layer and a fully connected layer are used for classifying the HSI.
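To make the block structure concrete, here is a minimal NumPy sketch of one block, assuming the gating takes the form s(Z) = Z1 ⊙ (W Z2 + b) with Z split in half along the channel dimension. The tanh approximation of GELU, the layer normalization without learnable scale and shift, the zero/one initialization of `W` and `b`, and all sizes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each row (one patch feature vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gated_block(X, U, V, W, b):
    """LayerNorm -> MLP1 with GELU -> gating s(.) -> MLP2 -> residual connection."""
    Z = gelu(layer_norm(X) @ U)       # MLP1: (l, d) -> (l, 2d)
    Z1, Z2 = np.split(Z, 2, axis=-1)  # two (l, d) halves along the channel dimension
    gate = W @ Z2 + b                 # f(Z2): linear projection across the l patches
    return X + (Z1 * gate) @ V        # element-wise gating, MLP2, then the skip path

rng = np.random.default_rng(0)
l, d = 64, 128                        # illustrative sizes
X = rng.normal(size=(l, d))
U = rng.normal(scale=0.02, size=(d, 2 * d))
V = rng.normal(scale=0.02, size=(d, d))
W = np.zeros((l, l))                  # hypothetical init: the gate starts as all ones
b = np.ones((l, 1))
Y = gated_block(X, U, V, W, b)
print(Y.shape)  # (64, 128)
```

With `W` set to zero and `b` to one, the gate is the identity and the block reduces to an ordinary two-layer MLP on each patch; any nonzero `W` lets patches exchange information.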
Specifically, the normalization layer is used to normalize the input features, which not only reduces the training time by normalizing neurons, but also alleviates the vanishing or exploding gradient problem [44].
The average pooling layer aggregates the spectral-spatial information by taking the average of each feature map. Its advantage is that it has no parameters to optimize.
The outputs of the average pooling layer are vectorized and fed into fully connected layers and then a softmax layer is used to finish the HSI classification task.
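The classification head described above can be sketched as follows, assuming the pooling averages over the l patch features and a single fully connected layer produces the class scores. The names `W_fc` and `b_fc`, the 16 classes, and the random weights are illustrative assumptions.

```python
import numpy as np

def classification_head(Y, W_fc, b_fc):
    """Average-pool over patches, apply a fully connected layer, then softmax."""
    pooled = Y.mean(axis=0)                  # (d,): parameter-free average pooling
    logits = pooled @ W_fc + b_fc            # (n_classes,)
    shifted = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return shifted / shifted.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(64, 128))               # stand-in for the output of the last block
W_fc = rng.normal(scale=0.02, size=(128, 16))
b_fc = np.zeros(16)
probs = classification_head(Y, W_fc, b_fc)
print(probs.shape, int(probs.argmax()))      # class probabilities and predicted index
```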

The Proposed Multiscale-MLP for HSI Classification
The proposed Modified-MLP extracts the spectral-spatial features of each patch independently. However, the Modified-MLP uses a fixed scale (i.e., p) to prepare the training data. Therefore, the Modified-MLP can be enhanced with different scales to fully extract the multiscale spectral-spatial features of the HSI inputs. In this section, Multiscale-MLP is proposed, which aims to capture the spectral-spatial information in patches with different scales. Figure 2 shows the framework of the proposed Multiscale-MLP for HSI classification. Specifically, an input HSI sample is divided into fixed-size 3D patches with different patch sizes. Next, each patch is linearly projected by the spectral-spatial feature mapping layer. Then, the features obtained at the different scales are fused so that the patch-level spectral-spatial features are represented in a multiscale manner. After that, these features are sent to the spectral-spatial information mixing part and the subsequent classification. The detailed steps are explained as follows. Firstly, a value of the patch size is set. Thus, there are l patches, where l can be calculated by Equation (1).
Secondly, a spectral-spatial mapping layer is applied to project each $P_i$ into a fixed dimension:
$$t_i = \mathrm{Mapping}(P_i), \qquad i = 1, 2, \ldots, l$$
where Mapping is the spectral-spatial mapping operation, $t_i$ refers to the output features of each $P_i$, and $T_i$ indicates a mapping set $T_i = \{t_1, t_2, \ldots, t_l\}$ of dimension l × d, whose length l and hidden dimension d are both determined by the experiments. Thirdly, we reset the patch size p to different values and repeat the above two steps; hence, there are several $T_i$ at n different scales (e.g., n = 3, with l = 4, 9, and 16 in Figure 2), where n is determined by the number of different patch sizes. The $T_i$ are then fused to generate T. Finally, T is fed into the spectral-spatial mixing part to finish the HSI classification task. By using multiple patches of different sizes, the proposed Multiscale-MLP can obtain abundant spectral-spatial information.
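The multiscale procedure can be sketched as below. The fusion operator is not recoverable from the text here, so this sketch ASSUMES fusion by concatenating the per-scale feature sets along the patch axis; that choice, the patch sizes (4, 8, 16), the per-scale random weights, and the function names are all illustrative assumptions.

```python
import numpy as np

def split_into_patches(sample, p):
    """Split an (h, w, n_band) sample into non-overlapping (p, p, n_band) patches."""
    h, w, n_band = sample.shape
    grid = sample.reshape(h // p, p, w // p, p, n_band).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p, p, n_band)

def multiscale_features(sample, patch_sizes, d, rng):
    """Map the patches of every scale to d dimensions and fuse the results."""
    features = []
    for p in patch_sizes:
        flat = split_into_patches(sample, p).reshape(-1, p * p * sample.shape[2])
        # One hypothetical mapping matrix per scale, since the input size differs.
        W_map = rng.normal(scale=0.02, size=(flat.shape[1], d))
        features.append(flat @ W_map)        # T_i with shape (l_i, d)
    # ASSUMPTION: fusion by concatenation along the patch axis.
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
sample = rng.random((32, 32, 103))           # n_band = 103 is illustrative
T = multiscale_features(sample, patch_sizes=(4, 8, 16), d=128, rng=rng)
print(T.shape)  # (84, 128): 64 + 16 + 4 patches across the three scales
```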

The Proposed Soft-MLP for HSI Classification
To further improve the HSI classification performance and make full use of the information at different scales in the network, Soft-MLP is proposed with a key part named the soft split operation, which not only structures the input into patches of variable length, but also fully uses the spectral-spatial information in an overlapping style. Figure 3 displays a detailed description of the soft split. Each patch is produced by combining overlapped pixel vectors. Details of the proposed soft split operation are described as follows.
The input HSI sample I can be described as $I = \{I_1, I_2, \ldots, I_q\}$, where $I_i$ (i = 1, ..., q) represents a pixel vector of the given I and q = h × w. Suppose a sample I has the shape of 32 × 32 × nBand; a single sample then contains q = 1024 pixel vectors, and the pixel vectors can be regrouped in any combination. Next, a patch can be formed by combining different pixel vectors. In Figure 3, four pixel vectors (i.e., 1, 2, 33, and 34) in the input HSI sample are concatenated to form a new patch with size p × p × nBand. By applying the soft split, l can be calculated as follows:
$$l = \left\lfloor \frac{h - p}{s} + 1 \right\rfloor \times \left\lfloor \frac{w - p}{s} + 1 \right\rfloor$$
where p is the patch size, h and w are the spatial height and width of an HSI sample, respectively, and s refers to the stride (i.e., s = 1 in Figure 3). Since the proposed Soft-MLP splits the input HSI sample into overlapping patches, each patch is correlated with its surrounding patches, which establishes the prior knowledge that there should be stronger correlations between surrounding patches. Thus, the local information can be aggregated from surrounding patches, which is helpful for extracting features with multiscale information. Finally, all the patches are fed into the spectral-spatial mapping layer and the spectral-spatial mixing layer to finish the HSI classification task, similarly to the Multiscale-MLP. The proposed Soft-MLP can transform the input HSI sample with different orders and sizes with overlap; therefore, the Soft-MLP helps to make full use of the global relations of different patches in the input HSI sample.
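A minimal sketch of the soft split as a sliding window without padding is given below. The absence of padding, the function name `soft_split`, and the three-band toy sample are assumptions made for illustration.

```python
import numpy as np

def soft_split(sample, p, s):
    """Extract overlapping (p, p, n_band) patches with stride s and no padding."""
    h, w, _ = sample.shape
    rows = range(0, h - p + 1, s)
    cols = range(0, w - p + 1, s)
    return np.stack([sample[i:i + p, j:j + p] for i in rows for j in cols])

# Toy sample: 32 x 32 window with 3 bands so that values are easy to inspect.
sample = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
patches = soft_split(sample, p=2, s=1)
print(patches.shape)  # (961, 2, 2, 3): ((32 - 2) / 1 + 1) ** 2 = 31 ** 2 patches
```

With p = 2 and s = 1, the first patch contains the pixel vectors at row-major positions 1, 2, 33, and 34 (counting from 1), which matches the example given for Figure 3.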

The Proposed Soft-MLP-L for HSI Classification
MLP-based methods always contain many learnable parameters, and with limited training samples this results in an overfitting problem. To mitigate this problem, we introduce label smoothing into the proposed MLP-based methods.
For HSI classification, let x represent an HSI training sample and $y \in \{1, 2, \ldots, C\}$ its corresponding label, where C indicates the number of classes. Here, y can be represented as a C-dimensional one-hot vector:
$$y_k = \delta_{k,y}$$
where k = 1, 2, ..., C, and $\delta_{k,y}$ indicates the discrete Dirac delta function, which equals 1 for k = y and 0 otherwise. When we apply label smoothing for HSI classification, the smoothed label $y'_k$ is used instead of the original label $y_k$, where $y'_k$ is a mixture of the original label and a fixed uniform distribution over C − 1 classes, and ε represents the smoothing factor [46].
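One common way to realize such a mixture is sketched below: the true class keeps probability 1 − ε and the remaining mass ε is spread uniformly over the other C − 1 classes. This particular form is an assumption, since the exact smoothing equation is not recoverable from the text; the function name and the 0-based class index are also illustrative.

```python
def smooth_label(y, num_classes, eps=0.1):
    """Return a smoothed label vector for the 0-based class index y.

    ASSUMED form: 1 - eps on the true class, eps / (C - 1) on every other class.
    """
    if not 0 <= y < num_classes:
        raise ValueError("y must be a valid 0-based class index")
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == y else off_value for k in range(num_classes)]

smoothed = smooth_label(y=2, num_classes=5, eps=0.1)
print(smoothed)
```

Setting `eps=0.0` recovers the ordinary one-hot vector, and the entries always sum to 1.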

Data Description
Three widely used datasets, namely the Salinas, Pavia University (Pavia), and Indian Pines datasets, are used to test the effectiveness of the proposed methods. The detailed information of each dataset is presented as follows.
The Salinas dataset was obtained by the Airborne Visible-Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley. This dataset includes 512 × 217 pixels, and its spatial resolution is 3.7 m. The 20 noisy spectral bands have been removed, leading to 204 bands in the range of 0.2-2.4 µm. The ground truth consists of 16 classes of interest. Figure 4 displays the false-color composite image and the available ground-truth map. Table 1 gives the number of samples for each class.
The Pavia dataset was captured by the ROSIS-3 sensor over Pavia University and is composed of 103 bands with 610 × 340 pixels after removing the low signal-to-noise ratio (SNR) bands. The spatial resolution of this dataset is 1.3 m, and nine classes were chosen in the ground truth in the experiments. The false-color composite image and the reference map are shown in Figure 5. The samples are reported in Table 2.
The Indian Pines dataset was acquired by the AVIRIS sensor in 1992. This dataset is 145 × 145 pixels with 220 bands ranging from 0.2-2.4 µm. After removing the water absorption and low signal-to-noise ratio bands (bands 104-108, 150-163, and 220), 200 bands remain. The ground truth contains 16 land cover types. Figure 6 displays the false-color composite image and the ground truth map. The detailed number of available pixels in each class is reported in Table 3.

Implementation Details
In the experiments, each dataset is separated into three parts: the training samples, the validation samples, and the test samples. We randomly choose 200 training samples from all the labeled samples, and 50 samples are randomly selected from the remaining ones as the validation samples; the rest are considered the test samples. The model that performs best on the validation samples is evaluated on the test samples.
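The random split can be sketched in a few lines. The function name, the fixed seed, and the count of 10,000 labeled samples are hypothetical values used only for illustration; whether the original sampling is stratified per class is not stated, so plain random sampling is assumed.

```python
import random

def split_indices(n_labeled, n_train=200, n_val=50, seed=0):
    """Randomly split labeled-sample indices into training, validation, and test sets."""
    indices = list(range(n_labeled))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

# Hypothetical dataset with 10,000 labeled pixels.
train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 200 50 9750
```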
For each dataset, firstly, the input HSI is normalized into [−0.5, 0.5]. Then, the neighborhood of each pixel to be classified is set to 32 × 32 as the input of the model. Since the numbers of samples differ, the batch size varies across the datasets: it is set to 256 for the Salinas dataset, 128 for the Pavia dataset, and 100 for the Indian Pines dataset. For the proposed Modified-MLP, the numbers of training epochs are 500, 150, and 900 for the Salinas, Pavia, and Indian Pines datasets, respectively. The number of training epochs is determined by observing the curve of the loss function on the validation samples and is fixed at the point where the learning curve becomes stable. To clearly display the training process, the Modified-MLP is taken as an example: its loss and accuracy curves for training and test on all the datasets are shown in Figures 7-9. In addition, for the proposed Multiscale-MLP, Soft-MLP, and Soft-MLP-L, the numbers of epochs are all set to 300, 300, and 200 for the Salinas, Pavia, and Indian Pines datasets, respectively. In order to update the training parameters rapidly, the Adam optimizer [47] and a decaying learning rate are utilized; specifically, the initial learning rates are set to $1 \times 10^{-4}$, $8 \times 10^{-4}$, and $2 \times 10^{-4}$ for the Salinas, Pavia, and Indian Pines datasets, respectively. For the proposed Modified-MLP, the decay rate applied every 300 epochs is set to 1/3. For the Soft-MLP, the learning rate is $8 \times 10^{-5}$, $8 \times 10^{-5}$, and $6 \times 10^{-5}$ for the Salinas, Pavia, and Indian Pines datasets, respectively. To measure the performance of all the methods, the overall accuracy (OA), average accuracy (AA), and kappa coefficient (K) are used as the evaluation indexes.
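For reference, the three evaluation indexes can be computed from a confusion matrix as in the following sketch, which uses the standard definitions of OA, AA, and Cohen's kappa. The function name and the toy labels are illustrative, and the sketch assumes every class appears at least once in the ground truth.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return overall accuracy, average accuracy, and the kappa coefficient."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                        # rows: true class, columns: prediction
    n = cm.sum()
    oa = np.trace(cm) / n                    # fraction of correctly classified samples
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()  # mean of the per-class accuracies
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=3)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.75 0.75 0.6
```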

Parameter Analysis
In order to find the optimal architecture, ablations are run on the main parameters, including the patch size, the hidden dimension, and the number of blocks. The selection of these parameters plays an essential role in the model size and complexity of the proposed Modified-MLP and therefore deserves a dedicated discussion.
The first parameter is the patch size. To capture the relationships between different patches, the input (e.g., 32 × 32 × nBand) of the proposed Modified-MLP is split into fixed-size patches (e.g., 4 × 4 × nBand, 8 × 8 × nBand, and 16 × 16 × nBand) for all the datasets, which represent spectral-spatial information at different positions of the input HSI. Meanwhile, the dimension and number of layers of the proposed Modified-MLP are kept the same. The second parameter is the number of blocks of the Modified-MLP, which controls the depth of the model and is called depth for short. The depth of the Modified-MLP is varied from three to five to find the appropriate value. The third main parameter is the dimension of the proposed Modified-MLP, which indicates the hidden dimension in the MLP. As the model dimension increases, the model encounters the overfitting problem more easily; thus, different dimensions (i.e., 64, 128, and 256) are tested to search for the optimal value for HSI classification.
The performance of the proposed Modified-MLP with different parameters and 200 training samples is shown in Figures 10-12 for the Salinas, Pavia, and Indian Pines datasets, respectively. In the experiments, we use the control variable method for all the datasets. When the model depth is varied, the patch size and the dimension of the model are fixed at 4 and 128, respectively. When the patch size is varied, the model depth and the dimension of the model are fixed at 3 and 128, respectively. Similarly, when the dimension of the model is varied, the patch size and the model depth are fixed at 4 and 3, respectively. As shown in Figures 10-12, for the Salinas and Pavia datasets, the accuracies decrease as the patch size increases. In terms of the depth of the proposed Modified-MLP, for the Pavia and Indian Pines datasets, a larger depth brings bigger improvements. For the Salinas dataset, the Modified-MLP achieves the best classification accuracy when the depth is three. Because there are more bands in the Salinas dataset than in the other datasets, increasing the depth results in more parameters in the MLP, which makes overfitting more likely. In addition, when the width of the hidden layer is scaled, the Modified-MLP with 128 dimensions reaches the highest OA, AA, and K. These results suggest that the proposed Modified-MLP obtains the highest accuracies with five layers, dimension = 128, and a patch size of 4 on the Pavia and Indian Pines datasets, and with three layers, dimension = 128, and a patch size of 4 on the Salinas dataset.
As the spatial window size varies, the accuracy exhibits a peak, and the best window size depends on the dataset. The proposed Multiscale-MLP achieves the best performance with the 28 × 28 window size on the Salinas dataset and with the 32 × 32 window size on the Pavia and Indian Pines datasets. As the spatial window size increases, more pixels are used to extract spectral-spatial features; thus, the OA of the proposed Multiscale-MLP increases as well. However, the accuracies begin to decline once the window size reaches a certain value, because more heterogeneous pixels confuse the feature extraction. In the experiments, all the proposed methods are evaluated with the 32 × 32 window size to maintain the same setting.

Comparison of the Proposed Methods with the State-of-the-Art Methods
In the experiments, two classical methods, RBF-SVM [14] and EMP-SVM [48], and four state-of-the-art methods, the CNN [49], SSRN [31], VGG [50], and HybridSN [51], are considered for comparison. For the radial basis function (RBF)-SVM, C and γ are the key parameters, and their optimal values are searched by the grid search method in the range of $\{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$. In addition, for the extended morphological profiles (EMP)-SVM, the morphological opening and closing operations are applied in order to extract the spatial information of the HSI. We adopt a disk-shaped structuring element whose size increases from two to eight. Then, the generated features are used as the input of the RBF-SVM to finish the final classification task.
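The grid over C and γ contains 7 × 7 = 49 candidate pairs. The sketch below enumerates that grid and picks the best pair according to a scoring function; `toy_score` is a hypothetical stand-in for the validation accuracy of an RBF-SVM and is not taken from the experiments.

```python
import math
from itertools import product

# Candidate values 10^-3, 10^-2, ..., 10^3 for both C and gamma.
GRID = [10.0 ** e for e in range(-3, 4)]

def grid_search(score_fn, c_values=GRID, gamma_values=GRID):
    """Return the (C, gamma) pair with the highest score."""
    return max(product(c_values, gamma_values), key=lambda pair: score_fn(*pair))

def toy_score(C, gamma):
    # Hypothetical stand-in for validation accuracy; peaks at C = 10, gamma = 0.01.
    return -((math.log10(C) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2)

best_C, best_gamma = grid_search(toy_score)
print(len(GRID) ** 2, best_C, best_gamma)
```

In a real experiment, `toy_score` would be replaced by a function that trains the SVM with the given pair and returns its accuracy on the validation samples.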
The experimental performance of the proposed Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L for hyperspectral classification with 200 training samples is reported in Tables 4-6. Compared with the RBF-SVM and EMP-SVM, the proposed Modified-MLP achieves higher accuracies. For example, compared with the RBF-SVM, the proposed Modified-MLP has superior performance according to OA, AA, and K. Besides, compared to the EMP-SVM, OA, AA, and K on the Salinas dataset are increased by 4.33%, 3.4%, and 4.81%, respectively. In addition, the deep learning-based methods generally obtain better performance than the classical methods. The proposed Modified-MLP is also superior to the deep learning-based methods. Take the Salinas dataset as an example; compared to the CNN, the OA of the proposed Modified-MLP is increased by 3.52%. Besides, the proposed Modified-MLP improves the OA by 3.19% and 2.67% in comparison with the SSRN and VGG, respectively. The classification results on the Pavia and Indian Pines datasets show a similar situation. All the experimental results reveal the superiority of the proposed Modified-MLP.
Furthermore, different from the proposed Modified-MLP, the proposed Multiscale-MLP achieves 92.25%, 93.50%, and 91.33% in terms of OA, AA, and K on the Salinas dataset, respectively. The results demonstrate the superiority of the proposed Multiscale-MLP with multiscale patch sizes. In addition, the proposed Soft-MLP obtains higher results. For the Salinas dataset, the proposed Soft-MLP is 1.56% and 1.23% higher than the Modified-MLP and Multiscale-MLP, respectively. Besides, compared to the classical methods (i.e., RBF-SVM and EMP-SVM) and the deep learning-based methods (i.e., CNN, SSRN, VGG, and HybridSN), the proposed Soft-MLP shows obvious improvements. Specifically, compared to the RBF-SVM and EMP-SVM, the proposed Soft-MLP increases the OA by 10.39% and 5.89% on the Salinas dataset, 12.96% and 3.59% on the Pavia dataset, and 8.62% and 6.84% on the Indian Pines dataset, respectively. In addition, for the Indian Pines dataset, the proposed Soft-MLP increases the OA by about 2% compared to the CNN and VGG, and by nearly 5% compared to the SSRN and HybridSN. All the results indicate that the proposed Soft-MLP is an effective method. Likewise, the performance on the Pavia and Indian Pines datasets follows similar trends to that on the Salinas dataset. The results demonstrate that the proposed Soft-MLP, which applies the soft split operation by separating the input HSI sample into fixed patches with overlapping, is effective.
In addition, the proposed Soft-MLP-L has the best performance of all methods on all datasets. Specifically, the proposed Soft-MLP-L boosts the OA by 2.24% for the Salinas dataset, 1.83% for the Pavia dataset, and 2.13% for the Indian Pines dataset, compared to the proposed Modified-MLP with 200 training samples. The proposed Soft-MLP-L also exceeds HybridSN by 3.35%, 2.15%, and 6.23% on the Salinas, Pavia, and Indian Pines datasets, respectively. These results indicate that the proposed MLP-based methods are effective. In the future, we will explore more new ideas in deep learning methods to further improve the classification performance of the MLP-based method.

Comparison with Different Methods with Cross-Validation Strategy
Cross-validation is a popular strategy for model selection. It relies on a preliminary partitioning of the data into g subsamples, each of which can be selected in turn as the validation sample. In the experiments, fivefold cross-validation is used with a total of 200 labeled samples on the three datasets. The results are shown in Figure 16. The proposed MLP-based methods again yield the highest OA on the three datasets. Taking the Salinas dataset as an example, the proposed Soft-MLP improves the OA by 6.88%, 4.59%, 3.56%, and 2.89% over EMP-SVM, CNN, SSRN, and VGG, respectively. The results demonstrate that the proposed methods (i.e., Modified-MLP, Multiscale-MLP, and Soft-MLP) perform better than the comparison methods.

To analyze the proposed approaches and the state-of-the-art methods comprehensively, the computational cost of the different methods is reported in Table 7. All the experiments are conducted on a computer with a 2.4 GHz Intel Xeon Silver 4210R processor, 128 GB of DDR4 RAM, and an NVIDIA Tesla V100 graphics processing unit (GPU). In terms of running time, the proposed Soft-MLP consumes more time than the other deep learning-based methods on all the datasets, because each input of the model is split into several fixed patches with overlapping. The number of patches is therefore larger than in the other methods, which increases the input dimension of the proposed Soft-MLP. The training time of the proposed Modified-MLP is shorter than that of SSRN. Taking the Salinas dataset as an example, because Modified-MLP needs fewer epochs than SSRN, the training time is reduced by 77.55 s. Compared to CNN and VGG, Modified-MLP takes longer to train on the three datasets. In addition, the training and test times of the proposed Soft-MLP and Soft-MLP-L are longer than those of Modified-MLP, but these methods obtain better classification performance.

The number of floating-point operations (FLOPs) [52] of the different methods is also reported in Table 7. Compared to VGG and SSRN, the proposed Modified-MLP requires fewer FLOPs on the Salinas dataset, and it requires fewer FLOPs than VGG on the Pavia and Indian Pines datasets. The number of parameters of each approach is shown in the last column of Table 7. Compared to VGG, the proposed Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L have fewer parameters on all the datasets. Compared to CNN and SSRN, the proposed methods have more total parameters because their depth and width are larger, which increases both the FLOPs and the total parameters, but the accuracies of the proposed methods are the highest. Specifically, compared to Modified-MLP, Multiscale-MLP increases the FLOPs by 75.89 Mbytes, 43.12 Mbytes, and 62.26 Mbytes, and increases the total parameters by 8.45 Mbytes, 4.32 Mbytes, and 7.98 Mbytes on the Salinas, Pavia, and Indian Pines datasets, respectively. Because the proposed Multiscale-MLP captures the spectral-spatial information in patches with different scales and the proposed Soft-MLP transforms the input HSI sample in an overlapping style, both have larger FLOPs and total parameters than the proposed Modified-MLP. The proposed Soft-MLP-L is similar to Soft-MLP in terms of FLOPs and total parameters on the three datasets, so its accuracy gain comes at little additional computational cost, which makes Soft-MLP-L an effective method for HSI classification.
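
The running-time argument above (an overlapping split produces more patches per sample than a non-overlapping one) can be checked with a little arithmetic. The sketch below is only an illustration: the 32 × 32 window is one of the window sizes tested in this paper, but the patch size of 8 and the stride of 4 are assumed values, and the helper names are hypothetical.

```python
def patches_per_side(window: int, patch: int, stride: int) -> int:
    """Number of patch positions along one spatial side, without padding."""
    if stride <= 0 or patch <= 0 or patch > window:
        raise ValueError("need 0 < patch <= window and stride > 0")
    return (window - patch) // stride + 1


def patch_count(window: int, patch: int, stride: int) -> int:
    """Number of patches cut from a square window x window sample."""
    side = patches_per_side(window, patch, stride)
    return side * side


WINDOW = 32  # spatial window size (one of the sizes tested in the paper)
PATCH = 8    # assumed patch size, for illustration only

# Non-overlapping split: the stride equals the patch size.
hard_split = patch_count(WINDOW, PATCH, stride=PATCH)

# Overlapping split: assumed stride of half the patch size.
soft_split = patch_count(WINDOW, PATCH, stride=PATCH // 2)

print("patches without overlap:", hard_split)
print("patches with overlap:", soft_split)
```

Under these assumed sizes the overlapping split yields roughly three times as many patches, which is consistent with the longer training and test times reported for Soft-MLP.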

Experimental Summary
Overall, compared to the traditional methods (i.e., RBF-SVM and EMP-SVM) and deep learning-based approaches (i.e., CNN, VGG, SSRN, and HybridSN), all the proposed methods (i.e., Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L) improve the overall accuracy on the three public HSI datasets, and the corresponding classification maps also show that the proposed methods achieve competitive performance. In addition, the total parameters, FLOPs, and running time of the proposed methods are generally larger than those of the other methods, but the accuracies of the proposed methods are the highest. To some extent, the proposed methods based on well-designed MLPs have potential for HSI classification.

Conclusions
This study presented modified MLP-based methods for HSI classification and demonstrated that the proposed MLPs obtained good classification performance compared with state-of-the-art CNNs.
Specifically, compared to the traditional MLP architecture, the Modified-MLP was composed of spectral-spatial feature mapping and spectral-spatial information mixing with a new architecture (i.e., normalization layer, residual connections, and GELU operations). Modified-MLP learned long-range spectral-spatial feature interactions within and among different patches, which were useful for HSI classification. Moreover, the Multiscale-MLP was exploited to capture spectral-spatial information at multiple scales, which was conveyed by the patch-level input. Furthermore, the Soft-MLP was investigated to enhance the classification results by applying a soft split operation, which changed the fixed-size patches into variable-size patches with overlapping; therefore, each patch was correlated with its surrounding patches. Compared to the popular spectral-spatial feature extraction methods, the proposed MLP-based approaches (i.e., Modified-MLP, Multiscale-MLP, and Soft-MLP) led to better results with limited training samples. This study demonstrates that purely MLP-based HSI classification methods can obtain impressive results without convolution operations. Most importantly, this study opens a new window for further research on HSI classification by showing that well-designed MLPs can also obtain remarkable classification performance.
Although this is an early attempt at MLP-based HSI classification, the modified MLPs obtained better classification performance than traditional methods (i.e., RBF-SVM and EMP-SVM) and many recently proposed deep learning-based methods such as CNN, SSRN, VGG, and HybridSN.
Recently, much progress has been achieved in the machine learning community, and some of these advances, such as transfer learning, can be combined with MLP to further improve the performance of MLP-based HSI classification.
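
To make the ingredients named in these conclusions concrete (normalization layer, GELU, residual connection, and mixing across patches), the following is a minimal sketch of one possible mixing block in plain Python. The layer order, the shared two-layer MLP applied along the patch axis, the omission of learnable normalization parameters, and all function names are assumptions made for illustration; this is not the exact layer configuration of the Modified-MLP.

```python
import math


def gelu(x):
    """Gaussian error linear unit: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def mlp(vec, w1, b1, w2, b2):
    """Two linear layers with a GELU in between."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


def mixing_block(features, w1, b1, w2, b2):
    """Mix information across patches and add a residual connection.

    features: list of S patch feature vectors, each of length C.
    w1 has shape (hidden, S) and w2 has shape (S, hidden).
    """
    num_patches, num_channels = len(features), len(features[0])
    normed = [layer_norm(patch) for patch in features]
    out = [list(patch) for patch in features]  # start from the input (residual)
    for c in range(num_channels):
        column = [normed[s][c] for s in range(num_patches)]  # one channel, all patches
        mixed = mlp(column, w1, b1, w2, b2)
        for s in range(num_patches):
            out[s][c] += mixed[s]
    return out


# Tiny example: 2 patches, 3 channels, hidden width 4 (all values arbitrary).
feats = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w1 = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2], [0.05, 0.05]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.1, 0.1, 0.1, 0.1], [-0.1, 0.2, 0.0, 0.1]]
b2 = [0.0, 0.0]
mixed_feats = mixing_block(feats, w1, b1, w2, b2)
print(len(mixed_feats), len(mixed_feats[0]))
```

Because of the residual connection, the block returns its input unchanged when all weights are zero, which is one reason such blocks are easy to stack.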

Figure 1. The overview architecture of the proposed Modified-MLP for HSI classification.

Figure 2. The framework of the proposed Multiscale-MLP for HSI classification.

Figure 3. An example of the soft split operation in the Soft-MLP.

Figure 9. Curves on the Indian Pines dataset.

Figure 10. Test accuracy of key parameters on the Salinas dataset.

Figure 11. Test accuracy of key parameters on the Pavia dataset.

Figure 12. Test accuracy of key parameters on the Indian Pines dataset.

Furthermore, the proposed Multiscale-MLP uses different scales (i.e., p scales) to fully extract the multiscale spectral-spatial features of the HSI inputs. To validate the impact of input size, different spatial window sizes (i.e., 26 × 26, 28 × 28, 30 × 30, 32 × 32, 34 × 34, 36 × 36, and 38 × 38) are used in the experiment, and the results of the proposed Multiscale-MLP on the three datasets are shown in Figure 13.

Figure 13. Results of Multiscale-MLP with different window sizes.

In addition, to validate that the proposed methods are effective with different numbers of training samples, Figures 14 and 15 show the OA of different methods (i.e., EMP-SVM, CNN, SSRN, Modified-MLP, Multiscale-MLP, and Soft-MLP) on the three datasets with 150 and 300 training samples, respectively. Specifically, with only 150 training samples, the proposed Soft-MLP achieves the best performance on the three datasets, reaching 90.64%, 90.55%, and 85.13% on the Salinas, Pavia, and Indian Pines datasets, respectively. When 300 training samples are used, the proposed MLP-based methods also achieve better results than the comparison methods. In particular, on the Indian Pines dataset, the proposed Soft-MLP improves the classification accuracy by 5.23%, 4.4%, and 3.07% compared to EMP-SVM, CNN, and SSRN, respectively. All the results demonstrate that the proposed methods are effective.

Figure 14. Test accuracy (%) comparisons under different methods on the three datasets with 150 training samples.

Figure 15. Test accuracy (%) comparisons under different methods on the three datasets with 300 training samples.

Figure 16. Results of different methods with cross-validation.
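
The fivefold cross-validation behind Figure 16 partitions the labeled samples into g = 5 subsamples and uses each one in turn for validation. A minimal sketch of such a partition is given below. How the 200 labeled samples were actually assigned to folds is not stated in the text, so the random shuffling, the fixed seed, and the function name are assumptions.

```python
import random


def k_fold_indices(num_samples, num_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for each fold.

    Samples are shuffled once and dealt into num_folds subsamples;
    each subsample serves as the validation set exactly once.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 200 labeled samples, five folds, as in the experiments.
splits = list(k_fold_indices(200, num_folds=5))
for fold_id, (train, validation) in enumerate(splits):
    print(fold_id, len(train), len(validation))
```

In a full experiment, a model would be trained on each `train` split and evaluated on the matching `validation` split, and the resulting accuracies would then be averaged over the folds.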

3.6. Comparison of the Running Time and Computational Complexity of the Proposed Methods with the State-of-the-Art Methods

To facilitate visual comparison, the classification maps of the different approaches for all the datasets with 200 training samples are displayed in Figures 17-19. As can be seen from these figures, compared to the classical method EMP-SVM and the deep learning-based methods CNN, SSRN, and VGG, the proposed methods produce more precise classification results. More specifically, the classification map of EMP-SVM contains more noisy scattered points than those of the deep learning-based methods on the three datasets. Taking the Pavia dataset as an example, for the Bitumen class, shown in blue in Figure 17b-i, the map of the proposed Modified-MLP appears smoother. Furthermore, the metal sheets class, shown in green, contains more errors in the maps of the comparison methods than in those of the proposed Modified-MLP, Multiscale-MLP, and Soft-MLP. All the figures reveal that the proposed methods (i.e., Modified-MLP, Multiscale-MLP, Soft-MLP, and Soft-MLP-L) stand out from the other competitors.
(1) Instead of simply applying the traditional MLP to HSI classification, a modified MLP with a carefully designed architecture (i.e., normalization layer, residual connections, and Gaussian error linear unit operations) is investigated in a unified framework. The Modified-MLP not only extracts discriminative features independently through spectral-spatial feature mapping, but also captures interactions among patches through spectral-spatial information mixing for effective HSI classification.
(2) To obtain sufficient spectral-spatial information from each sample, a simple yet effective Multiscale-MLP is proposed to classify HSI at various scales. It divides an HSI sample into equal-sized patches without overlapping at each scale in order to aggregate multiple spectral-spatial interactions of different patches in the HSI sample.
(3) Another method, the flexible Soft-MLP, is proposed to overcome the limitations caused by the predefined patch size in MLP-based methods. Its soft split operation transforms a single HSI sample into overlapping patches of different orders and sizes. Therefore, the Soft-MLP can model the global spectral-spatial relations of different patches to further boost the HSI classification performance.
(4) Finally, label smoothing is combined with the proposed Soft-MLP for HSI classification, which leads to higher accuracies and indicates that the proposed MLP-based methods can be improved in the same way as CNN-based methods.
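
As an illustration of the label smoothing used in the Soft-MLP-L, the sketch below builds a smoothed target distribution and evaluates the cross-entropy against it. It uses the common uniform-smoothing form, in which a fraction epsilon of the probability mass is spread evenly over all K classes. The value epsilon = 0.1, the choice of 16 classes in the example, and the function names are assumptions, not settings taken from this paper.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return the smoothed target: (1 - eps) * one_hot + eps / K."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Example with 16 classes (assumed) and an over-confident prediction for class 3.
NUM_CLASSES = 16
logits = [0.0] * NUM_CLASSES
logits[3] = 10.0
hard_target = smooth_labels(3, NUM_CLASSES, epsilon=0.0)
soft_target = smooth_labels(3, NUM_CLASSES, epsilon=0.1)
print("loss with hard labels:", cross_entropy(logits, hard_target))
print("loss with smoothed labels:", cross_entropy(logits, soft_target))
```

With smoothed targets, an over-confident prediction is still penalized even when it is correct, which is the mechanism by which label smoothing is generally understood to mitigate overfitting.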

Table 2. Pavia University labeled sample counts.

Table 3. Indian Pines labeled sample counts.

Table 4. Classification results on the Salinas dataset.

Table 5. Classification results on the Pavia dataset.

Table 6. Classification results on the Indian Pines dataset.

Table 7. Time consumption and computational complexity on the three datasets.