SS-MLP: A Novel Spectral-Spatial MLP Architecture for Hyperspectral Image Classification

Convolutional neural networks (CNNs) are the go-to models for hyperspectral image (HSI) classification because of their excellent local contextual modeling ability, which benefits spatial and spectral feature extraction. However, the limited receptive field of CNNs poses challenges for modeling long-range dependencies. To address this issue, we introduce a novel classification framework which regards the input HSI as sequence data and is constructed exclusively with multilayer perceptrons (MLPs). Specifically, we propose a spectral-spatial MLP (SS-MLP) architecture, which uses matrix transposition and MLPs to achieve both spectral and spatial perception with a global receptive field, capturing long-range dependencies and extracting more discriminative spectral-spatial features. Four benchmark HSI datasets are used to evaluate the classification performance of the proposed SS-MLP. Experimental results show that our pure MLP-based architecture outperforms other state-of-the-art convolution-based models in terms of both classification performance and computational time. Compared with the SSSERN model, the average accuracy improvement of our approach is as high as 3.03%. We believe these results will foster additional research on simple yet effective MLP-based architectures for HSI classification.


Introduction
With the advance of hyperspectral imaging techniques, hyperspectral imagery (HSI) offers ever greater resolution in both the spatial and spectral dimensions [1,2]. Thanks to the abundant spectral bands (typically hundreds of narrow contiguous channels), finer-grained discrimination of ground objects becomes possible based on their subtle spectral differences. This outstanding characteristic promotes the wide application of HSI in many fields, such as precision agriculture [2], military defense [3], and environmental governance [4].
Classification is one of the major tasks in HSI processing; it aims at determining the land-cover class of each pixel. A simple and intuitive way to classify HSI is to feed hyperspectral pixels (high-dimensional vectors) directly into classifiers such as random forests (RFs) and support vector machines (SVMs) [5]. However, challenges such as spectral mixing, highly correlated spectral bands, and the complex nonlinear structure of hyperspectral data complicate the precise classification of HSI [6]. Additionally, high-spatial-resolution remote sensing HSI usually presents high diversity in content, and the representation ability of traditional hand-crafted features based on domain knowledge may not be sufficient to discriminate classes with subtle variations [7,8].
In recent years, extracting discriminative features from high-dimensional spectral signatures has achieved great success through deep learning [9], and HSI classification accuracy has improved markedly. For instance, Zhou et al. [10] combined stacked autoencoders (SAEs) with a local Fisher discriminant regularization to learn compact and discriminative feature mappings with high inter-class difference and intra-class aggregation. Mou et al. [11] and Hang et al. [12] proposed to treat spectral signatures as sequential data and employed recurrent neural networks (RNNs) to learn relationships between different spectral channels, e.g., spectral correlation and band-to-band variability.
The convolutional neural network (CNN) is among the most popular networks adopted for HSI classification, as it can capture contextual spatial information in an end-to-end and hierarchical manner [13,14]. Cao et al. [15] proposed a unified Bayesian framework in which a CNN coupled with Markov random fields is utilized to classify HSI. Liu et al. [16] proposed a content-guided CNN to reduce the misclassification of pixels, particularly those near cross-class regions. Jia et al. [17] proposed a 3D Gabor CNN in which the CNN kernels are replaced with 3D Gabor-modulated kernels to improve robustness against scale and orientation changes. In addition, some works proposed to integrate traditional spectral-spatial feature extraction methods with CNNs, to lessen the workload of the network and mitigate the overfitting problem. For example, Aptoula et al. [18] fed stacked attribute-filtered images into CNNs for spatial-spectral classification. Huang et al. [19] designed a dual-path siamese CNN to classify HSI, which uses both extended morphological profiles-based spatial information and raw pixel vector-based spectral information as inputs. Besides, considering that HSI is a 3D data cube, researchers proposed to use 3D CNNs to extract discriminative features. Paoletti et al. [20] employed a 3D CNN to take full advantage of the structural characteristics of hyperspectral data and used a border mirroring strategy to effectively process border regions. Sellami et al. [21] developed a 3D convolutional encoder-decoder architecture to extract spectral-spatial features from the most informative spectral bands, selected by an adaptive dimensionality reduction method. To reduce the model complexity of the 3D CNN, Roy et al. [22] proposed a hybrid model consisting of a 2D CNN and a 3D CNN. In addition, Wang et al. [23] decomposed the 3D convolution kernel into three small 1D convolution kernels to reduce the number of parameters, preventing the 3D CNN from suffering from the overfitting problem.
To further improve feature discrimination and HSI classification accuracy, some powerful deep networks have been developed. Li et al. [24] proposed a two-stream CNN architecture based on the squeeze-and-excitation concept, which can capture spectral, local spatial, and global spatial features simultaneously. Cao et al. [25] developed a novel residual network to promote the extraction of deep features, in which hybrid dilated convolutions are utilized to enlarge the convolution kernels' receptive field without increasing the computational complexity. Dong et al. [26] proposed a cooperative spectral-spatial attention dense network, which can emphasize salient spectral-spatial features with two cooperative attention modules. Zhang et al. [27] proposed a 3D multiscale dense network to take full advantage of features at different scales for HSI classification. In addition, capsule networks (CapsNets) [28], generative adversarial networks (GANs) [29], and graph convolutional networks (GCNs) [30] have also been applied to HSI classification and obtained competitive performance.
Recent studies motivate reconsidering the image classification process from a sequence-data perspective to capture long-range dependencies [31,32]. He et al. [32] proposed a multihead self-attention-based transformer for HSI classification, which can capture dependencies between any two pixels in an input region. Tolstikhin et al. [33] proposed the MLP-Mixer architecture, based exclusively on multilayer perceptrons (MLPs), which obtains a global receptive field by combining matrix transposition with token-mixing projections and can thus account for long-range dependencies. Subsequently, several MLP-based architectures [34,35] have been proposed. They demonstrated that neither convolutions nor self-attention is necessary for obtaining promising performance, and that a simpler MLP-based architecture can perform as well as state-of-the-art convolution-based models.
In this paper, inspired by the simple yet effective design of [33], we propose a pure MLP-based architecture, called the spectral-spatial MLP (SS-MLP), for high-performance HSI classification, which uses neither attention mechanisms nor convolutions. The SS-MLP has a very concise architecture in which matrix transposition and MLPs are utilized to achieve a global receptive field, encoding spatial and spectral information effectively. In addition to MLPs, standard architectural components such as normalization layers and skip connections are integrated into our model to achieve promising performance. Experimental results on four representative HSI datasets (University of Pavia, University of Houston, Indian Pines, and HYRANK) are impressive: the proposed SS-MLP obtains higher classification accuracies with fewer parameters than other state-of-the-art convolution-based models. Moreover, it is fast to execute.
The remainder of this paper consists of five sections. Section 2 briefly reviews the classic MLP architecture. Section 3 describes the proposed SS-MLP. Section 4 presents the experimental results, followed by a discussion in Section 5. Finally, Section 6 concludes this article.

MLP

Figure 1 shows a multilayer perceptron (MLP) architecture, which is made up of a series of fully connected layers [36]. As can be seen, there are three types of layers, namely, the input, output, and hidden layers. The data flows from the input layer to the output layer in a feed-forward fashion. Formally, for the l-th layer, let a^(l−1) denote the input; its output a^(l) can be calculated as follows:

a^(l) = δ(W^(l) a^(l−1) + b^(l))

where W^(l) and b^(l) are the weights and bias at layer l, and δ refers to the nonlinear activation function (e.g., sigmoid and rectified linear unit). Compared with convolution-based architectures, the MLP, with its global capacity, is better at capturing long-range dependencies [37], because each output node is connected to all input nodes. More recently, MLP-based architectures have become an appealing alternative to CNNs in computer vision [33-35,38,39]. For instance, Chen et al. [38] proposed an MLP-like architecture, CycleMLP, for dense prediction tasks (e.g., instance segmentation and object detection), which can deal with images of variable scales. Yu et al. [39] proposed a spatial-shift MLP architecture for image classification, where spatial shift operations enable communication between different spatial positions.
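As a concrete illustration of the layer computation above, a minimal NumPy sketch (toy sizes chosen for illustration only, with ReLU as the activation δ):

```python
import numpy as np

# One fully connected layer: a_l = delta(W_l @ a_{l-1} + b_l),
# where delta is the ReLU nonlinearity.
def mlp_layer(a_prev, W, b):
    return np.maximum(W @ a_prev + b, 0.0)

rng = np.random.default_rng(0)
a0 = rng.standard_normal(8)                      # input vector with 8 features
W1, b1 = rng.standard_normal((4, 8)), np.zeros(4)
a1 = mlp_layer(a0, W1, b1)                       # hidden activation, shape (4,)
```

Each of the four output nodes depends on all eight input nodes, which is the source of the MLP's global capacity.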

Figure 1. A classic MLP architecture with input, hidden, and output layers.
The main drawback of the MLP is that it usually involves a large number of parameters. Let n_l denote the number of nodes at layer l. The number of parameters within an MLP is the sum of the weights and biases between all adjacent layers, i.e., ∑_{l=0}^{L−2} (n_l n_{l+1} + n_{l+1}), where L denotes the number of layers. Therefore, for the task of HSI classification, the MLP is usually employed at the architecture tail to perform the final classification [40]. For instance, Yang et al. [41] implemented a deep CNN with a two-branch architecture for HSI classification, in which the low and mid layers are pretrained on other data sources and a two-layer MLP performs the final classification. Xu et al. [42] proposed a novel dual-channel residual network for classifying HSI with noisy labels, which employs a noise-robust loss function to enhance model robustness and utilizes a single-layer MLP for classification. To overcome this drawback, we adopt a weight sharing strategy in the proposed MLP-based architecture, which leads to significant memory savings and will be detailed in the following section.

Proposed SS-MLP

Figure 2 shows the architecture of the proposed SS-MLP, which takes a neighbor region (context) centered at the target pixel as input. Like recent transformer models, such as ViT [31] and HSI-BERT [32], the proposed SS-MLP processes HSI cubes as sequential data to encode the spatial information. The extracted region is flattened into a pixel sequence, which is then linearly projected into a new vector space by pixel embedding. The sequence of embedding vectors serves as input to the rest of the network. Several consecutive SS-MLP blocks, each consisting of one spatial MLP (SaMLP) and one spectral MLP (SeMLP), are used to learn discriminative spectral-spatial representations. Finally, the learned features are fed into a global average pooling layer followed by a single fully connected layer for label prediction.
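For concreteness, the fully connected parameter count discussed earlier can be computed in a few lines; the layer sizes below are hypothetical, chosen only for illustration:

```python
# Parameter count of an MLP: each pair of adjacent layers contributes
# n_l * n_{l+1} weights plus n_{l+1} biases.
def mlp_param_count(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical example: 200 spectral bands -> 128 -> 64 -> 16 classes.
n_params = mlp_param_count([200, 128, 64, 16])   # 35,024 parameters
```

Even this small fully connected stack already requires tens of thousands of parameters, which illustrates why weight sharing matters for the proposed architecture.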

Pixel Embedding
Let X ∈ R^{P×P×C} be the neighbor region of the target pixel, where P × P is the spatial size and C is the number of spectral bands. X is flattened into a pixel sequence in raster scan order [32,43]. We denote the obtained pixel sequence as X_p ∈ R^{N×C}, where N = P × P is the number of pixels.
Pixel embedding is employed to reduce the computational cost; it transforms the sequence of pixels (high-dimensional spectral vectors) into a vector space of smaller dimension, yielding X_e ∈ R^{N×D}, where D < C is a predefined dimension. It can be viewed as one layer of the whole network. Specifically, we use a trainable linear transformation to implement pixel embedding, which works independently and identically on each pixel and can be written as:

X_e = X_p W + b

where W ∈ R^{C×D} is a trainable weight matrix and b ∈ R^{D} is the bias term (broadcast over the N pixels).
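A minimal sketch of the pixel embedding, with random stand-ins for the trained W and b (the shapes follow the text: an 11 × 11 region gives N = 121, C = 103 matches the UP scene, and D = 24 is the embedding dimension used later):

```python
import numpy as np

# Pixel embedding: one trainable linear map applied identically to every
# pixel. W and b below are random stand-ins for learned parameters.
rng = np.random.default_rng(0)
N, C, D = 121, 103, 24
X_p = rng.standard_normal((N, C))        # flattened pixel sequence (N x C)
W = rng.standard_normal((C, D)) * 0.01   # trainable weight matrix (C x D)
b = np.zeros(D)                          # trainable bias term

X_e = X_p @ W + b                        # embedded sequence, shape (N, D)
```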

SS-MLP Block
After pixel embedding, the dimension-reduced pixel sequence, shaped as a "pixels × channels" (N × D) table, is directly fed into several SS-MLP blocks of identical architecture to learn spectral-spatial features.
The architecture of the SS-MLP block is shown in Figure 3. It simply contains two types of MLP: the spatial MLP (SaMLP) and the spectral MLP (SeMLP). The SaMLP acts on each channel independently: it takes an individual column of the table as input to capture representative spatial features. The SaMLP allows communication between pixels at different spatial locations, achieving a global receptive field over the region; in other words, each pixel is cognizant of every other pixel in the sequence. The SeMLP operates on each pixel independently, allowing communication between different channels: it takes an individual row of the table as input to extract discriminative spectral features. By integrating the SaMLP and SeMLP, discriminative spectral-spatial features can be extracted from HSI cubes. In addition, we adopt the skip connection mechanism of [44] to enhance information exchange between layers, which has been demonstrated to be an effective strategy for modern neural architecture design [45-47].

The SaMLP and SeMLP have a similar architecture; both consist of two fully connected layers and a non-linear activation, as shown in Figure 4. We adopt the Gaussian error linear unit (GELU) [48] as the activation function, which acts on each row of its input tensor independently.
GELU(x) = x Φ(x) = (x/2) [1 + erf(x/√2)]

where erf(x) = (2/√π) ∫_0^x e^{−t²} dt is the error function and Φ(·) is the cumulative distribution function of the standard Gaussian N(µ = 0, σ² = 1). In addition, the dropout regularization technique of [49] is used to prevent overfitting: a dropout layer with a dropout rate of 50% is added after each fully connected layer.
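With this definition, GELU can be evaluated directly from the standard library's error function; a minimal numerical check against Φ(1):

```python
import math

# Exact GELU via the error function: GELU(x) = x * Phi(x).
def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(0.0))            # 0.0
print(round(gelu(1.0), 4))  # 0.8413 (= Phi(1), the standard normal CDF at 1)
```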
Each SS-MLP block takes an input of the same size. For simplicity, we omit the block index and denote the input of each block as X̃ ∈ R^{N×D}. The SaMLP operates on the columns of X̃ (i.e., channels) and is shared across all columns, mapping R^N → R^N. Note that applying the same SaMLP to each row of the transposed input table X̃^T achieves the same result. The SeMLP operates on the rows of X̃ (i.e., pixels), mapping R^D → R^D. It is shared across all rows to provide the positional invariance property. Sharing the parameters of the SaMLP/SeMLP within each block leads to significant memory savings. In addition, since every output point is related to every input point, the SaMLP and SeMLP obtain a global receptive field in the spatial and spectral domains, capturing richer global context information. Mathematically, the computation process of the SS-MLP block can be written as:

Y_{*,m} = X̃_{*,m} + W_2 σ(W_1 LN(X̃)_{*,m}),  m = 1, ..., D
O_{n,*} = Y_{n,*} + W_4 σ(W_3 LN(Y)_{n,*}),  n = 1, ..., N

where m and n are the column and row indexes, respectively, σ is the GELU activation function, and O ∈ R^{N×D} is the output of the SS-MLP block. The intermediate matrix Y obtained by the SaMLP has the same dimensions as the input and output matrices X̃ and O. LN refers to the layer normalization of [50], which is applied to speed up the training of the model. W_1 ∈ R^{(N/2)×N} and W_2 ∈ R^{N×(N/2)} are the weights of the two fully connected layers in the SaMLP. W_3 ∈ R^{4D×D} and W_4 ∈ R^{D×4D} are the weights of the two fully connected layers in the SeMLP. The output of an SS-MLP block serves as the input of the next one, and so forth until the last block.
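A minimal NumPy sketch of one SS-MLP block following the equations above. The weights are random stand-ins for trained parameters, the tanh approximation of GELU is used for brevity, and the learned scale/shift of layer normalization is omitted:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Row-wise layer normalization (learned scale/shift omitted for brevity).
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def gelu(x):
    # Widely used tanh approximation of the GELU activation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def ss_mlp_block(X, W1, W2, W3, W4):
    # SaMLP: transpose so each channel (a length-N column of X) is mixed
    # across pixels; the same W1, W2 are shared by all D channels.
    Y = X + (gelu(layer_norm(X).T @ W1.T) @ W2.T).T
    # SeMLP: each pixel (a length-D row of Y) is mixed across channels;
    # the same W3, W4 are shared by all N pixels.
    return Y + gelu(layer_norm(Y) @ W3.T) @ W4.T

rng = np.random.default_rng(0)
N, D = 121, 24                                   # 11 x 11 region, 24 channels
X = rng.standard_normal((N, D))
W1 = rng.standard_normal((N // 2, N)) * 0.02     # SaMLP weights, (N/2 x N)
W2 = rng.standard_normal((N, N // 2)) * 0.02     # SaMLP weights, (N x N/2)
W3 = rng.standard_normal((4 * D, D)) * 0.02      # SeMLP weights, (4D x D)
W4 = rng.standard_normal((D, 4 * D)) * 0.02      # SeMLP weights, (D x 4D)
O = ss_mlp_block(X, W1, W2, W3, W4)              # output keeps the (N, D) shape
```

Note the weight sharing: W_1 through W_4 are the only parameters of the block, regardless of how many pixels or channels they are applied to.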

Classifying HSIs Using the Proposed SS-MLP
After processing by the last SS-MLP block, the extracted discriminative spectral-spatial features are vectorized into a 1D array using global average pooling and then fed into a single fully connected layer. Finally, a softmax function is attached for label prediction. Let f denote the feature vector that is fed into the softmax function. The conditional probability of each class can be calculated as:

P(y = i | f) = exp(f_i) / ∑_{j=1}^{L} exp(f_j),  i = 1, ..., L

where L denotes the number of ground-truth classes. The label of the target pixel is the class with the maximum probability.
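A minimal sketch of the prediction head described above, with random stand-ins for the pooled feature and the fully connected layer (L = 9 classes is only an example, matching the UP scene):

```python
import numpy as np

# Prediction head: pooled feature f -> fully connected layer -> softmax.
def predict(f, W_fc, b_fc):
    logits = W_fc @ f + b_fc
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs = e / e.sum()
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(0)
D, L = 24, 9                            # feature dimension, number of classes
f = rng.standard_normal(D)              # pooled spectral-spatial feature
W_fc, b_fc = rng.standard_normal((L, D)), np.zeros(L)
probs, label = predict(f, W_fc, b_fc)   # class probabilities and argmax label
```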
Our SS-MLP model relies only on matrix multiplications, scalar non-linearities, and changes to data layout (i.e., transpositions and reshapes). Since these operations are all differentiable, the proposed model can be optimized with standard algorithms. Specifically, the learnable parameters are optimized using the Adam optimizer for 100 epochs. The learning rate is initialized to 0.001 and gradually reduced to 0.0 following a half-cosine schedule. The batch size is fixed to 100 and the weight decay is set to 0.0001.
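The half-cosine schedule described above can be written as lr(t) = 0.5 · lr₀ · (1 + cos(πt/T)); a minimal sketch with the settings from the text (lr₀ = 0.001, T = 100 epochs):

```python
import math

# Half-cosine learning-rate schedule: decays from lr0 to 0 over T epochs.
def half_cosine_lr(epoch, total_epochs=100, lr0=1e-3):
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))

print(half_cosine_lr(0))     # 0.001 at the start
print(half_cosine_lr(100))   # 0.0 at the final epoch
```

In PyTorch, the equivalent behavior is typically obtained with a cosine annealing scheduler attached to the Adam optimizer.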

Datasets
To evaluate the effectiveness of our SS-MLP, we first conduct experiments on the University of Pavia (UP), University of Houston (UH), and Indian Pines (IP) hyperspectral benchmark datasets. To make the proposed SS-MLP fully comparable with other spectral-spatial classification approaches reported in the literature, we use the same fixed training and test sets that are adopted by other state-of-the-art methods [51-55]. In other words, the number of training and test samples and their spatial locations are exactly the same as those used in previous studies. Figures 5-7 depict the false-color image and the spatial distribution of the fixed training and test samples for the UP, UH, and IP datasets, respectively. Tables 1-3 list the class names and the numbers of training and test samples for the three datasets.

Evaluation Metrics
The overall accuracy (OA), average accuracy (AA), Kappa coefficient, and F1-score are used for quantitative analysis. To demonstrate the stability of our results, each experiment is conducted five times with different random seeds, and the mean and standard deviation of the scores are reported.
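A minimal sketch of how OA, AA, and the Kappa coefficient are computed from a confusion matrix (a toy two-class example; the exact evaluation code used in the experiments may differ):

```python
import numpy as np

# OA, AA, and Cohen's kappa from a confusion matrix cm[i, j]
# (rows: true class i, columns: predicted class j).
def classification_metrics(cm):
    n = cm.sum()
    oa = np.trace(cm) / n                                 # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))            # mean per-class recall
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

cm = np.array([[50, 0], [10, 40]])
oa, aa, kappa = classification_metrics(cm)                # 0.9, 0.9, 0.8
```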

Parameter Analysis
The model complexity of the SS-MLP is controlled by the network depth, i.e., the number of SS-MLP blocks, and the embedding dimension D. Considering that low complexity leads to underfitting and high complexity may result in the waste of computational resources and overfitting, we aim to find the smallest model depth and embedding dimension without incurring underfitting.
The OA of SS-MLP with different model depths is summarized in Table 4. As can be seen, the best OAs are achieved when the model depth is set to 3, 2, and 1 for the UP, UH, and IP datasets, respectively. Table 5 lists the OA of SS-MLP with different embedding dimensions. Note that when the embedding dimension is set to 24, the OA reaches its maximum value on all three datasets.

Among the compared methods, SSSERN [60] is a spatial-spectral squeeze-and-excitation residual network which extracts distinguishable features through spatial and spectral attention mechanisms, emphasizing meaningful features and suppressing unnecessary ones in the spatial and spectral domains simultaneously.
For the compared networks, their default parameter configurations are used. The training details of the compared methods are summarized in Appendix A. To make a fair comparison between the different approaches, the spatial size of the input 3D HSI patches is fixed to 11 × 11, following the setup of [56,59,60]. All the networks are implemented on the PyTorch platform using a personal computer with an RTX 2080 GPU.

Comparison Results
Tables 6-8 present the quantitative classification results for the UP, UH, and IP datasets, respectively. As can be seen, the proposed SS-MLP consistently provides superior performance in terms of the three overall indices (OA, AA, and Kappa) over the other methods on all three datasets.
With the IP dataset, our model shows 4.17%, 2.28%, 3.68%, 4.09%, and 4.65% improvements (in terms of OA) over DenseNet, FDMFN, MSRN, DPRN, and SSSERN, respectively. Note that the F1-score obtained by our SS-MLP is as high as 71.46%, which is 8.33 percentage points higher than that of SSSERN (63.13%). The reason for these remarkable improvements may be that the proposed SS-MLP, with its global receptive field, is able to reason over a longer context, which suits the IP scene with its large smooth regions (e.g., large areas of farmland). The UP and UH datasets contain more detailed regions, where local detail information is important; therefore, we obtain limited improvements on these two datasets.

Table 6. Classification accuracies (%) on the UP dataset. The input HSI patch size is fixed to 11 × 11 for different models. SS-MLP achieves a higher OA score while spending less time than the other compared methods. "M" and "s" indicate millions and seconds. The best results are highlighted in bold font.

Regarding computational complexity, we compare the number of parameters and runtimes of the different networks in Tables 6-8. For the UP and UH datasets, the differences in OA between the second-best model SSSERN and the proposed SS-MLP are 1.60 (94.63 ± 0.96 vs. 96.23 ± 0.51) and 0.87 (84.99 ± 0.45 vs. 85.86 ± 0.96), respectively. For the IP dataset, the difference between the second-best model FDMFN and our SS-MLP is 2.28 (66.37 ± 2.78 vs. 68.65 ± 0.65). Although our SS-MLP's improvement on the UH dataset is not very significant, it requires the fewest parameters and takes the shortest time to achieve satisfactory accuracy, which demonstrates the efficiency of our method.

DenseNet and DPRN have millions of parameters, which results in a high probability of overfitting. DenseNet has a deep architecture consisting of 22 inner convolution blocks, while DPRN uses larger convolution kernels (i.e., 7 × 7 instead of the widely used 3 × 3) to increase the receptive field. In addition, during feature extraction, DenseNet adopts pooling operations to reduce data variance and computational complexity. However, the spatial resolution of the learned feature maps is also reduced, resulting in a loss of detail information; this matters because HSI classification models (e.g., DenseNet) usually take image patches of small spatial size (e.g., 11 × 11) as input. Due to the loss of spatial detail and the high probability of overfitting, DenseNet performs relatively poorly on the three datasets. FDMFN and MSRN can utilize contextual information at different scales for classification, achieving satisfactory performance. When learning spectral-spatial features, SSSERN keeps the spatial size of the input hyperspectral data fixed to avoid spatial information loss. In addition, it uses spectral attention modules to emphasize bands that are useful for classification and suppress useless ones. Moreover, SSSERN utilizes spatial attention modules to emphasize pixels that are useful for classification (i.e., highlighting pixels from the same class as the center pixel) and suppress useless pixels. In this way, SSSERN is able to extract discriminative spectral-spatial features from the HSI cubes and achieves promising classification performance. However, the CNN-based models (i.e., DenseNet, FDMFN, MSRN, DPRN, and SSSERN) have a limited receptive field, which makes the learned features focus more on local information and may result in misclassifications inside objects. The proposed SS-MLP, in contrast, is constructed from MLPs with a global receptive field, which can capture long-range dependencies.
In addition, to avoid the loss of detail information, we do not use any downsampling operation during the feature extraction phase. Moreover, thanks to the weight sharing strategy, our SS-MLP architecture is lightweight, which alleviates the overfitting problem and suits HSI classification tasks with limited training samples. Therefore, our model achieves competitive performance in comparison with the other methods.

Table 7. Classification accuracies (%) on the UH dataset. The input HSI patch size is fixed to 11 × 11 for different models. SS-MLP achieves higher overall accuracies while using fewer parameters than the other compared methods. "M" and "s" indicate millions and seconds. The best results are highlighted in bold font.

Figures 8-10 provide the classification maps generated by the different approaches on the three datasets. As can be observed, the SS-MLP produces well-defined classification maps in terms of border delineation. For the UH dataset, the classification map obtained by our SS-MLP is better aligned with ground object boundaries, particularly for the "Railway" class. Figure 11 shows the classification maps for the "Railway" class obtained by the different methods. As can be observed, DenseNet, FDMFN, MSRN, and DPRN misidentify parts of the middle area of "Railway" as "Parking Lot 2" (denoted by blue color), while SSSERN misidentifies parts of the middle area of "Railway" as "Road". The reason for these misclassifications may be that these five convolution-based methods have a limited receptive field and thus focus more on local information, resulting in misclassifications inside large-scale objects. In contrast, the proposed SS-MLP, with a global receptive field, can capture long-range spatial interactions and is better at classifying objects from a global perspective. That may be why the proposed SS-MLP achieves better classification performance on the "Railway" class.

In addition, for the IP dataset, the classification accuracies obtained by our method are similar to those achieved by SSSERN in most categories. However, our SS-MLP achieves significant improvements over SSSERN in the "Soybean-clean" category (58.98 ± 9.36 vs. 20.89 ± 6.41). The "Soybean-clean", "Soybean-notill", and "Soybean-mintill" categories are similar, which makes accurate separation difficult. In the "Soybean-clean" category, all methods obtain poor accuracy (lower than 60%). However, the proposed SS-MLP achieves better classification performance than the other methods, possibly because global information is important for accurately classifying this category with its large-scale areas. Without the help of a global receptive field, pixels inside large objects are often mistaken for other objects with high similarity. Figure 12 shows the features learned by the SSSERN model and the proposed SS-MLP; the final spectral-spatial features extracted before global average pooling are displayed. As can be seen, our SS-MLP tends to focus on pixels in different areas of the input HSI patch and hence can reason over an enlarged spatial range and from a global perspective, whereas SSSERN pays more attention to local information. Therefore, we hypothesize that the success in detecting this category arises from the SS-MLP's global receptive field.

Ablation Analysis of the Proposed SS-MLP
The proposed SS-MLP uses the skip connection mechanism, layer normalization, and 50% dropout regularization to improve the training process. To quantify the contribution of each component, we construct a baseline network by eliminating skip connections, layer normalization, and dropout regularization from the SS-MLP. As can be seen from Table 9, the baseline network obtains poor classification performance: the OA scores obtained by the baseline model are 56.61%, 79.92%, and 62.60% on the UP, UH, and IP datasets, respectively.

Table 9. Ablation analysis of SS-MLP for understanding the contribution of different components of the architecture, including the skip connection, layer normalization, and dropout regularization. All three components contribute positively to the classification performance. The best results are highlighted in bold font.

To improve the classification performance of the baseline network, the skip connection mechanism is first introduced, which can enhance information exchange between layers and reduce training difficulty [44]. As can be observed from Table 9, the OA improvements obtained by utilizing the skip connection mechanism are 33.73%, 2.23%, and 0.55% on the three datasets, which demonstrates that improving the information flow enhances HSI classification accuracy. In addition, we adopt layer normalization [50] to reduce internal covariate shift during network training, which can speed up the training phase and benefit generalization. As can be seen, the increases in OA obtained by combining layer normalization with the skip connection are 37.52% on the UP dataset, 4.05% on the UH dataset, and 4.25% on the IP dataset, which demonstrates that layer normalization also plays a positive role in improving classification accuracy. Finally, dropout regularization is used to improve the training process. During training, it randomly deactivates a percentage of neurons, that is, it sets the output of each neuron to zero with a given probability. By dropping neurons randomly, diverse sub-networks are formed in different training epochs, which reduces the co-adaptation of hidden units and forces the network to learn more robust features [49]. The most commonly used dropout rate is 50%. From Table 9, we find that the network with dropout regularization achieves better performance on all three datasets, suggesting that dropout regularization is beneficial to the classification performance.
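The dropout mechanism described above can be sketched as inverted dropout, the variant commonly used in practice (the input below is a toy vector of ones):

```python
import numpy as np

# Inverted dropout: during training, zero each activation with probability
# `rate` and rescale the survivors by 1/(1 - rate) so the expected
# activation is unchanged; at test time the layer is an identity.
def dropout(a, rate=0.5, rng=None, training=True):
    if not training or rate == 0.0:
        return a
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

rng = np.random.default_rng(0)
out = dropout(np.ones(10), rate=0.5, rng=rng)
# surviving entries are scaled to 2.0, the rest are zeroed
```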
To sum up, each of the three techniques yields a measurable improvement in performance, and the SS-MLP performs best on all three HSI datasets when all of them are used. Compared with the baseline network, the OA improvements achieved by our SS-MLP are as high as 39.62% on the UP dataset, 5.94% on the UH dataset, and 6.05% on the IP dataset, which demonstrates that the SS-MLP architecture designed in this article is effective for HSI classification.

Impact of SeMLP and SaMLP
Considering that each pixel in hyperspectral imagery covers a spatial region on the surface of the Earth, hyperspectral pixels tend to have mixed spectral signatures. The presence of mixed pixels and environmental interferences such as atmospheric and geometric distortions often leads to two problems: (1) spectral signatures belonging to the same land-cover type may be different; (2) spectral signatures belonging to different classes may be similar. Therefore, methods that focus only on spectral information cannot provide satisfactory classification accuracy. By exploiting spatial contextual information such as textures, geometrical structures, and neighboring relationships, spectral-spatial methods have proven to be an effective way to reduce classification uncertainty and increase classification accuracy.
In this paper, the SeMLP is used to learn discriminative spectral features, and the SaMLP that can capture relationships between any two pixels in an input region is used to extract informative global spatial features. To demonstrate the effectiveness of the integration of SaMLP and SeMLP in our SS-MLP, we also test the networks that only consist of the SaMLPs and the ones that only contain SeMLPs.
Since the spectral representations learned by the SeMLP are complementary to the spatial features learned by the SaMLP, the proposed SS-MLP with both SeMLP and SaMLP consistently obtains higher OA values than the networks with only the SeMLP or only the SaMLP, as can be seen from Table 10. On the UP dataset, the OA of our SS-MLP is 96.23%, which is 8.57% and 2.98% higher than the OA obtained by the network without the SaMLP and the one without the SeMLP, respectively. For the UH dataset, combining the SaMLP and SeMLP increases the OA by 0.56% and 2.05% compared with the networks without the SaMLP and without the SeMLP, respectively. As for the IP dataset, removing the SaMLP and the SeMLP results in a 5.50% and 2.06% decrease in OA, respectively. These results demonstrate the importance of both the SaMLP and the SeMLP in SS-MLP.

Impact of Activation Function
For the proposed SS-MLP model, we adopt the Gaussian error linear unit (GELU) [48] instead of the widely used rectified linear unit (ReLU) as the activation function, since GELU slightly improves the classification performance of SS-MLP on the UP and UH datasets, as can be seen from Table 11. Figure 13 presents the learning curves of our SS-MLP, including the training and validation loss and accuracy for all three datasets. Here, 10% of the samples per class are randomly selected from the training set as validation samples, and the remaining 90% are used for network training. Note that in this paper we follow the widely adopted training protocol and set the number of training epochs to 100. However, as Figure 13 shows, our SS-MLP converges within about 50 epochs, which means that the time cost of our model can be further reduced by using fewer training epochs.

Figure 13. Learning curves of the proposed SS-MLP on the (a) UP, (b) UH, and (c) IP datasets. As can be observed, our model converges quickly, reaching a stable minimum within as few as 50 epochs on all three datasets.
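The difference between the two activations can be seen from their definitions: ReLU hard-gates its input at zero, while GELU weights the input by the Gaussian CDF, giving a smooth, non-monotonic curve near zero. A small NumPy sketch (using the common tanh approximation of GELU) illustrates this:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU; the exact form is x * Phi(x),
    # where Phi is the standard Gaussian CDF
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                 # negatives are zeroed out exactly
print(np.round(gelu(x), 4))    # negatives are attenuated smoothly, not clipped
```

Unlike ReLU, GELU passes small negative values through with reduced magnitude, which can yield smoother optimization; the accuracy gain reported in Table 11 is small but consistent on UP and UH.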

Analysis of General Applicability
In this section, we further investigate the general applicability and performance of the SS-MLP on the recently released HYRANK hyperspectral benchmark dataset. The HYRANK dataset contains five hyperspectral scenes: Dioni, Loukia, Erato, Nefeli, and Kirki, where ground reference maps are available for the Dioni and Loukia scenes. Researchers usually use the Dioni scene as the training set and the Loukia scene as the test set. Both the Dioni and Loukia scenes were acquired by the Hyperion sensor on the Earth Observing-1 satellite, with 176 spectral bands and a GSD of 30 m. The spatial size of the Dioni scene is 250 × 1376, and that of the Loukia scene is 249 × 945. Both scenes contain seven land-cover classes: Dense Urban Fabric, Non-Irrigated Arable Land, Olive Groves, Dense Sclerophyllous Vegetation, Sparse Sclerophyllous Vegetation, Sparsely Vegetated Areas, and Water. The HYRANK benchmark is challenging since its training and test sets are spatially disjoint. Besides, due to the limited spatial resolution (30 m), the highly mixed pixels also pose a great challenge to accurate classification of land-cover types. From Table 12, it can be seen that the proposed SS-MLP still obtains improved performance compared with other methods. In comparison with the second-best model (DPRN), our SS-MLP improves the OA by 1.08%, using approximately 53× fewer parameters. The HYRANK dataset and the classification maps obtained by different methods are displayed in Appendix B. The experimental outcomes on the four benchmark datasets demonstrate the effectiveness of the proposed SS-MLP. It should be noted that, owing to the weight sharing strategy, the number of parameters required by our model is considerably smaller than that needed by other deep CNN models. Taking the UH dataset as an example, DenseNet (1.66 M) and DPRN (1.98 M) require millions of parameters, while our SS-MLP only needs 40 K parameters.
In addition, although the SSSERN and the proposed model obtain similar classification accuracies on the UH dataset, our SS-MLP needs 4× fewer parameters and is approximately 2× faster. These results demonstrate that the proposed SS-MLP achieves competitive performance compared with the state-of-the-art methods, while requiring fewer parameters to be adjusted and less running time.
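The parameter economy quoted above can be sanity-checked with simple arithmetic: a two-layer MLP with matching input and output width has two weight matrices plus two bias vectors, and with weight sharing the per-block count does not grow with depth. The dimensions below (a 7 × 7 patch, 103 bands as in UP, hidden width 128, depth 4) are illustrative assumptions chosen to show the order of magnitude, not the paper's exact configuration.

```python
def mlp_params(d_in, d_hidden):
    # two-layer MLP d_in -> d_hidden -> d_in: two weight matrices + two biases
    return d_in * d_hidden + d_hidden + d_hidden * d_in + d_in

S, B, H = 49, 103, 128          # hypothetical: 7x7 patch, 103 bands, hidden width 128
per_block = mlp_params(B, H) + mlp_params(S, H)  # one SeMLP + one SaMLP
depth = 4

print(per_block)                # parameters of a single SS-MLP block
print(per_block * depth)        # without weight sharing, cost scales with depth
print(per_block)                # with weight sharing, cost is independent of depth
```

Under these assumed dimensions a shared block costs roughly 39 K parameters, i.e., tens of kilo-parameters rather than the millions required by deep CNNs such as DenseNet or DPRN.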
Our SS-MLP uses matrix transposition and MLPs to achieve both spectral and spatial perception within a global receptive field. However, the local features captured by CNNs with local receptive fields are important for distinguishing small-scale objects. Therefore, how to effectively embed local information in our SS-MLP architecture requires further investigation.

Conclusions
In this article, a novel deep learning architecture based entirely on MLPs is presented for HSI classification. The proposed SS-MLP uses two consecutive MLPs, i.e., SaMLP and SeMLP, to learn spatial and spectral representations in a global receptive field. These two types of MLPs are interleaved to enable information interaction between the spectral and spatial domains. Furthermore, weight sharing within the SS-MLP block significantly reduces memory consumption. Experiments conducted on four benchmark HSI datasets demonstrate that the proposed SS-MLP yields competitive results with fewer parameters compared with several state-of-the-art approaches.
In the future, we will conduct additional experiments to investigate the general applicability and performance of the SS-MLP across many different HSI datasets. In addition, we will consider integrating band selection with the proposed SS-MLP, so as to suppress useless bands and emphasize informative ones for efficient HSI classification.

Appendix A
To reproduce the results of the compared methods, we use the training protocols reported in the corresponding references. Table A1 summarizes the training details of the compared approaches. DenseNet [56] and SSSERN [60] follow the widely adopted training protocol: training lasts for 100 epochs, using the Adam optimizer with a batch size of 100 samples and a learning rate of 0.001. Note that, in accordance with [57], a half-cosine learning rate schedule is adopted for FDMFN, starting from 0.001 and gradually decaying to 0. As for DPRN, the learning rate is set to 0.1 for epochs 1 to 149 and to 0.01 for epochs 150 to 200, following the setup in [59]. Besides, it should be noted that MSRN's learning rate is set to 0.001 instead of the default 0.01 [58], because we found that MSRN achieves better classification performance with a smaller learning rate.
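The half-cosine learning rate schedule mentioned for FDMFN can be written in a few lines: the rate starts at its maximum and follows half of a cosine period down to zero at the final epoch. This sketch assumes a 100-epoch run and a peak rate of 0.001, matching the protocol described above.

```python
import math

def half_cosine_lr(epoch, total_epochs, lr_max=0.001):
    # anneal from lr_max at epoch 0 down to 0 at the final epoch,
    # following half a cosine period
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * epoch / total_epochs))

total = 100
print(half_cosine_lr(0, total))    # 0.001 at the start
print(half_cosine_lr(50, total))   # 0.0005 halfway through
print(half_cosine_lr(100, total))  # decays to ~0 at the end
```

Compared with the step schedule used for DPRN (0.1 for epochs 1-149, then 0.01), the cosine schedule decays smoothly and needs no hand-picked milestone epochs.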