Spatial-Adaptive Siamese Residual Network for Multi-/Hyperspectral Classiﬁcation

: Deep learning methods have been successfully applied for multispectral and hyperspectral images classiﬁcation due to their ability to extract hierarchical abstract features. However, the performance of these methods relies heavily on large-scale training samples. In this paper, we propose a three-dimensional spatial-adaptive Siamese residual network (3D-SaSiResNet) that requires fewer samples and still enhances the performance. The proposed method consists of two main steps: construction of 3D spatial-adaptive patches and Siamese residual network for multiband images classiﬁcation. In the ﬁrst step, the spectral dimension of the original multiband images is reduced by a stacked autoencoder and superpixels of each band are obtained by the simple linear iterative clustering (SLIC) method. Superpixels of the original multiband image can be ﬁnally generated by majority voting. Subsequently, the 3D spatial-adaptive patch of each pixel is extracted from the original multiband image by reference to the previously generated superpixels. In the second step, a Siamese network composed of two 3D residual networks is designed to extract discriminative features for classiﬁcation and we train the 3D-SaSiResNet by pairwise inputting the training samples into the networks. The testing samples are then fed into the trained 3D-SaSiResNet and the learned features of the testing samples are classiﬁed by the nearest neighbor classiﬁer. Experimental results on three multiband image datasets show the feasibility of the proposed method in enhancing classiﬁcation performance even with limited training samples.


Introduction
Remote sensing [1][2][3][4] has played an important role in Earth observation problems. The ever-growing number of multispectral and hyperspectral images acquired by satellite sensors facilitates a deeper understanding of the Earth's environment and human activities. Rapid advances in remote sensing technology and computing power have contributed to the application of multiband images in environmental monitoring [5], land-cover mapping [6,7], and anomaly detection [8]. Classification, which labels each pixel with a specific land-cover type, is one of the fundamental tasks in remote sensing analysis [9][10][11]. A large number of labeled training data are usually required to achieve satisfactory classification performance. However, it is time-consuming and expensive to label the remote sensing samples in practice.
In the remote sensing community, researchers have proposed various supervised approaches to classify the multiband images. Existing methods can be roughly divided into two categories. The first category is spectral classification methods, which classify each pixel independently by utilizing the spectral features. State-of-the-art methods include the support vector machine (SVM) [12] and its variations [13,14]. Although the spectral methods have proven to be useful for multiband images classification, they fail to fully explore the spatial information, and therefore result in salt and pepper-like errors. The second category is spectral-spatial classification methods, which use both spectral and spatial information for more accurate classification performance. In this category, additional spatial features extracted by diverse strategies are integrated into the pixel-wise classifiers. Widely-used spatial extraction methods include the morphological profiles [15], attribute profiles [16], gray-level cooccurrence matrix [17], Gabor transform [18], and wavelet transform [19]. Image segmentation can also be applied to obtain the spatial neighbors. For instance, superpixel segmentation groups pixels of multiband images into perceptually meaningful atomic regions which are more flexible than the rigid square patches [20,21]. Many classifiers are also proposed for spectral-spatial classification, such as the joint sparsity model (JSM) [22], multiple kernel learning (MKL) [23], and Markov random field (MRF) modeling [24]. Moreover, note that the remote sensing images can be modeled as three-dimensional (3D) cubes with one spectral dimension and two spatial dimensions. Researchers have developed some spectral-spatial methods under the umbrella of tensor theory, such as the tensor discriminative locality alignment (TDLA) method [25], tensor block-sparsity based representation [26], and local tensor discriminative analysis technique (LTDA) [27]. Spectral-spatial methods have proved to be more effective than the spectral methods as they can exploit the object shape, size, and textural features from the multiband images.
However, most of the above-mentioned methods cannot classify the multiband images in a "deep" fashion. In recent years, deep learning [28][29][30][31] has attracted extensive attention with the rapid development of artificial intelligence. Deep learning can hierarchically learn the high-level features, and therefore has achieved tremendous success in remote sensing classification. Typical deep architectures include the stacked autoencoder (SAE) [32], deep belief network (DBN) [33], residual network (ResNet) [34] ,and generative adversarial network (GAN) [35]. Among all deep learning-based methods being developed over the past few years, convolutional neural networks (CNNs) have become a preferable model in the current trend of multiband images analysis due to the fact that convolutional filters are powerful tools to automatically extract the spectral-spatial features. For instance, a one-dimensional CNN (1D-CNN) method was proposed in [36] for hyperspectral classification in the spectral domain. A spectral-spatial classification method was proposed in [37] by combining the principle component analysis (PCA), two-dimensional CNN (2D-CNN), and logistic regression. [38] improved the performance of remote sensing image scene classification by imposing a metric learning regularization term on the CNN features. The 3D-CNN was applied in [39] to learn the local signal changes in spectral-spatial dimension of the 3D cubes, while [40] explored the performance of deep learning architectures for remote sensing classification and introduced a 3D deep architecture by applying 3D convolution operations instead of using 1D convolution operations. In [41], residual blocks connected every 3D convolutional layer by identity mapping to gain improved classification performance. Moreover, the Siamese networks [42] composed of 2D-CNN/3D-CNN or 2D-ResNet/3D-ResNet [43][44][45][46] were also proposed for hyperspectral classification. The Siamese architecture differs from other deep learning-based methods in that it is directly trained to separate the similar pairs from the dissimilar in feature space. By simultaneously learning the similarity metrics, the Siamese networks are effective for distinguishing different land-covers in remote sensing data. However, all those Siamese networks performed pixel-based classification of hyperspectral data by choosing the neighborhoods of a sample as the original features. Although CNN and its variations have been successfully employed in classification of multiband images, training, or testing samples are usually fixed at the center of the neighborhood windows and the neighboring samples in each patch are assumed to be consisted of similar materials. However, in heterogeneous regions, especially around the boundaries, neighboring samples often belong to different classes, which go against the basic assumption of the deep learning-based methods.
In this paper, we propose a 3D spatial-adaptive Siamese residual network (3D-SaSiResNet) for classification of multiband images. The main steps of the proposed method are twofold, i.e., construction of 3D spatial-adaptive patches and Siamese residual network for multiband images classification. In the first step, we segment the multiband images into spatial-adaptive patches instead of choosing the patches centered on training/testing samples. In the second step, we design a 3D-SaSiResNet to extract high-level features for classification. The 3D-SaSiResNet is composed of two 3D ResNet channels with shared weights and the training samples are utilized to train the network in pairs. The testing samples are then fed into the trained Siamese network to obtain the learned features, which can be classified by the nearest neighbor classifier. The two steps are closely connected, i.e., in the first step, we generate 3D spatial-adaptive patches, which can be subsequently classified by the 3D-SaSiResNet built in the second step.
The main contributions of this work are listed below: • We obtain 3D spatial-adaptive patches of the multiband images by band reduction and superpixel segmentation. Instead of fixing the samples at the center of the patches, the 3D spatial-adaptive patches provide a more flexible way to fully consider the heterogeneous regions of the remote sensing data, making the neighbors within the same 3D patch more probable to have the same class label. • We propose a 3D-SaSiResNet method for classification of multiband images. The Siamese architecture of the network combined with 3D ResNet helps to extract discriminative spectral-spatial features from the remote sensing data cube and provides promising classification performance even with limited training samples.
The rest of this paper is organized as follows. In Section 2, detailed descriptions of the proposed method are provided. Section 3 reports experimental results on three multiband image datasets and conclusions are drawn in Section 4.

Proposed Method
In this section, we provide the detailed procedure of the proposed method, which contains two main parts: construction of 3D spatial-adaptive patches and 3D-SaSiResNet for multiband image classification (see Figure 1). Both parts of the proposed method are important for multiband image classification. First, compared to fixing the training or testing samples at the center of the patches, constructing spatial-adaptive patches can fully follow the spatial characteristics of the multiband image, making the pixels within the same patch more likely to be in the same category. It is notable that the hyperspectral image usually has more than one hundred contiguous bands, and band reduction is also essential to reduce the computational cost in the first steps. Second, the 3D-SaSiResNet facilitates the extraction of discriminant features and classification of multiband images.

Construction of 3D Spatial-Adaptive Patches
In the proposed method, we segment the multiband images into 3D spatial-adaptive patches, ensuring that the neighbors within the same patch are more likely to consist of the same materials. To reduce the computational cost, band reduction implemented by SAE [31, 47,48] is first performed on the original images. The reason for utilizing the SAE for band reduction is that the accuracy of superpixel segmentation results is important for the proposed method and the SAE can provide an effective nonlinear transformation and nonlinear band reduction. Letting I 0 ∈ R m 0 ×n 0 ×b 0 be the original multiband images data, where m 0 , n 0 , and b 0 represent the number of rows, columns, and spectral bands, respectively, the intrinsic dimensionality of I 0 can be estimated by the "intrinsic_dim" function in the "drtoolbox" [49], which is a public MATLAB toolbox for dimensionality reduction. Supposing the intrinsic dimensionality of I 0 is b, we design a SAE network to reduce the band of I 0 from b 0 to b. The structure of the SAE used in this paper is depicted in Figure 2, where the input layer I 0 ∈ R b 0 is the spectral signature of I 0 . As shown in Figure 2, SAE is a stacked structure of multiple autoencoders and each autoencoder is a single hidden layer feedforward network, whose output aims to reconstruct the input signal. In every single autoencoder, the hidden layer units are less in number than the input and output layer units. As such, the autoencoder network learns a compressed representation of the high-dimensional remote sensing data. Without loss of generality, supposing the input signal of a single autoencoder is X ∈ R b x , which is reduced to b y features representing high abstraction in the hidden layer, and then the X is reconstructed into Z ∈ R b x . The above-mentioned process can be formulated as where Y ∈ R b y refers to the compressed data in the encoding step, Z represents the reconstructed data in the decoding step, w y and w z denote the weights of input-to-hidden and hidden-to-output, respectively, β y and β z are the bias of the hidden and output units, respectively, and f y (·) and f z (·) denote the activation functions. The autoencoder can be trained by minimizing the following reconstruction error: The SAE used in this paper is a stacked series of autoencoders layer by layer, where the hidden layer output of the previous autoencoder is taken as the input of the next layer. The parameters of the SAE can be obtained by training each autoencoder hierarchically with the back propagation algorithm in an unsupervised fashion. Moreover, it is worthwhile to note that we use five hidden layers (the number of units is set to 100, 50, 25, 10, and b, respectively) for band reduction of hyperspectral image since the hyperspectral image usually contains hundreds of contiguous spectral bands, too many layers will increase the computing burden, while fewer layers will reduce the extraction effect. As to the multispectral image, only one hidden layer (the number of units is set to the intrinsic dimensionality) is needed since the spectral bands of multispectral image are usually less than 10, and, therefore, it is not necessary to adopt as many hidden layers as the hyperspectral image has.
Remote Sens. 2020, 12, 1640 Output layer Hidden layers Simple linear iterative clustering (SLIC) [20] is then performed on each of the reduced bands to obtain over-segmented maps. Letting I ∈ R m 0 ×n 0 ×b be the multiband image with reduced bands and I s ∈ R m 0 ×n 0 ×b be the superpixel segmentation results obtained by performing SLIC to each of the b band, the final superpixel map S can be generated by Algorithm 1. Moreover, Figure 3 gives an example about how to obtain S. There are three bands in I, and I s ∈ R 7×7×3 denotes the segmentation results of SLIC. In case the intersection of all the over-segmented regions where the pixel (p, q) located is not empty, the pixels located at that intersection will be given the same label. Otherwise, a single label is given to the (p, q)th pixel. The above-mentioned procedure is repeated with (p, q) changes from (1, 1) to (7,7), and the superpixel map S is finally generated by traversing all of the pixels in I s . Moreover, it is important to note that the labels in S denote different segmentation blocks rather than represent the class labels of land covers.
Given the superpixel map S, instead of fixing a certain sample s 0 at the center, the 3D spatial-adaptive patch of s 0 is determined by cropping the minimum bounding rectangle of the superpixel in which s 0 lies and extracting the pixels from the input multiband image according to the cropped minimum bounding rectangle. To ensure all the pixels in the same patch have the same land-cover class, those pixels, which are located in the minimum bounding rectangle but do not belong to the same superpixel, are substituted by the mean of the rest pixels in the minimum bounding rectangle. All of the 3D patches are resized to the same window size by bicubic interpolation for the convenience of model training. Specifically, Figure 4 depicts a schematic example of the patch generation procedure. The window size of the patch is set to 3 × 3. Figure 4a denotes the procedure of extracting a 3D patch by fixing a certain sample s 0 at the center, the materials, and land-covers of the neighbors may be quite different from s 0 , and the number of patches is equal to the number of samples. Figure 4b denotes the procedure of extracting a 3D spatial-adaptive patch, in which the neighbors are more likely to have the same labels as s 0 . Moreover, all the pixels that are contained within the same superpixel result in the same 3D spatial-adaptive patch, and, therefore, the number of generated 3D spatial-adaptive patches is equal to the number of superpixels. Taking each 3D spatial-adaptive patch as an object in multiband classification, the 3D-SaSiResNet is, in essence, an object-based method rather than a pixel-based method.

3D-SaSiResNet for Multiband Image Classification
We propose a 3D-SaSiResNet to extract high-level features for supervised classification of the multiband images. Supposing the training set comprises N cases and can be represented by {s i , l i }, i = 1, 2, . . . , N, where s i ∈ R b 0 refers to the spectral signature of the ith training sample and l i is the class label, the 3D spatial-adaptive patch D i ∈ R w×w×b 0 of s i is extracted from the original multiband images data I 0 ∈ R m 0 ×n 0 ×b 0 by reference to the superpixels generated in Section 2.1, where w is the window size of D i . The 3D patches of training samples are grouped in N refers to a combination of the N patches in mathematics, and it involves the number of times to select two patches from a set of N patches without repetition. The pairs {D i , D j } are then fed into the two subnetworks of the proposed method. In other words, N(N − 1)/2 pairs of 3D patches are compared when we train the network. The architecture of the 3D-SaSiResNet is illustrated in Figure 5, which contains two identical subnetworks with the same configurations and shared weights. The training procedure aims to train a model to discriminate between a series of pairs from same/different classes. As shown in Figure 5, the first subnetwork takes D i as input, followed by a sequence of 3D convolution [39], 3D batch normalization (3D-BN) [50], 3D rectified linear unit (3D-ReLU) [51], 3D max-pooling [52], and a 3D residual basic block [34] (or 3D ResNeXt basic block [53]) with skip connection, the output of the last pooling layer is flattened as a vector to connect with a fully connected layer, and the last vector f (D i ) is the features of D i produced by the above-mentioned subnetwork. The second subnetwork takes D j as input, followed by the same procedure as performed in the first subnetwork, and outputs the features f (D j ). In greater detail, the five kinds of layers involved in the 3D-SaSiResNet are described in the following: • 3D convolution layers: The 3D convolution layers [39] directly take 3D cubes (e.g., 3D patch D i ) as input and are more suitable for simultaneously extracting spectral-spatial features of the multiband images than the 2D ones. Each input 3D feature map of the convolution layer is convolved with a 3D filter and is then passed through an activation function to generate the output 3D feature map where D h−1 and D h are the feature maps of the (h − 1)th and hth layers, K h and β h represent the 3D filter and bias in the hth layer, respectively. • 3D-BN layers: The 3D-BN layers [50] reduce the internal covariate shift in the network model by normalization, which allows a more independent learning process in each layer. 3D-BN can regularize and accelerate the learning process, imposing a Gaussian distribution on the feature maps. Specifically, letting V be the original batch feature cube, 3D-BN produces the transformed outputV byV where MEAN(·) and VAR(·) are the mean and standard-deviation calculated per-dimension over the mini-batches, respectively, γ and η are learnable parameter vectors, and > 0 is a small number to prevent numerical instability. • ReLU layers: ReLU layers [51] are nonlinear activation functions applied to each element of the feature map to gain nonlinear representations. As a widely-used activation function in neural networks, ReLU is faster than other saturating activation functions. The formula of the ReLU function is • 3D pooling layers: 3D pooling layers [52] reduce the data variance and computation complexity by sub-sampling operations, such as max-pooling and average-pooling. The feature maps are first divided into several non-overlapping 3D cubes, and then the maximum or average value within each 3D cube are mapped into the output feature map. • 3D residual basic blocks/3D ResNeXt basic blocks: 3D residual basic blocks [34] (or 3D ResNeXt basic blocks [53]) are adopted in the proposed model to alleviate the vanishing/exploding gradient problem. Two types of connections are involved in the 3D residual basic blocks. The first is a feedforward connection that connects layer by layer, i.e., each layer is connected with the former layer and the next layer. The second is the shortcut connection that connects the input with its output and preserves information across layers. The outputs of those two types of connection are summed, followed by a ReLU layer whereD h−1 denotes the sum of the identity mapping I(D h−1 ) and residual function F (D h−1 , W h−1 ), W h−1 is the weight matrix, and D h−1 and D h are the input and output feature maps of the residual basic block. The structure of 3D ResNeXt basic blocks is similar to the 3D  Figure 5. Architecture of the 3D-SaSiResNet. The 3D residual basic block in this architecture can also be replaced by a 3D ResNeXt basic block, in which the 3D convolution layer highlighted with an orange dotted cube should be changed to grouped convolutions.
Given the features f (D i ) and f (D j ) extracted by the first and second subnetworks, the following Euclidean distance is adopted as the metric to compute the distance between D i and D j After the architecture of the 3D-SaSiResNet is constructed, we train the model for many epochs using the pairs {D i , D j }, i = j, i, j = 1, 2, . . . , N. The parameter θ of the 3D-SaSiResNet is updated through minimizing the contrastive loss function Q(θ) where (L, D i , D j ) is the training sample pair, L is a binary label assigned to the pair {D i , D j }, L = 0 if D i and D j have the same label and L = 1 if D i and D j have different labels, Q S and Q D are designed to minimize Q with respect to θ which will result in low values of E(D i , D j ) for similar pairs and high values of E(D i , D j ) for dissimilar pairs, and M > 0 is a margin. We set M = 2 in this study. Moreover, we use the adaptive moment estimation (Adam) [54] to minimize the loss function Q(θ) and obtain the optimal parameters of the network. LettingD ∈ R w×w×b 0 be the 3D patch of a testing samples, the class label ofs can be obtained by a nearest neighbor classifier. That is,D is independently and sequentially compared with each D i , i = 1, 2, . . . , N and the class label ofs is set to be the same as that of a specific training sample, whose features deserve the following minimal Euclidean distance: where f (D) and f (D i ), i = 1, 2, . . . , N are the features extracted by the trained network.

Dataset Description
To verify the performance of the proposed method, experiments are performed on three real-world multiband images datasets (i.e., Indian Pines data, University of Pavia data, and Zurich data).
The following provides a brief description of the three datasets:

1.
Indian Pines data: the first dataset was gathered by the National Aeronautics and Space Administration's (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992 over an agricultural area of the Indian Pines test site in northwestern Indiana. The selected scene consists of 145 × 145 pixels and 220 spectral reflectance bands cover the wavelength range of 400 to 2500 nm. After removing 20 bands (i.e., bands 104-108, 150-163 and 220) corrupted by the atmospheric water-vapor absorption effect, 200 bands are considered for experiments. The spatial resolution of this dataset is about 20 m per pixel. The three-band false color composite image together with its corresponding ground truth is depicted in Figure 6. Sixteen different classes of land-covers are considered in the ground truth, which contains 10,366 labeled pixels ranging unbalanced from 20 to 2468, posing a big challenge to the classification problem. The number of training, validation, and testing samples in each class is shown in Table 1, whose background color corresponds to different classes of land-covers. Moreover, note that the total number of samples in class 9 (i.e., Oats) are only 20, and we set the number of validation samples to 5 in this class.

2.
University of Pavia data: the second dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over the urban area surrounding the University of Pavia, south of Italy, on 8 July 2002. The dataset contains 103 hyperspectral bands covering the range from 430 to 860 nm and a scene made of 610 × 340 with a spatial resolution of 1.3 m per pixel is considered in the experiment. The three-band false color composite of this data and its corresponding ground truth are shown in Figure 7. We consider nine classes of land-covers in this dataset, and the number of training, validation, and testing samples in each class is shown in Table 2, whose background color distinguishes different classes of land-covers.

3.
Zurich data: The third dataset was captured by the QuickBird satellite in August 2002 over the city of Zurich, in Switzerland [55]. There are 20 images in the whole dataset and we choose the 18th image for experiments. The chosen image consists of four spectral bands spanning near-infrared to visible spectrum (NIR-R-G-B) and the size of each band is 748 × 800 with a spatial resolution of 0.61 m per pixel. The color composite image and the ground truth are plotted in Figure 8. Six classes of interest are contained in this dataset and the detailed number of training, validation and testing samples in each class is displayed in Table 3, whose background color represents different classes of land-covers.

Experimental Settings
The purpose of the experiments is to assess the effectiveness of the 3D spatial-adaptive patch construction method and the 3D-SaSiResNet-based classification method. To that end, we utilize the 2D-CNN, 2D-ResNet, 2D Siamese ResNet (abbreviated as 2D-SiResNet), 3D-CNN, 3D-ResNet, and the proposed 3D-SaSiResNet as classifiers, and compare the patches generated in a spatially adaptive fashion with the patches obtained by fixing the training (or validation, testing) samples at the center. Moreover, the SVM [13] by using spectral features is also compared with the 3D-SaSiResNet. As will be shown in Section 3.3, a total of 12 methods are given for comparative study by combining the patch construction and classification methods in pairs. In different methods, the abbreviations containing "Sa" denote spatial-adaptive-based methods, "-fix" denotes 2D or 3D patches are extracted by fixing a certain sample at the center of neighborhood windows, "Si" denotes Siamese-based methods, "−1" denotes a 2D or 3D residual basic block that is used in the ResNet-based methods, "−2" denotes a 2D or 3D ResNeXt basic block that is used in the ResNet-based methods.
The architectures of the above-mentioned deep learning methods are displayed in Tables 4-9. It is notable that small kernels are effective choices for 2D-CNN [56] and 3D-CNN [39], and we fix the kernel sizes of all the 2D convolutions to 3 × 3, and most of the 3D convolutions to 3 × 3 × 3. To speed up the 3D-based algorithms and consider more bands at a time, the kernel sizes of the first 3D convolution layers for Indian Pines data, University of Pavia data, and Zurich data are set to 100 × 3 × 3, 50 × 3 × 3 and 3 × 3 × 3, respectively. To make a fair comparison, all the deep learning methods are configured as similar to each other as possible. For instance, the subnetwork in the Siamese-based methods (i.e., 2D-SaSiResNet-1/2D-SaSiResNet-2/2D-SiResNet-2-fix and 3D-SaSiResNet-1/3D-SaSiResNet-2/3D-SiResNet-2-fix) are set up identical to the corresponding non-Siamese ones (i.e., 2D-SaCNN/2D-SaResNet-1/2D-SaResNet-2 and 3D-SaCNN/3D-SaResNet-1/3D-SaResNet-2), and the number of layers in the 2D-based networks are the same as the 3D-based ones. For the parameter settings, the intrinsic dimensionalities are calculated to be 5, 4, and 2 in the Indian Pines data, University of Pavia data, and Zurich data, respectively, and the number of superpixels are set to 589, 2602, and 19,543 in the three multiband images datasets, respectively. The patch size w of all the three experimental datasets is set to 9. In other words, each sample is represented by its 9 × 9 neighborhoods in 2D-SiResNet-2-fix and 3D-SiResNet-2-fix, while all the spatial-adaptive patches are resized to 9 × 9 so as to represent the objects of the multiband images. The reasons for resizing the patches to 9 × 9 are that, if w is too small (e.g., 3 × 3), the spatial information will not be adequately learned by the network; on the contrary, if w is too large (e.g., 15 × 15), the computational burden will be increased and the training process will slow down. The batch size and training epochs are taken as 20 and 100, respectively. Furthermore, the Adam solver is adopted to train the networks and the learning rate is initialized to 0.0001.  Moreover, the SVM and 3D spatial-adaptive patches generation procedure is performed by MATLAB R2019a (The MathWorks, United States) on a Windows 10 operating system (Microsoft, United States) and the deep learning-based classification methods are executed on a server with Nvidia Tesla K80 GPU platform (NVIDIA, United States) and Pytorch backend. In all three datasets, very limited labeled samples, i.e., 10 samples per class, are randomly chosen as training samples, 10 (or 5 samples in class 9 (i.e., Oats) of the Indian Pines data) per class are for validation and the rest are for testing. To study the influence of the number of training samples, we also choose 20 and 50 samples from each class for training. Since classes 1 (i.e., Alfalfa), 7 (i.e., Grass/pasture-mowed), and 9 (i.e., Oats) contain very few samples for the Indian Pines data, the number of training samples in those classes is constantly set to 10 in different cases. The number of validation samples is the same as that in Tables 1-3. The label of a training patch is set to be the same as that of most training samples in this patch. In case the training patches of a certain class are less than 2, we perform rotation augmentation, whose rotation angle is 90 degrees, to augment the number of training patches to 2. Three popular indices, i.e., overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ), are computed to compare different methods quantitatively. All three of the indices are obtained by comparing the reference labels and the calculated labels of test samples. Each experiment is repeated ten times using a repeated random subsampling validation strategy to alleviate possible bias and the average results are displayed in the tables. Table 5. Detailed configuration of the 2D-SaResNet-1 and a subnetwork of the 2D-SaSiResNet-1. Bold type indicates the number of feature maps in each layer.

Layer
Input Size Kernel Size Stride Padding Cardinality Output Size Output of Max-pooling 1 + Output of BN 3 Output of Max-pooling 1 + Output of BN 3 Flatten 3 C denotes the number of land-covers in the multiband image.
Output of Max-pooling 1 + Output of BN 3 Flatten Output of Max-pooling 1 + Output of BN 3 Flatten 3 C denotes the number of land-covers in the multiband image.

Classification Results
The final superpixel segmentation results of the three experimental datasets are shown in Figure 9, from which we find that the proposed superpixel map generation method flexibly divides the multiband images data into naturally formed spatial areas with irregular sizes and shapes. Tables 10-12 report the class-specific accuracy, OA, AA, κ, and computation time of different methods, while the thematic maps are visually depicted in  According to the reported results, a few observations are noteworthy. It can be first seen that, in most cases, the spatial-adaptive patches outperforms the patches generated by fixing the training/validation/testing samples at the center. As an example, for the Indian Pines data, the OA and κ of the 2D-SaSiResNet-2 are respectively 8.24% and 9.00% higher than the 2D-SiResNet-2-fix (see Table 10), while the improvement in OA of the 3D-SaSiResNet-2 is at least 3% more than its corresponding fixed center version (i.e., 3D-SiResNet-2-fix). We can observe from Figure 10 that the classification maps of 2D-SaSiResNet-2/3D-SaSiResNet-2 are much less noisy than those of the 2D-SiResNet-2-fix/3D-SiResNet-2-fix. For the University of Pavia data (see Table 11) and Zurich data (see Table 12), although the improvement in classification accuracy is not as much as the Indian Pines data, the methods with spatial-adaptive patches still provide better results than those with a fixed center in most situations. It is shown in Figure 11 that 2D-SiResNet-2-fix/3D-SiResNet-2-fix fail to correctly classify the pixels in some regions (e.g., the Bricks (class 8) located at the middle-left region), while 2D-SaSiResNet-2/3D-SaSiResNet-2 give a better classification performance than 2D-SiResNet-2-fix/3D-SiResNet-2-fix. Moreover, as shown in Figure 12, the classification maps of 2D-SiResNet-2-fix/3D-SiResNet-2-fix assign more areas to Road (class 1). This makes the classification accuracy of the Road (class 1) higher than 2D-SaSiResNet-2/3D-SaSiResNet-2; nevertheless, it sacrifices the accuracy of other land-covers.
The reason for good results of the spatial-adaptive patch construction method is that they can take into consideration the spatial structures and region boundary areas of the remote sensing data while its counterpart forcibly divides the data into fixed center patches, ignoring the heterogeneous land covers, especially those at the border of different classes. We then compare the 3D-SaSiResNet-1/3D-SaSiResNet-2 against other methods. It can be observed from Tables 10-12 that SVM and 2D-SaCNN perform the worst, the 3D-SaCNN, 2D-SaResNet-1/2D-SaResNet-2, and 3D-SaResNet-1/3D-SaResNet-2 have better performances than the SVM and 2D-SaCNN but comparable or slightly inferior to the 2D-SaSiResNet-1/2D-SaSiResNet-2, and 3D-SaSiResNet-1/3D-SaSiResNet-2 yield superior performance to the competing methods. As given in Table 10, the OA of the SVM and 2D-SaCNN are at least 10% lower than the 3D-SaCNN, 2D-SaResNet-1/2D-SaResNet-2 and 3D-SaResNet-1/3D-SaResNet-2 for the Indian Pines data. It can be obviously seen from Tables 11 and 12 that the classification performance of the SVM and 2D-SaCNN is also worse than the 3D-SaCNN, 2D-SaResNet-1/2D-SaResNet-2, and 3D-SaResNet-1/3D-SaResNet-2 for the other two datasets. One reason why 3D-SaCNN performs better than SVM and 2D-SaCNN is that the 3D convolution can simultaneously extract the spectral-spatial features, and the reason why 2D-SaResNet-1/2D-SaResNet-2 outperform 2D-SaCNN is mainly because shortcut connections are adopted to improve the effectiveness of the training procedure. Moreover, the Siamese-based methods perform much better than their corresponding no-Siamese versions. Specifically, it is notable from Table 10 that the 3D-SiSaResNet-2 provides the best performance in all the algorithms under comparison. Classification results of the University of Pavia data and Zurich data also reveal similar properties. This implies that the Siamese architecture of the network combined with 3D ResNet can effectively improve the classification performance. The main reason why the Siamese architecture is suitable for classification of multiband images with very few training samples is because the network is trained by inputting the training samples in pairs and thus the separability of different classes are learned and the features extracted are especially suitable for the classification task.
An additional noticeable point is that the 3D-based classification methods achieve comparable or better performance than the 2D-based ones. As illustrated in Table 11, in comparison with the 2D-SaCNN, the OA of 3D-SaCNN is improved by 8.79% for the University of Pavia, while the OA of 3D-SaSiResNet-2 is 2.84% higher than its 2D scenario (i.e., 2D-SaSiResNet-2). Similarly, it can be found from Tables 10-12 that the 3D-SaCNN, 3D-SaResNet-1/3D-SaResNet-2, 3D-SaSiResNet-1/3D-SaSiResNet-2, and 3D-SiResNet-2-fix have better classification results than their corresponding 2D versions. Moreover, it is also clearly visible from Figures 10-12 that the classification maps of the 3D-based methods are much closer to the ground truth (see Figures 6b, 7b and 8b) than the 2D-based ones. This is due to the fact that, by representing the multiband images in 3D cubes, the joint spectral-spatial structures are effectively learned by the 3D-based classification methods. To further validate the effectiveness of the proposed method, we focus on the features learned by the last layer of various deep learning-based methods. It is notable from Table 10 that the land covers Corn-no till (class 2), Corn-min till (class 3), and Soybean-no till (class 10) are difficult-to-separate classes for the Indian Pines data. For instance, as for class 3, the classification accuracy of the 2D-SaCNN is as low as 43.21%, the 2D-SaSiResNet-2 and 3D-SaSiResNet-2 significantly improve the classification performance, and the classification accuracy of the 3D-SaSiResNet-2 is increased to 81.92%. As for class 2 and class 10, the proposed 3D-SaSiResNet-2 also provides better results than the state-of-the-art methods. To give an intuitive explanation, the two-dimensional features of the Indian Pines data obtained by 2D-SaCNN, 2D-SaSiResNet-2, 3D-SiResNet-2-fix, and 3D-SaSiResNet-2 in a trial are plotted in Figure 13, from which we can see that the representations learned by the 3D-SaSiResNet-2 are more scattered and separable than other methods. Therefore, it is much easier to distinguish the samples with the features obtained by the 3D-SaSiResNet-2. Moreover, as can be observed from Tables 10-12, although 3D-SaSiResNet-1 and 3D-SaSiResNet-2 take more computation time than no-Siamese-based methods, they are much more efficient than the 2D-SiResNet-2-fix and 3D-SiResNet-2-fix for the Indian Pines data and Zurich data. As to the University of Pavia data, although the processing time of 3D-SaSiResNet-1/3D-SaSiResNet-2 is about four minutes longer than 2D-SiResNet-2-fix/3D-SiResNet-2-fix, they can complete the classification task in less than 20 minutes. 3D-SaSiResNet-1/3D-SaSiResNet-2 are efficient because they are object-based methods rather than pixel-based methods, and the kernel sizes of the first 3D convolution layers are set to 100 × 3 × 3 and 50 × 3 × 3 in the two hyperspectral datasets rather than 3 × 3 × 3. In short, the experimental results validate the effectiveness of 3D-SaSiResNet in the classification of multiband images data.     therefore, the number of features in object-based methods (i.e., (b,c,e)) are much less than that in the original data (i.e., (a)) and the pixel-based method (i.e., (d)).

Parameter Analysis
In this section, the influence of four important parameters, i.e., the number of hidden layers in SAE for band reduction of the hyperspectral image, the number of training samples, the number of superpixels, and the number of training epochs on the performance of the proposed method, is discussed. Figures 14-17 show the OA of the proposed method when the four parameters vary.
As shown in Figure 14, we set the number of hidden layers in SAE as 1 (the number of the unit is set to 5), 3 (the number of units is set to 100, 25 and 5, respectively), 5 (the number of units is set to 100, 50, 25, 10 and 5, respectively), 7 (the number of units is set to 150, 100, 75, 50, 25, 10, and 5, respectively), and 9 (the number of units is set to 175, 150, 125, 100, 75, 50, 25, 10, and 5, respectively), and the classification accuracy varies with the different number of hidden layers. It can be observed that the OA is relatively low when the number of hidden layers is 1 or 3. When the number of hidden layers is equal to or larger than 5, the classification performance is satisfactory. In this paper, we set the number of hidden layers to 5 since more hidden layers will increase the complexity of the model. Figure 15 depicts the OA of different methods with various numbers of training samples for the Indian Pines data. The number of training samples in classes 1, 7, and 9 is constantly set to 10 in different cases. It can be seen from Figure 15 that the classification performance of all considered methods yields stable improvement with the increase of the training set size. For instance, the OA of 2D-SaCNN is lower than 60% in case 10 training samples per class are selected and is increased to more than 70% with 50 training samples per class. This stresses again the importance of training samples for multiband images classification. Moreover, it can be concluded from Figure 15 that the 3D-SaSiResNet has better classification performance than the Siamese convolutional neural network (S-CNN) proposed in [45]. This is because, when the number of training samples in each class equals 50, the classification accuracy of S-CNN is lower than 90% (see Figure 20 in [45]), while the accuracy of our proposed method is higher than 90% with 16 rather than nine classes of land-covers and using 50 training samples per class.   Figure 16 plots the OA against the number of superpixels for the Indian Pines data. The number of superpixels is varied within the range from 290 to 2896. We can observe that a dramatically upward tendency appeared with an initial increase of superpixel number, and the classification accuracy decreases in case the number of superpixels is over a certain value (e.g., 1057). Based on the above analysis, too small and too large number of superpixels may be harmful to the classification performance, and it is better to utilize suitable superpixel numbers according to the object characteristics' spatial structure of the remote sensing data. Furthermore, the impact of the number of training epochs is depicted in Figure 17. It can be seen that the OA rapidly increases in the first couple of epochs and then improves slowly as the number of epochs increases. Finally, the OA trends to a certain stable value with the increased number of epochs. As shown in Figure 17, it is suggested that the 3D-SaSiResNet-2 is trained more than 30 epochs to obtain stable and effective classification performance.

Discussion
In this paper, the three datasets are benchmark and widely-used in remote sensing classification. In the experiments, the Indian Pines data gain more improvement in classification accuracy than the other two datasets. The reason why Indian Pines data achieve more discriminative results than the other two datasets is that the spatial resolution of Indian Pines data (i.e., 20 m) is much lower than that of the Pavia (i.e., 1.3 m) and Zurich datasets (i.e., 0.61 m). Therefore, in the Indian Pines data, the region represented by the same number of pixels may contain more different types of land-covers than the other two datasets. It is noteworthy that the proposed 3D-SaSiResNet can better distinguish the boundaries of different land-covers, and our spatial-adaptive-based method has a higher advantage in the classification of Indian Pines data. We perform statistical significance analysis to further confirm the effectiveness of the proposed method. The statistical difference between the proposed 3D-SaSiResNet-2 and competing methods is studied by the McNemar's test, which is achieved by the standardized normal test statistic where f ij , (i, j = 1, 2) denotes the number of samples correctly classified by the classifier i but incorrectly classified by the classifier j and Z is the pairwise statistical significance of the difference between the ith and jth classifiers. If |Z| > 1.96, the classification difference of the two classifiers is considered as statistically significant at the 5% level of significance. It is shown in Table 13 that the 3D-SaSiResNet-2 is statistically superior (|Z| > 1.96) to all of the other methods except the 3D-SaSiResNet-1 (|Z| < 1.96) for the Indian Pines data. Moreover, Figure 13 intuitively compares the separability of the features learned by different methods according to choosing three difficult-to-separate classes for the Indian Pines data. It is observed that the features obtained by the proposed method are easier to distinguish than other methods. In the 3D-SaSiResNet, the nearest neighbor classifier is used to classify the learned features of testing samples. The nearest neighbor classifier can be replaced by other sophisticated classifies (e.g., k-nearest neighbor (KNN), SVM or Softmax layer), and it can be treated as a special case of KNN with k = 1. It is worth noting that, in case the training set is quite small, the nearest neighbor classifier provides better classification performance than other sophisticated classifiers, while KNN, SVM, or the Softmax layer has higher accuracy than the nearest neighbor classifier if more training samples are available.
The classification results shown in Table 10-12 seems a little bit low. For instance, the highest OA in Table 10 is lower than 80%. In addition, there is a lagging between OA and κ. The reason for the above phenomena is that we only choose a small training set (i.e., training samples per class are no more than 10) in the experiment. The number of training samples is essential to the classification accuracies of various methods. The classification performance of a certain method with a small training set is more likely inferior to the situation with a big training set. As shown in Figure 15, the OA of 3D-SaSiResNet-2 is higher than 90% with 50 training samples per class (except the classes 1, 7, and 9). Moreover, the gap between OA and Kappa will decrease with the increase of training samples.

Conclusions
In this paper, we have proposed a deep network architecture, namely 3D-SaSiResNet, for classification of the multiband images data. The underlying idea of 3D-SaSiResNet is to explore the spatial-adaptive and discriminative features that make the samples from the same class close and those belonging to different classes separated. Based on this concept, 3D spatial-adaptive patches are constructed by superpixel segmentation. In contrast to previous approaches that fix the samples at the center, the patches generated by superpixels do not force the samples at the center but flexibly make the neighbors within the same patch be more likely to come from the same class. Moreover, the network has a Siamese architecture composed of two 3D-ResNet-based subnetworks, which are pairwise trained to increase the separability of different classes. The unlabeled testing samples can be fed into the trained network and classified by a simple nearest neighbor classifier. Experiments on three multiband images datasets acquired by different sensors have demonstrated that the proposed 3D-SaSiResNet method outperforms the state-of-the-art techniques in terms of visual quality and recognition performance. Comparisons show that the 3D-SaSiResNet is especially effective in case limited samples are available for training. The main disadvantage of 3D-SaSiResNet is that the running time of the 3D spatial adaptive patch construction step is much longer than that of the 2D-SiResNet-2-fix and 3D-SiResNet-2-fix. In future works, we will try to adaptively determine the optimal number of superpixels for different datasets and improve the proposed method by deformable convolutions. Classifying unbalanced samples, especially the number of samples in some categories that are particularly small, is one of our future activities. Moreover, it would be a great interest to improve the computational efficiency of the Siamese architecture by pruning and quantization.