Spectral-Spatial Classification of Hyperspectral Images: Three Tricks and a New Learning Setting

Spectral-spatial classification of hyperspectral images has been the subject of many studies in recent years. When there are only a few labeled pixels for training and a skewed class label distribution, this task becomes very challenging because of the increased risk of overfitting when training a classifier. In this paper, we show that in this setting, a convolutional neural network with a single hidden layer can achieve state-of-the-art performance when three tricks are used: a spectral-locality-aware regularization term, smoothing-based data augmentation and label-based data augmentation. The shallow network architecture prevents overfitting in the presence of many features and few training samples. The locality-aware regularization forces neighboring wavelengths to have similar contributions to the features generated during training. The new data augmentation procedure favors the selection of pixels in smaller classes, which is beneficial for skewed class label distributions. The accuracy of the proposed method is assessed on five publicly available hyperspectral images, where it achieves state-of-the-art results. Like other spectral-spatial classification methods, we use the entire image (labeled and unlabeled pixels) to infer the class of its unlabeled pixels. To investigate the positive bias induced by the use of the entire image, we propose a new learning setting where unlabeled pixels are not used for building the classifier. Results show the beneficial effect of the proposed tricks also in this setting and substantiate the advantages of using labeled and unlabeled pixels from the image for hyperspectral image classification.


Introduction
Hyperspectral images contain rich spectral information coming from contiguous spectral bands. In the spectral domain, pixels are represented by vectors for which each component is a measurement corresponding to a specific wavelength [1]. The length of the vector is equal to the number of spectral bands that the sensor collects. For hyperspectral images, several hundreds of spectral bands of the same scene are typically available, which form the features of a pixel. Current operational imaging systems provide images for various applications, e.g., in ecology, geology and precision agriculture [2].
A relevant task of hyperspectral image processing is classification, which aims at building a classifier using the pixel features in order to assign each pixel to one of a given set of classes [3]. Current state-of-the-art methods take a spectral-spatial approach, meaning that they use neighborhood information of labeled pixels. Spectral-spatial methods are based on diverse techniques, such as Markov random fields [4,5,6], discriminative feature construction [7,8,9,10,11,12], modification and fusion of classifiers [13,14], label propagation, active learning and semi-supervised learning [15,16], the use of external unlabeled data [17] and deep (convolutional) neural networks [18,19,20,21,22,23,24]. Furthermore, object-based methods utilize geometric features of the image extracted by means of segmentation techniques [25,26,27].
These methods achieve excellent performance on benchmark hyperspectral image classification tasks when a large number of labeled pixels for training is provided [19,18]. However, pixel labeling is an expensive task. Therefore, a problem of more practical relevance is to perform hyperspectral image classification with only a few manually-labeled pixels for training. A second problem is the inherent class unbalance of hyperspectral images, where some classes have many pixels, while other classes have only a few.
In this paper, we propose to tackle these problems using a simple shallow Convolutional Neural Network (CNN) and three 'tricks': spectral-locality-aware regularization, smoothing-based data augmentation and label-based data augmentation. The shallow architecture is used to prevent overfitting caused by the few labeled pixels and the many features. Locality-aware regularization forces neighboring wavelengths to have similar contributions to the generated features of the neural network. Smoothing-based data augmentation takes advantage of the spectra of neighboring pixels, and label-based data augmentation exploits labels of neighboring pixels in favor of small classes.
Extensive experiments indicate the effectiveness of the proposed method, which achieves accuracy comparable to or better than that of existing methods, such as deep neural networks [28], multiple kernel learning [29], probabilistic class structure regularized sparse representation graph [30,31] and low-rank Gabor filtering [12] (see the results in Table 10).
Spectral-spatial methods exploit information from neighborhood pixels. Since the training and testing pixels are drawn from the same image, their features are likely to overlap in the spatial domain due to the shared source of information: for instance, [23] employed input patches, the central pixel of which is in the training set, and [12] applied Gabor filters to an L-size neighborhood of training pixels. As a consequence, the resulting learning setting used in spectral-spatial methods has an intrinsic positive bias induced by the overlap between training and test samples. In order to investigate such a bias, we consider also a non-overlapping learning setting, where only the labeled pixels initially selected for training are used for building a classifier.

Related Work
Below, we briefly mention a few selected spectral-spatial approaches and methods for hyperspectral image classification. We refer the reader to [32] for a recent survey of hyperspectral image classification methods.
We can divide methods for hyperspectral image classification into three broad categories: (1) pre-processing-based methods; (2) end-to-end methods; (3) hybrid methods. Pre-processing-based methods construct features prior to training a classifier. Recent methods in this category include the Discriminative Low-Rank Gabor Filtering (DLRGF) method by [12] for spectral-spatial feature extraction prior to classification, a deep CNN with 2D input patches and R-PCA [20] and a deep stacked auto-encoder with 2D input patches and PCA [33].
End-to-end methods learn features while training a classifier. These methods include (multiple) kernel learning methods, which use kernels to implicitly map the input space into a high dimensional non-linear space (see the recent survey [29]), and sparse representation-based methods, like [34,30,31], which learn a sparse representation of test pixels as a linear combination of a few training samples from a given dictionary; the corresponding sparse representation coefficients encode the class information implicitly. Hybrid methods involve multi-step procedures, which include pre- and/or post-processing steps. For instance, the superpixel-based graphical model by [35] consists of three steps: the superpixel generation using the watershed segmentation algorithm after performing gradient fusion among multiple spectral bands; the superpixel-based graphical model development with the aid of pixel-level attributes; and the loopy belief propagation algorithm applied at the superpixel level. Here, a superpixel is a group of spatially-connected similar pixels. Object-based methods segment an image and simultaneously try to assign a class to each segment [25,26,27].
Methods specifically related to the one we propose are based on convolutional neural networks and data augmentation. Due to the success of convolutional neural networks in image classification, a plethora of CNN-based methods for hyperspectral image classification have been proposed. They differ mainly in the architecture that they use, the specific loss function that is optimized and the representation of the input data, that is, as single pixels, patches of pixels, cubes of pixels, etc. Moreover, some CNN-based methods use pre-processing, often PCA, to either build a low dimensional set of non-linear input features or to extract additional information (e.g., edge detection). These methods include: [20], a deep CNN with 2D input patches and R-PCA; [33], a deep stacked auto-encoder with 2D input patches and PCA; [18], a contextual deep CNN; [36], a multi-hypothesis prediction; [12], a low-rank Gabor filtering method; [19], a deep CNN with 1D pixel spectra; [23], a deep CNN with 1D pixel spectra, 2D pixel patches or 3D pixel cubes; [21], a deep CNN with 1D pixel spectra; and [28], a deep CNN with uniform smoothing kernel and 1D pixel spectra. Fortunately, the authors of the latter method shared the source code with us, which we could then use in our comparative experimental analysis.
Data augmentation is used to enhance the performance of deep neural networks for image classification. This approach has also been used in the context of hyperspectral image classification, in deep CNN-based methods. For instance, [37] used blocks of 5 × 5 pixels as samples and rotated and flipped the resulting training samples to enlarge the training set. In the deep CNN-based method by [38], the number of training samples was augmented four times by mirroring the training samples across the horizontal, vertical and diagonal axes. Our new data augmentation procedure is different because it takes into account the spatial locality of the data.
In [39], it has been observed that the dependence caused by overlap between the training and testing samples may be artificially enhanced by some spatial information processing techniques used in spectral-spatial classification methods, such as spatial filtering and morphological operators. Therefore, the authors introduced an alternative controlled random sampling strategy for spectral-spatial methods to reduce the overlap between training and testing samples and provided a more objective evaluation. However, the proposed strategy uses information on the class distribution, which may not be available in real-life scenarios. The non-overlapping learning setting that we propose overcomes this limitation.

Materials and Methods
A hyperspectral image is represented by a three-dimensional matrix of spectral pixels in $\mathbb{R}^{H \times W \times M}$, where $H$ is the height, $W$ is the width and $M$ is the number of wavelengths. We denote such an input image by $P$ and the original input image by $P^{orig}$.
A subset of pixels $I \subseteq H \times W$ from the input image has known class labels. This subset is called the training set, denoted by $(x, y)$. We denote by $x_i$ the $i$-th pixel of the training set, $1 \le i \le |I|$, and by $y_i$ its label. The rest of the pixels of the image form the test set, denoted as $x^{test}$. The number of classes is denoted as $K$, and we will treat labels as binary vectors, so $y_{i,k} = 1$ if and only if the $i$-th pixel belongs to class $k$.
Our method extends the training set through data augmentation. With a slight abuse of notation, we also denote the resulting training set by $x$ with assigned labels $y$.

Data
We consider five groups of hyperspectral images, which are publicly available (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes):
• Pavia Center: obtained with a Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia, Northern Italy;
• Pavia University: scanned using the ROSIS sensor during a flight campaign over Pavia, Northern Italy;
• Kennedy Space Center (KSC): obtained with the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Kennedy Space Center, Florida (USA);
• Indian Pines: scanned using the AVIRIS sensor over the Indian Pines test site in north-western Indiana (USA);
• Salinas: scanned using the AVIRIS sensor over Salinas Valley, California (USA).
As is common practice in hyperspectral image classification, we consider only the foreground pixels. Moreover, for the Indian Pines dataset, as done, e.g., in [12], we discard classes with very few samples and keep the remaining 12 classes (Corn-notill, Corn-min, Corn, Grass/Pasture, Grass/Trees, Hay-windrowed, Soybeans-notill, Soybeans-min, Soybean-clean, Wheat, Woods and Bldg-Grass-Tree-Drives). Characteristics of the images (size, number of features, foreground pixels and considered classes) are given in Table 1.

Transductive Learning Setting
Spectral-spatial methods exploit information from neighborhood pixels. Since the training and testing pixels are drawn from the same image, their features are likely to overlap in the spatial domain due to the shared source of information [39]. In this transductive learning setting, a pixel-based random sampling strategy is used to select labeled pixels for training, and unlabeled pixels from the rest of the image can also be used when building a classifier by using information from the neighborhood of training pixels.
The overall number of labeled pixels for each dataset is reported in Table 1. Unlabeled (test) pixels for each of the considered hyperspectral images are shown in Figure 1.

Non-Overlapping Learning Setting
The transductive learning setting used in spectral-spatial methods has an intrinsic positive bias due to the use of the neighborhood of each training pixel in the image, resulting in an overlap between training and test samples. In order to investigate such bias, we consider also a non-overlapping learning setting, where only training samples, i.e., the labeled pixels initially selected for training, are used for building a classifier.
In particular, we propose to randomly select a single patch of pixels for each class to use as training data. We use a patch of 7 × 7 labeled pixels for each class as a training set, which ensures that we have enough training pixels (at most 49) per class.
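The per-class patch selection can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `sample_class_patches`, the convention that label 0 marks unlabeled background, and the choice of centering the window on a randomly drawn pixel of the class are all assumptions made for the example.

```python
import numpy as np

def sample_class_patches(labels, patch=7, rng=None):
    """Pick one patch x patch window per class and return a boolean training mask.

    labels: (H, W) integer label map; 0 is assumed to mean "unlabeled".
    The image is assumed to be at least patch x patch pixels.
    """
    rng = np.random.default_rng(rng)
    H, W = labels.shape
    half = patch // 2
    train_mask = np.zeros((H, W), dtype=bool)
    for c in np.unique(labels):
        if c == 0:
            continue
        rows, cols = np.nonzero(labels == c)
        pick = rng.integers(len(rows))  # random pixel of class c (assumed strategy)
        # Clamp the window so that it stays inside the image.
        r0 = int(np.clip(rows[pick] - half, 0, H - patch))
        c0 = int(np.clip(cols[pick] - half, 0, W - patch))
        # Keep only the pixels of the window that carry label c (at most 49).
        window = labels[r0:r0 + patch, c0:c0 + patch] == c
        train_mask[r0:r0 + patch, c0:c0 + patch] |= window
    return train_mask
```

All pixels outside the returned mask would then be reserved for testing, so that no test pixel contributes to building the classifier.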
In general, there is a mismatch between both learning settings used in spectral-spatial hyperspectral image classification and standard supervised learning. In supervised learning, methods are tested on independent identically distributed (i.i.d.) data. In the case of image datasets, methods should therefore be trained on a set of images that are independent of the test set images. Even if we do not use pixels other than those selected by our sampling procedure, this does not guarantee that the samples in our training set are independent. Hence, our controlled random sampling procedure is not a proper supervised learning setting. Nevertheless, this setting is useful for assessing the performance of methods without the bias caused by overlap between the training and testing samples.
In [39], another controlled random sampling method to select labeled pixels was introduced. The proposed method considers connected component areas in the image, consisting of pixels with equal class. For each such area, pixels are randomly sampled. Each selected pixel and its 8 neighbors form the training set. Pixels in the rest of the image are only used at test time. See [39] (Algorithm 1) for a detailed description of this method. Although this procedure is interesting for assessing the performance of spectral-spatial methods, it is impractical, since one would have to know the class composition of the whole image in order to perform the step 'selects all unconnected partitions P in the class c'. Our controlled random sampling procedure overcomes this drawback because it does not use information on the class distribution and selects pixels by randomly sampling a patch for each class.

The Baseline CNN
The baseline on which we build our method is a Convolutional Neural Network (CNN) with a single hidden convolutional layer (see Figure 2). Unlike in larger CNN architectures, we do not use pooling or fully-connected hidden layers. We chose this simple architecture because of the limited amount of labeled training data and the relatively high number of features. In this context, a simpler architecture has fewer parameters to learn, which reduces the risk of overfitting.
For training, we use the standard L2 regularized cross-entropy loss function,

$$L(x, y, W) = \underbrace{-\sum_{i=1}^{|I|} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}}_{\text{Cross-entropy loss}} + \lambda_1 \lVert W \rVert_2^2. \tag{1}$$

Here, $\hat{y}_i = \psi_2(w_2\, \psi_1(w_1 * x_i))$ is the network's output (with $*$ denoting convolution), $\psi_1(\cdot)$ and $\psi_2(\cdot)$ are the activation functions and $W = [w_1, w_2]$ are the weights from the input to the single hidden layer ($w_1$) and from the hidden layer to the output ($w_2$). $K$ is the number of classes. For the hidden layer, we use a rectified linear activation function $\psi_1(u) = \max(0, u)$, and for the prediction layer, we use softmax,

$$\psi_2(u)_k = \frac{e^{u_k}}{\sum_{j=1}^{K} e^{u_j}}.$$

To learn the weights of the neural network that optimize this loss function, we use the common 'Glorot' procedure for initializing the weights [40] and Stochastic Gradient Descent (SGD) [41] for updating them. The standard parameters of a CNN are:
• learning rate ($\eta$);
• momentum (default value used in experiments: 0.7);
• number of convolutional kernels (#kernels);
• size of the convolutional kernels ($N$);
• stride for the convolution ($s$);
• L2 regularization constant ($\lambda_1$).
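To make the architecture concrete, the following NumPy sketch computes the forward pass of a single-hidden-layer 1D CNN on one pixel spectrum and the corresponding regularized loss. It is an illustration under stated assumptions rather than the implementation used in the experiments: bias terms are omitted, the loss is taken to be the summed cross-entropy plus $\lambda_1$ times the squared L2 norm of all weights, and the names `forward` and `loss` are hypothetical.

```python
import numpy as np

def forward(x, w1, w2, stride=1):
    """Class probabilities for one spectrum.

    x:  (M,) pixel spectrum
    w1: (n_kernels, N) convolutional kernels
    w2: (K, n_positions * n_kernels) output weights (biases omitted, an assumption)
    """
    n_kernels, N = w1.shape
    starts = range(0, x.shape[0] - N + 1, stride)
    windows = np.stack([x[s:s + N] for s in starts])  # (n_positions, N)
    hidden = np.maximum(0.0, windows @ w1.T)           # ReLU activation
    logits = w2 @ hidden.ravel()
    logits = logits - logits.max()                     # numerical stability
    e = np.exp(logits)
    return e / e.sum()                                 # softmax

def loss(X, Y, w1, w2, lam1, stride=1):
    """Summed cross-entropy plus L2 penalty (assumed form of the loss)."""
    ce = 0.0
    for x, y in zip(X, Y):
        p = np.clip(forward(x, w1, w2, stride), 1e-12, None)
        ce -= float(np.log(p) @ y)                     # y is a one-hot vector
    return ce + lam1 * (np.sum(w1 ** 2) + np.sum(w2 ** 2))
```

With M wavelengths, kernel size N and stride s, the hidden layer has floor((M - N) / s) + 1 positions per kernel, which determines the second dimension of `w2`.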
To enhance the robustness of the CNN to perturbed versions of the data, we add random noise to copies of the original data, and then, we add these copies to the original data to increase the amount of available training data. In the absence of any further knowledge, it is natural to use Gaussian noise.
The new spectrum of a pixel is generated by adding random Gaussian noise to the original wavelengths of the spectrum as follows:

$$\tilde{P}_{ijk} = P_{ijk} + \beta\, \epsilon_{ijk}, \tag{2}$$

where $\tilde{P}_{ijk}$ is the $k$-th wavelength of the new $(i, j)$-th spectral pixel, generated by perturbing $P_{ijk}$, with the addition of Gaussian noise $\epsilon_{ijk}$ having zero mean and unit variance. $\beta$ is a constant term that we fixed at 0.01. This procedure is applied to all the pixels in the training set.
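A minimal sketch of this noise-based augmentation is given below, assuming the noise is additive and scaled by β = 0.01 as described above. The function name `augment_with_noise` and the array layout (one training spectrum per row) are assumptions made for the example.

```python
import numpy as np

def augment_with_noise(pixels, labels, beta=0.01, rng=None):
    """Append one noisy copy of every training spectrum.

    pixels: (n, M) training spectra; labels: (n,) or (n, K) labels.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(pixels.shape)  # zero mean, unit variance
    noisy = pixels + beta * eps              # additive perturbation
    # The noisy copies keep the labels of the pixels they were derived from.
    return np.concatenate([pixels, noisy]), np.concatenate([labels, labels])
```

The training set doubles in size, and each noisy copy differs from its source by a perturbation with standard deviation β per wavelength.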
We use this noise-based data augmentation together with the CNN as a baseline. In the following sections, we describe three tricks to enhance the CNN by exploiting spectral-spatial locality:
1. by constraining weights of the neural network corresponding to nearby wavelengths to assume similar values (see Section 2.4);
2. by generating pixels with smoothed spectra from neighbors of labeled pixels (see Section 2.5);
3. by propagating the label of a pixel to its neighbors and adding them to the training set (see Section 2.6).

Trick 1: Locality-Aware Regularization
We add a term to the CNN loss function, which penalizes large differences between values of adjacent weights, as done in [42]. In this way, we enforce that neighboring wavelengths have similar contributions to the generated features, thus taking advantage of the spectral locality of the data. The augmented loss function consists of the regularized cross-entropy loss term plus our regularization term, which constrains nearby weights to assume similar values:

$$L(x, y, W) = \underbrace{-\sum_{i=1}^{|I|} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}}_{\text{Cross-entropy loss}} + \lambda_1 \lVert W \rVert_2^2 + \underbrace{\lambda_2 \lVert w_1 - \mathrm{shift}(w_1) \rVert_2^2}_{\text{Locality-aware regularization}}. \tag{3}$$

Here, the variables are as in Equation (1). $\mathrm{shift}(\cdot)$ is an operation that shifts the elements of an array one position to the left, and $\lambda_2$ controls the new spectral-locality-aware regularization term.
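The locality-aware term can be sketched in a few lines of NumPy. The sketch assumes that the penalty is the squared L2 norm of the difference between each kernel and its shifted copy, and that the shift does not wrap around at the boundary; a circular shift (e.g., via `np.roll`) would add one extra term linking the first and last weight of each kernel. The function name `locality_penalty` is hypothetical.

```python
import numpy as np

def locality_penalty(w1, lam2):
    """Penalty on differences between adjacent kernel weights.

    w1: (n_kernels, N) convolutional kernels; adjacent columns correspond
    to neighboring wavelengths.
    """
    # Difference between each weight and its right-hand neighbor
    # (non-circular shift, an assumption).
    diffs = w1[:, 1:] - w1[:, :-1]
    return lam2 * float(np.sum(diffs ** 2))
```

The penalty is zero for kernels with constant weights and grows with the roughness of each kernel along the spectral axis, so larger values of λ2 push the learned kernels toward smooth spectral profiles.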

Trick 2: Smoothing-Based Data Augmentation
Spectra of nearby pixels are assumed to be related because they are part of an image containing semantically homogeneous components, such as urban or rural areas.
Recent state-of-the-art methods exploit this property in different ways, such as the use of patches to train a (deep) neural network [20,33,18], the generation of discriminatory features using Gabor filters [36,12] or the use of additive Gaussian noise in addition to linear combinations of training pixels [28].
Here, we just use a Gaussian smoothing filter, because of its simplicity and invariance to rotation of the image.This operation has been called spatial smoothing [43,22,12].
Keeping the same notation introduced in Section 2, we generate the smoothed image $P^{smt}$ as:

$$P^{smt}_{ijk} = \sum_{i', j'} \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(i - i')^2 + (j - j')^2}{2\sigma^2}\right) P_{i'j'k}, \tag{4}$$

where $P_{ijk}$ is the $k$-th wavelength of the $(i, j)$-th spectral pixel and $P^{smt}_{ijk}$ is the new smoothed wavelength. In practice, the above sum is computed over pixels $(i', j')$ whose distance from pixel $(i, j)$ is at most $3\sigma$. Figure 3 shows two pixel-spectra of the same class before and after spatial smoothing is applied.
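The truncated Gaussian smoothing can be sketched as follows. This is an illustrative NumPy version, not the code used in the experiments: the function name `gaussian_smooth` is hypothetical, and the weights are renormalized to sum to one over the pixels actually available in the window, an assumption that keeps the overall intensity unchanged and handles image borders.

```python
import numpy as np

def gaussian_smooth(image, sigma):
    """Spatially smooth every band of an (H, W, M) image.

    Only neighbors within distance 3 * sigma contribute; weights are
    renormalized per pixel (an assumption, see the text above).
    """
    H, W, M = image.shape
    r = int(np.floor(3 * sigma))
    offsets = np.arange(-r, r + 1)
    di, dj = np.meshgrid(offsets, offsets, indexing="ij")
    dist2 = di ** 2 + dj ** 2
    kernel = np.exp(-dist2 / (2.0 * sigma ** 2))
    kernel[dist2 > (3 * sigma) ** 2] = 0.0        # truncate at 3 * sigma
    padded_img = np.pad(image, ((r, r), (r, r), (0, 0)))
    padded_one = np.pad(np.ones((H, W)), r)       # tracks valid neighbors
    num = np.zeros((H, W, M))
    den = np.zeros((H, W))
    for a in range(2 * r + 1):
        for b in range(2 * r + 1):
            w = kernel[a, b]
            if w == 0.0:
                continue
            num += w * padded_img[a:a + H, b:b + W, :]
            den += w * padded_one[a:a + H, b:b + W]
    return num / den[:, :, None]
```

Each output pixel depends only on its own neighborhood, which is why the operation parallelizes well; the cost of this direct implementation grows with the window area, i.e., roughly with σ² per pixel.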

Trick 3: Label-Based Data Augmentation
We also exploit spectral-spatial locality with data augmentation at the semantic level, by assuming that neighbor pixels are likely to have the same class. According to this assumption, the label of a pixel in the training set can be propagated to its neighbors. The resulting labeled neighbor pixels are inserted into the training set, which becomes larger at the cost of introducing label noise. Indeed, this data augmentation procedure is likely to add new pixels with an incorrect label, and even copies of the same pixel labeled in different ways. In this way, the network is trained using a training set that contains pixels with uncertainty on their label. In order to keep the probability that our assumption is wrong as low as possible, we randomly sample only a subset of pixels in the Moore neighborhood of each pixel (consisting of its 8 surrounding pixels).
Furthermore, we can use this augmentation step to tackle the class unbalance, by favoring the selection of pixels in smaller classes. Specifically, for pixel $i$ in the training set with label $y_i$ and for each pixel $j$ in its neighborhood, $j$ is selected with probability:

$$p_j = \frac{\min(C)}{C_{y_i}}, \tag{5}$$

where $C_{y_i}$ is the number of pixels in the training set with label $y_i$, and $C = [C_1, \ldots, C_K]$ is the vector consisting of the number of pixels of each class. All selected neighbors are added to the (multi-)set $I$ of labeled pixels to give $I^{la}$. For any $j \in I^{la}$ that was added with label augmentation, its label will be $y^{la}_j = y_i$. This selection procedure biases the insertion toward more pixels from smaller classes.
In summary, our label-based data augmentation procedure can be described as follows: for each pixel $i$ in the training set,
1. find its Moore neighborhood;
2. select a subset of pixels in the neighborhood; see Equation (5);
3. propagate the label of $i$ to the selected neighbor pixels;
4. insert the selected pixels into the training set.
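The four steps above can be sketched as follows. The selection probability used here, min(C) / C_{y_i}, is an assumption made for the example: it gives probability one to neighbors of pixels in the smallest class and lower probabilities to neighbors of pixels in larger classes. The function name `label_augment` and the representation of pixels as (row, column) tuples are likewise hypothetical.

```python
import numpy as np

# Offsets of the 8 pixels in the Moore neighborhood.
MOORE = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def label_augment(coords, labels, shape, rng=None):
    """Propagate labels of training pixels to randomly selected neighbors.

    coords: list of (row, col) training pixels; labels: their integer classes;
    shape: (H, W) of the image. Returns the augmented multiset of
    coordinates and labels (duplicates are allowed).
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    # Assumed selection probability per class: min(C) / C_y.
    prob = dict(zip(classes.tolist(), (counts.min() / counts).tolist()))
    H, W = shape
    new_coords, new_labels = list(coords), labels.tolist()
    for (r, c), y in zip(coords, labels.tolist()):
        for dr, dc in MOORE:                        # step 1: Moore neighborhood
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < H and 0 <= cc < W
            if inside and rng.random() < prob[y]:   # step 2: random selection
                new_coords.append((rr, cc))         # step 4: insert the pixel
                new_labels.append(y)                # step 3: propagate the label
    return new_coords, np.asarray(new_labels)
```

Because the result is a multiset, the same pixel can appear several times with different labels, which is the label noise discussed above.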

Incorporating the Tricks into the Baseline CNN: CNN-RSL
The resulting method for hyperspectral image labeling, called CNN-RSL, incorporates into CNN the proposed three tricks: Regularization (R), Smoothing-based data augmentation (S) and Label-based data augmentation (L).
The 'augment-train-set' step of Algorithm 1 (also illustrated in Figure 4) consists of the following steps: 1. the original image is perturbed with random Gaussian noise Equation ( 2

Experiments
In this section, we describe the experiments conducted on the five groups of hyperspectral images. First, we describe the 16 algorithms considered in our experiments. Next, we report the results, which we also compare with published results from existing methods based on different approaches. Finally, we discuss these results.

Algorithms
We assess the performance of our baseline CNN with all combinations of the proposed tricks:
• R: spectral-locality-aware regularization term (see Section 2.4);
• S: smoothing-based data augmentation (see Section 2.5);
• L: label-based data augmentation (see Section 2.6).
The combination of the CNN with all three tricks yields CNN-RSL, while the other six combinations are: CNN-R (with R), CNN-S (with S), CNN-L (with L), CNN-RS (with R and S), CNN-RL (with R and L) and CNN-SL (with S and L).
Moreover, in order to investigate the effect of these tricks on other types of classifiers, we incorporate S and L also in the following methods:
• SVM-RBF: a support vector machine with the Radial Basis Function (RBF) kernel;
• HL-ELM: a deep convolutional neural network for hyperspectral image labeling for which we were able to retrieve the source code. HL-ELM has two convolutional and two max pooling hidden layers arranged one after the other (see [28]).
In order to tune these parameters, we use the standard Random Grid Search Cross-Validation framework (RGS-CV) [44]. Resulting values of the CNN-RSL parameters are given in Table 2.
We also use RGS-CV to select the value of σ, the parameter of our spatial smoothing procedure, from the set {1, 1.67, 2.33, 3, 3.67, 4.33, 5}. Like our neural network model, SVM-RBF also has a few parameters to be tuned.
For HL-ELM, we use the parameter setting described in [28].

Results and Discussion
The results of the experiments with few labeled pixels for training are given in Tables 3-8. CNN-RSL achieved the best performance in most cases, with significant improvement over the baselines. On the Pavia University dataset with 1% training data, CNN-S was most effective (95.01 mean accuracy), closely followed by CNN-RSL (94.74). On the Salinas dataset, the improvement in accuracy from CNN to both CNN-RSL and CNN-S was about 10%. In this case, CNN-RSL was slightly better than CNN-S. On the other hand, with 1% training data, on the KSC dataset, the improvement in accuracy from CNN to CNN-RSL was about 12% (from 78.09 to 90.34), while CNN-S achieved an 84.74 mean accuracy; and on the Indian Pines dataset, the improvement in accuracy from CNN to CNN-RSL was more than 30% (from 54.83 to 86.42), while CNN-S achieved a 70.93 mean accuracy. Overall, statistical tests showed the superiority of our method when using all tricks. The increase in accuracy with respect to CNN, SVM-RBF and HL-ELM was higher when fewer training pixels were used, since smoothing-based and label-based data augmentation were more beneficial in that case. Clearly, these two tricks also helped to improve the performance of SVM-RBF and HL-ELM. As expected, by increasing the number of training pixels per class, the average test accuracy of all methods increased.

Table 3: Test set classification accuracy on the Pavia Center dataset, when using 1-5% randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. We also list the mean and standard deviation of the number of training pixels per class for each training %. The best accuracy for each training set is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).

Existing methods using a transductive setting, such as [19,21,18], have been shown to achieve very good results on these datasets when 200 labeled pixels for each class are used as the training set. Our method also achieves excellent performance in this setting: it improves significantly over the considered baselines, according to a binomial test for comparing classifiers [45]; see Table 9.
Table 10 reports the results of our method and published results of the following state-of-the-art methods based on different approaches: discriminative low-rank Gabor filtering [12], multiple kernel learning [30], kernel sparse representation [34] and probabilistic class structure regularized sparse representation graph [31,29]. Unfortunately, due to the diversity of choices regarding the number of training pixels, it is not possible to completely fill Table 10. CNN-RSL achieved higher accuracy compared to the other considered methods. Only in three cases, namely on the Pavia University dataset with 1 and 5% pixels and the Indian Pines dataset with 10% pixels available for training, [29,34] reported a higher accuracy than CNN-RSL.
The test set accuracies in the non-overlapping setting are reported in Table 11. In this setting, spatial smoothing (the S trick) can only be used in a limited way: each pixel in the training set was smoothed using only the other pixels in the training set. Label-based data augmentation (the L trick) cannot be applied any longer, since this step would add new pixels from the image to the training set. Spectral-locality-aware regularization (the R trick) can still be used, since it does not involve the use of pixels that are not in the training set. As we can see, also in the non-overlapping setting, smoothing-based data augmentation helped to achieve a higher accuracy for all the methods we used. Unsurprisingly, since we took a single 7 × 7 patch of pixels per class as the training data, the performance of all the methods was much lower than in the transductive setting reported in Tables 3-7. In particular, if there is a large variation in the spectra of a single class, we will miss this by using only a single patch per class. Since we did not use label-based data augmentation, our reference method here was CNN-RS.
In general, results in the non-overlapping learning setting showed a large decrease in performance compared with that in the transductive learning setting. Nevertheless, in this setting, spectral-locality-aware regularization and data augmentation, the latter used in a very limited form, were still beneficial, with significant increases in accuracy on the KSC and Indian Pines images.
Overall, the results of all experiments substantiated the beneficial effect of the proposed tricks, which we discuss below.
In general, smoothing-based data augmentation (the S trick) introduces spatial locality into each pixel's spectrum by averaging it with its neighboring pixels' spectra. Since neighboring pixels are likely to belong to the same area, spatial smoothing makes nearby spectra look more alike and eases the network's classification task. Smoothing-based data augmentation has the largest impact on the test accuracy, with significant improvements across all datasets, notably on Indian Pines and Salinas.
Label augmentation (the L trick) had a bigger impact on small classes, which were also the most difficult to classify correctly, especially in a setting with very few training samples. For a large training set, label-based data augmentation may have a detrimental effect, which was nevertheless mitigated or neutralized when used in combination with the other components of CNN-RSL. In particular, label-based data augmentation had a clearly beneficial effect for the KSC and Indian Pines datasets, which were the datasets having more classes and fewer pixels. The label augmentation trick tended to balance the classes by selecting more new training samples from smaller classes. In our experiments with 10 labeled pixels per class (see Table 8), the training set was already class balanced. In this case, label augmentation still improved the results, but the difference was not as large as in the experiments with class unbalanced training data. In particular, for the KSC dataset, a 6% gain in accuracy was achieved by the baseline CNN when using 10 labeled samples per class instead of 2% of randomly selected labeled pixels as the training set (from 83.98% to 90.89%), although selecting 2% of the data results in 10 samples per class on average (see Table 1). On the other hand, with CNN-RSL, the gain in accuracy was only 2.5% (from 95.36% to 97.85%). This shows that our data augmentation tricks mitigated the negative effect of the class unbalanced distribution of the training set.
Spectral-locality-aware regularization (the R trick) helped to achieve a higher classification accuracy when used in conjunction with data augmentation, as can be seen by the reduced accuracy of CNN-SL compared to CNN-RSL. Notably, on the Indian Pines image, with only 1% labeled pixels, a gain of almost 20% was achieved when using locality-aware regularization and data augmentation over using only data augmentation. Locality-aware regularization also helped to improve accuracy on the other datasets, although the gain was not as big as for Indian Pines.

Table 5: Test set classification accuracy on the KSC dataset, when using 1-5% randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. We also list the mean and standard deviation of the number of training pixels per class for each training %. The best accuracy for each training set is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).
We conclude this section with a discussion about the convergence and run time of CNN-RSL. The run time of CNN-RSL depends on the number of pixels and on the value of σ used for the spatial smoothing. In fact, the necessary time for spatial smoothing is proportional to both σ and the number of pixels. This is a disadvantage of CNN-RSL with respect to algorithms that use deep architectures and no spatial smoothing. However, spatial smoothing can be highly parallelized given that each pixel is smoothed independently of the other pixels. Consequently, its running time can be drastically reduced [47]. In Figure 6, we report the running time of CNN-RSL using a single CPU with a 2300-MHz clock speed. The values refer to the time needed for predicting a single pixel, and they also include the preprocessing time.
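The dependence of the smoothing cost on σ and on the number of pixels can be illustrated with a small NumPy sketch. It assumes a separable Gaussian filter truncated at 3σ and applied along the two spatial axes of an (H, W, B) cube with reflect padding; the kernel form, the truncation factor and the function names are illustrative assumptions rather than details of the CNN-RSL code.

```python
import numpy as np

def gaussian_kernel(sigma, truncate=3.0):
    """1D Gaussian kernel; its length grows linearly with sigma."""
    radius = max(1, int(truncate * sigma + 0.5))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def smooth_spatially(cube, sigma):
    """Smooth an (H, W, B) cube along rows and columns only, band by band."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    # Reflect-pad the two spatial axes so the output keeps its shape.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")

    def conv(v):
        return np.convolve(v, k, mode="valid")

    tmp = np.apply_along_axis(conv, 0, padded)  # filter along the row axis
    return np.apply_along_axis(conv, 1, tmp)    # filter along the column axis
```

Each output pixel is a weighted sum over a window whose length is proportional to σ, and the sums for different pixels do not depend on each other, which is why the operation parallelizes well.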

Conclusions
We have introduced a simple method based on convolutional neural networks and data augmentation for spectral-spatial classification of remotely-sensed hyperspectral images.
The main characteristic of our method is its capability to exploit spectral-spatial information at both the data level (through data augmentation) and the classifier level (through the locality-aware regularization). We proposed two types of data augmentation: smoothing-based, which constructs new pixels from the spectra of neighbors of the labeled pixels, and label-based data augmentation, which expands the training set with neighbors of the labeled pixels. Smoothing-based data augmentation consistently improves the test accuracy of the tested methods, while the contribution of label-based data augmentation is mostly beneficial for datasets with many small classes and skewed class distributions. Furthermore, we modified the loss function by inserting a term to penalize the difference among network weights corresponding to nearby wavelengths of the spectra.
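One plausible form of such a penalty is sketched below: the sum of squared differences between weights at adjacent spectral positions of each filter, scaled by a strength `lam`. The squared-difference form, the restriction to directly adjacent positions and the names `locality_penalty` and `lam` are assumptions for illustration; the exact term used by CNN-RSL is the one defined in the method section of the paper.

```python
import numpy as np

def locality_penalty(weights, lam):
    """Sum of squared differences between weights of adjacent positions.

    weights: (n_filters, filter_width) array; columns ordered by wavelength.
    lam: regularization strength (hypothetical name).
    """
    diffs = np.diff(weights, axis=1)
    return lam * np.sum(diffs ** 2)

def locality_penalty_grad(weights, lam):
    """Gradient of `locality_penalty` with respect to `weights`."""
    diffs = np.diff(weights, axis=1)
    grad = np.zeros_like(weights, dtype=float)
    grad[:, :-1] -= 2.0 * lam * diffs  # each difference pulls its left weight
    grad[:, 1:] += 2.0 * lam * diffs   # and its right weight toward each other
    return grad
```

A filter with constant weights incurs no penalty, while a filter whose weights alternate in sign is penalized heavily, so the term pushes neighboring wavelengths toward similar contributions.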
Both CNNs and data augmentation have been widely used in hyperspectral classification [23]. Therefore, at first glance, the contribution of the proposed method seems limited. Nevertheless, CNN-RSL differs from previous methods in two main aspects: (1) the considered CNN architecture is a very basic shallow architecture with only one hidden layer, without pooling and fully-connected layers, which is advantageous because it does not need a large amount of data or computational resources for training as deep neural networks do, and it is more robust to overfitting; (2) we perform data augmentation not only with a rather standard smoothing-based technique, but also with a new label-based technique to favor the selection of pixels in smaller classes, which is beneficial when few labeled pixels are available and when the class distribution is skewed.
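To show how small such a network is, the forward pass for one pixel can be sketched in a few lines of NumPy, following the description in Figure 2: 1D convolutions over the spectrum, flattening of the feature maps, and a final layer that outputs the class prediction. The ReLU activation, the softmax output, the stride of 1 and all parameter names are assumptions made for this sketch.

```python
import numpy as np

def forward(x, filters, bias, W_out, b_out):
    """Class probabilities for one pixel spectrum.

    x: (B,) spectrum; filters: (K, w); bias: (K,);
    W_out: (C, K * (B - w + 1)); b_out: (C,).
    """
    # Sliding windows over the spectrum: shape (B - w + 1, w).
    windows = np.lib.stride_tricks.sliding_window_view(x, filters.shape[1])
    # 1D convolution in cross-correlation form, one feature map per filter.
    maps = filters @ windows.T + bias[:, None]   # (K, B - w + 1)
    hidden = np.maximum(maps, 0.0)               # ReLU is an assumption
    logits = W_out @ hidden.ravel() + b_out      # output layer on flat maps
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

The number of trainable parameters is K(w + 1) for the convolutional layer plus C(K(B - w + 1) + 1) for the output layer, which stays modest even for spectra with a few hundred bands.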
An advantage of the proposed method is its modularity, which favors qualitative analysis of the contribution of the individual tricks, as well as their embedding in other types of neural networks. The results of our extensive comparative analysis demonstrated the usefulness of the method.
Our data augmentation approach uses neighbors of training pixels, that is, test samples, when building a classifier. This transductive learning setting is the natural setting for hyperspectral image classification. When no overlap between training and test data is allowed, our label augmentation strategy cannot be used. Nevertheless, a limited form of smoothing data augmentation and the spectral-locality-aware regularization term can still be used. The results of experiments showed a substantial drop in accuracy with the non-overlapping learning setting.
Our approach considers a single image. In future work, we intend to adapt the approach to multiple images. For instance, in a dynamic setting, where time-series spectral images are given in order to study seasonal changes of vegetation species, we intend to develop multi-channel convolutional neural networks with locality-aware regularization to enforce smooth change in time.
To guarantee full reproducibility of all results and to facilitate direct usage of CNN-RSL, the source code of our method is publicly available at https://bitbucket.org/TeslaH2O/cnn_hyperspectral.

Figure 1: Unlabeled pixels for each of the considered hyperspectral images.

Figure 2: Single hidden convolutional layer CNN architecture. The input of the CNN is the spectral feature vector of a pixel to which 1D convolutions are applied in the convolutional layer. Afterwards, the resulting feature maps are flattened and fed to the last, fully-connected, layer, which outputs the class prediction of the input pixels.

Figure 3: Effect of spatial smoothing on the spectra of two neighboring pixels: original image (left); image after spatial smoothing (right). Spectra look more similar after spatial smoothing.

Figure 4: CNN-RSL (Regularization (R), Smoothing-based data augmentation (S) and Label-based data augmentation (L)) data processing flowchart. Data augmentation is applied to the original hyperspectral image. The labeled pixels from each of the three hyperspectral images (original, noisy and smoothed) form the training set (x, y), which is used to train the CNN with the spectral locality-aware regularization term (CNN-R).
Figure 5 illustrates the convergence behavior of our loss function during the training of CNN-RSL on one of the datasets used in the experiments. To assess convergence, we use early stopping. The training stops when the validation error does not decrease for at least 100 epochs.
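The stopping rule can be written as a short helper. This is a minimal sketch that assumes the validation error is recorded once per epoch and that "does not decrease" means no improvement over the best value seen so far; the function name and the `patience` parameter are introduced here for illustration.

```python
def early_stopping_epoch(val_errors, patience=100):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation error has not improved on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    since_best = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None
```

In practice, the model weights from the epoch with the best validation error would be kept, a common companion to this rule that is also an assumption here.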

Figure 5: Convergence behavior of the CNN-RSL loss function (average over 10 folds of cross-validation on the KSC dataset).

Table 1 :
Description of the hyperspectral images. KSC, Kennedy Space Center.

Table 2 :
Parameter values of CNN-RSL trained with 1% labeled pixels of each class using the Random Grid Search Cross-Validation (RGS-CV) framework.

Table 4 :
Test set classification accuracy on the Pavia University dataset, when using 1-5% randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. We also list the mean and standard deviation of the number of training pixels per class for each training %.

Table 6 :
Test set classification accuracy on the Indian Pines dataset, when using 1-5% randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. We also list the mean and standard deviation of the number of training pixels per class for each training %. The best accuracy for each training set is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).

Table 7 :
Test set classification accuracy on the Salinas dataset, when using 1-5% randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. We also list the mean and standard deviation of the number of training pixels per class for each training %. The best accuracy for each training set is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).

Table 8 :
Average classification accuracy of test data over 10 runs using 10 randomly sampled labeled pixels per class for training. The best accuracy for each dataset is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).

Table 9 :
Test set classification accuracy when using 200 randomly sampled labeled pixels per class for training. We report the mean and standard deviation over 10 runs. The best accuracy for each dataset is indicated in bold. With the exception of methods from [19,18], an '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).

Table 10 :
Average classification accuracy reported on other experimental settings used in the literature.We considered the following methods: Low-Rank and Sparse Representation Classifier with a Spectral Consistency Constraint (LRSRC-SCC), Probabilistic Class Structure Regularized Sparse Representation (PCSSR), Multiple Kernel Learning (MKL), Discriminative Low-Rank Gabor Filtering (DLRGF), Kernel Sparse Representation (KSR), Image Fusion and Recursive Filtering (IFRF) and our method (CNN-RSL).The best accuracy for each dataset and % training samples employed is indicated in bold.

Table 11 :
Average classification accuracy of test data over 10 runs in the non-overlapping learning setting (see Section 2.2.2). The best accuracy for each dataset is indicated in bold. An '*' means that the best accuracy is significantly better than the accuracy achieved by the corresponding method according to a binomial test for comparing classifiers [45] (p-value < 0.05).