Dual-Window Superpixel Data Augmentation for Hyperspectral Image Classification

Deep learning (DL) has been shown to obtain superior results for classification tasks in the field of remote sensing hyperspectral imaging. Superpixel-based techniques can be applied to DL, significantly decreasing training and prediction times, but the results are usually far from satisfactory due to overfitting. Data augmentation techniques alleviate the problem by synthetically generating new samples from an existing dataset in order to improve the generalization capabilities of the classification model. In this paper, we propose a novel data augmentation framework in the context of superpixel-based DL called dual-window superpixel (DWS). With DWS, data augmentation is performed over patches centered on the superpixels obtained by the application of simple linear iterative clustering (SLIC) superpixel segmentation. DWS divides the input patches extracted from the superpixels into two regions and independently applies transformations over them. As a result, four different data augmentation techniques are proposed that can be applied to a superpixel-based CNN classification scheme. An extensive comparison in terms of classification accuracy with other data augmentation techniques from the literature using two datasets is also shown. One of the datasets consists of small hyperspectral scenes commonly found in the literature. The other consists of large multispectral vegetation scenes of river basins. The experimental results show that the proposed approach increases the overall classification accuracy for the selected datasets. In particular, two of the data augmentation techniques introduced, namely, dual-flip and dual-rotate, obtained the best results.


Introduction
Hyperspectral images (HSIs) are formed by a grid of pixels, each of them represented by a high-dimensional vector capturing a fraction of the electromagnetic spectrum for that point, sampled at different wavelengths [1]. The high density of the spectral information contained in HSIs, in the order of tens to hundreds of bands for a single scene, increases the ability to identify the materials present in it. This characteristic makes HSIs popular candidates for supervised classification in the field of remote sensing [2].
Deep learning (DL) models have been introduced in the last few years for HSI classification tasks [3][4][5][6] with promising results. Convolutional neural networks (CNNs) [7], in particular, have been successfully used for solving problems requiring multi-class, multi-label classification involving feature extraction (FE) from images [8]. CNNs operate over small cubes of data called patches instead of relying on spectral information alone. These patches are centered around a pixel of the image and taken from a sliding window of a certain size in order to extract spatial-spectral information. A patch is extracted for every pixel of the image using this procedure.
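As an illustration of this sliding-window extraction (a minimal sketch, not the authors' code; the reflect padding used at the image borders is an assumption of this sketch), a patch centered on a given pixel can be obtained as follows:

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Extract a size x size x B patch centered on pixel (row, col).

    The image is reflect-padded so that border pixels also yield
    full-sized patches (the padding mode is an assumption).
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                    mode="reflect")
    # After padding, (row, col) in the original image maps to
    # (row + half, col + half) in the padded image.
    return padded[row:row + size, col:col + size, :]

# Toy hyperspectral cube: 10 x 10 pixels, 5 bands.
cube = np.random.rand(10, 10, 5)
patch = extract_patch(cube, 0, 0, 5)
print(patch.shape)  # (5, 5, 5)
```

In pixel-based classification this function would be called once per pixel, which is precisely the cost the superpixel-based scheme below avoids.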
CNNs generally require large amounts of training samples in order to prevent overfitting. Data augmentation is a technique that synthetically generates new samples by applying a set of domain-specific transformations over the original input dataset to improve the generalization capabilities of a classification model. Several data augmentation techniques applicable to HSIs have been proposed, most of which are based on geometric transformations commonly used for image recognition tasks [9]. More recently, [10,11] described hyperspectral data augmentation techniques where pixels are grouped in blocks and different block pairs are used as the input to a CNN. In [12], samples in the original dataset were shifted along its first principal component or based on the average value in each band. Augmentation based on randomly erasing parts of the input patches has also been proven effective for HSI classification in [13]. Finally, generative adversarial networks have been proposed recently as a data augmentation technique in order to generate new samples mimicking the distribution of the original data [14][15][16].
Segmentation is a preprocessing technique capable of simplifying images, reducing them to meaningful, independent regions of pixels with high intra-region similarity and high inter-region dissimilarity called segments. The simplification makes this technique useful to reduce the complexity of subsequent processing tasks. Examples of the use of superpixels in hyperspectral classification as a way to exploit context information can be found in [17][18][19], and as part of a DL classification scheme in [20,21]. In our proposal, superpixel segmentation [22] is used to reduce the computational cost associated with CNN-based classification. During the training and prediction stages, only a representative subset of pixels from each superpixel of the image is selected, allowing for a significant reduction of the computational cost when compared to the sliding window extraction from pixel-based classification.
In this paper, several augmentation techniques relying on geometric transformations aimed at efficient, superpixel-based DL classification for large images are introduced. The main contributions are:

1. A data augmentation framework called dual-window superpixel (DWS), based on a combination of superpixel segmentation for patch extraction and geometric transformations, is proposed. Patches are divided into two regions and the transformations are applied independently to each. This framework is introduced as part of a CNN classification scheme capable of improving the classification accuracy and significantly reducing the execution time of the classification process.
2. A number of fast and simple data augmentation techniques based on the DWS data augmentation framework are also proposed.
The rest of the paper is divided into four sections. Section 2 describes DWS, the proposed classification scheme and the derived data augmentation techniques. Section 3 details the experimental setup and lists the results obtained. Section 4 presents the discussion about the experimental results. Finally, Section 5 summarizes the main conclusions.

Dual-Window Superpixel Data Augmentation (DWS)
This section describes in detail the DWS data augmentation framework developed as part of this work, and the data augmentation techniques obtained from it. The main stages of DWS are explained below.

Superpixel-Based Patch Extraction
Usually, in hyperspectral classification using CNNs, a sliding window of a certain size is applied [23] to extract spatial-spectral information from the image I. The contents of this window P, also called patch, are then fed to the network.
The proposed scheme replaces the sliding window with the extraction of patches based on superpixel information in order to reduce the computational cost. These superpixels are obtained by applying simple linear iterative clustering (SLIC) [24], a low-complexity superpixel segmentation method commonly used in computer vision, to the image I. Figure 1 shows the result of applying SLIC to the Salinas hyperspectral scene. Strong adherence of the superpixel boundaries to the edges of the image can be observed.  Figure 2 shows the first stages of DWS, related to the acquisition of the patches, prior to the data augmentation itself. The chosen patch size in this example is 25 × 25 × B, B being the number of bands in the image I. After the application of SLIC to I, each of the resulting superpixels is considered a sample, and a patch is extracted from its center. This reduces the number of processed patches from W × H, W and H being the spatial dimensions of the image, to a much smaller number of superpixels.
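The superpixel-based extraction can be sketched as follows. This is an illustrative reimplementation, not the authors' code: in practice the label map would be produced by SLIC (e.g., skimage.segmentation.slic), and approximating the superpixel center by the rounded centroid of its pixels is an assumption of this sketch.

```python
import numpy as np

def superpixel_centers(labels):
    """Return one (row, col) center per superpixel in the label map.

    The center is approximated by the rounded centroid of the
    superpixel's pixels (an assumption of this sketch).
    """
    centers = []
    for lab in np.unique(labels):
        rows, cols = np.nonzero(labels == lab)
        centers.append((int(round(rows.mean())), int(round(cols.mean()))))
    return centers

def extract_superpixel_patches(image, labels, size):
    """Extract one size x size x B patch per superpixel center."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                    mode="reflect")
    return [padded[r:r + size, c:c + size, :]
            for r, c in superpixel_centers(labels)]

# Toy example: a 6 x 6 image with 4 square "superpixels" of 3 x 3 pixels.
image = np.random.rand(6, 6, 5)
labels = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 3, axis=0),
                   3, axis=1)
patches = extract_superpixel_patches(image, labels, 5)
print(len(patches), patches[0].shape)  # 4 (5, 5, 5)
```

Only one patch per superpixel is processed, which is where the reduction from W × H patches to the (much smaller) number of superpixels comes from.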

Patch Subdivision
Augmentation operations are commonly performed on the patch as a whole: the complete patch is flipped, rotated, etc. This paper hypothesizes that, in contrast, applying transformations over independent regions of the patch produces better results. Figure 2 depicts the patch subdivision stage proposed in this paper. Patches are divided into two regions according to the distance from the central pixel. In the example, the inner region is set to 15 × 15 pixels. This subdivision makes it possible to apply transformations to the inner and outer regions of the patch independently. Any transformation able to operate over a patch of data can be applied, regardless of its nature.
The process of patch extraction and subdivision is shown in more detail in Figure 3. On the left, the borders of the different superpixels obtained after applying SLIC over the image are depicted, along with the central pixel of one of the superpixels. The black square centered on that same pixel represents a patch of the desired dimensions that will be extracted at that location. The patch, shown on the right, is then subdivided into two regions as part of this stage.
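The subdivision step can be sketched as follows (an illustrative implementation, not the authors' code): the inner region is a smaller square window sharing the patch's central pixel, and a boolean mask separates it from the outer region.

```python
import numpy as np

def inner_region_mask(patch, inner_size):
    """Return a boolean mask selecting the inner region of a square patch.

    The inner region is an inner_size x inner_size window centered on
    the patch's central pixel; everything outside it is the outer region.
    """
    size = patch.shape[0]
    margin = (size - inner_size) // 2
    mask = np.zeros(patch.shape[:2], dtype=bool)
    mask[margin:margin + inner_size, margin:margin + inner_size] = True
    return mask

patch = np.random.rand(25, 25, 8)        # 25 x 25 patch, 8 bands
inner = inner_region_mask(patch, 15)     # 15 x 15 inner region, as in the paper
print(inner.sum(), (~inner).sum())       # 225 400
```

With the sizes used in the paper (25 × 25 patch, 15 × 15 inner region), the inner region covers 225 pixels and the outer ring the remaining 400.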

Patch Transformation
Several augmentation techniques are introduced in this paper using the patch subdivision principle described in the previous section in combination with the traditional rotate and flip transformations. They can be divided into techniques with transformations applied to the inner region (prefixed with the term inner) and techniques with transformations applied to both the outer and inner regions independently (prefixed with the term dual). Figure 4 shows examples of the outputs of all the techniques considered in this paper. None indicates that no data augmentation is applied to the input patch. Random occlusion (RO), as presented in [13], performs selective data erasing. Rotate and flip are the traditional data augmentation techniques where the homonymous operation is applied over the whole patch. In addition to those, the following techniques are proposed. The dual-flip 16× technique is illustrated in Figure 5. The arrows next to the patches indicate flip transformations performed on certain axes: long arrows represent flips applied to the whole patch, whereas small arrows represent the same operation applied only to the inner region. During dual-flip 16× patch transformation, the following operations take place: first, the identity and the flip transformations over the horizontal axis, the vertical axis and both axes combined are applied to the full patch, producing patches 1, 5, 9 and 13. Next, for each of the outputs from that step, the same flip transformations are applied only to the inner region (row of transformations at the bottom of the figure). ...
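Under these definitions, dual-flip 16× can be sketched as follows (an illustrative reimplementation; function names are not from the paper). Each of the four whole-patch flips is combined with each of the four inner-region flips, yielding 16 variants per patch:

```python
import numpy as np

def flip(a, mode):
    """Flip a patch: '' (identity), 'h', 'v' or 'hv'."""
    if 'h' in mode:
        a = a[::-1, :, :]   # flip over the horizontal axis (rows)
    if 'v' in mode:
        a = a[:, ::-1, :]   # flip over the vertical axis (columns)
    return a

def dual_flip_16(patch, inner_size):
    """Return the 16 dual-flip variants of a square patch."""
    size = patch.shape[0]
    m = (size - inner_size) // 2          # margin around the inner region
    variants = []
    for outer_mode in ('', 'h', 'v', 'hv'):
        outer = flip(patch, outer_mode)   # flip applied to the whole patch
        for inner_mode in ('', 'h', 'v', 'hv'):
            v = outer.copy()
            # Re-flip only the inner window, leaving the outer ring as-is.
            v[m:m + inner_size, m:m + inner_size, :] = flip(
                outer[m:m + inner_size, m:m + inner_size, :], inner_mode)
            variants.append(v)
    return variants

patch = np.random.rand(25, 25, 8)
out = dual_flip_16(patch, 15)
print(len(out))  # 16
```

The 4× variants described later simply keep one of the two loops; dual-rotate follows the same structure with np.rot90 in place of the flips.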

Results
This section contains information about the experimental conditions, datasets used in the experiments and parameter selection. Lastly, the classification results obtained are presented.

Experimental Conditions
All the experiments were run using the classification scheme and network architecture described in Figure 6. The figure shows, on the left, a patch extracted from the HSI; the black boundaries represent the superpixel edges. The extraction process is explained in Section 2.1. The data augmentation technique of choice was then applied to the patch. The data augmentation techniques based on DWS are explained in Section 2.3. The patches resulting from this data augmentation process were then fed to the CNN. The network consists of two blocks of 2D-convolutional layers coupled with 2D-max-pooling layers, followed by two dense layers. With the aim of reducing overfitting, two dropout layers were added to the network, both using an aggressive dropout ratio of 0.5. Table 1 details the parameters for each layer. ELU activations were used for all layers due to the advantages this function has over others such as the ReLU family; namely, ELU provides more robust training and faster learning [25]. All training runs lasted 112 epochs using the NADAM [26] optimizer with learning_rate = 0.0001, β1 = 0.9, β2 = 0.999 and ε = 1 × 10−7.

Table 1. Neural network architecture.

# | Type | Output Shape | Activation | Filter Size | Stride | Padding

In the experiments, data augmentation was performed online for the first epoch and then cached for the rest of the training. The techniques considered are a set of commonly used data augmentation techniques from the literature and the four proposals introduced as part of this work. In the first group we have: rotate, the standard 90-degree rotation applied four times; flip, the standard flip over both axes of the image; PVS(+/−), or pixel-value shift augmentation, where pixel values of the input image are shifted relative to the average band [12]; and random occlusion, which removes rectangular regions of the patch on up to 50% of the input patches in a batch [13]. The new proposals to be studied include inner-rotate, inner-flip, dual-rotate and dual-flip, as described in Section 2.3. The results are compared in terms of classification accuracy after the application of the different data augmentation techniques.
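The baseline geometric techniques can be sketched as follows (an illustrative reimplementation, not the authors' code; the way the occlusion rectangle is sampled is an assumption, since the exact parameters of [13] are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_4x(patch):
    """Standard rotation augmentation: the four 90-degree rotations."""
    return [np.rot90(patch, k) for k in range(4)]

def flip_4x(patch):
    """Standard flip augmentation: identity, horizontal, vertical, both."""
    return [patch, patch[::-1], patch[:, ::-1], patch[::-1, ::-1]]

def random_occlusion(patch, max_frac=0.5):
    """Erase one random rectangle of up to max_frac of each side length.

    The rectangle sampling scheme is an assumption of this sketch.
    """
    h, w = patch.shape[:2]
    rh = rng.integers(1, int(h * max_frac) + 1)   # rectangle height
    rw = rng.integers(1, int(w * max_frac) + 1)   # rectangle width
    r0 = rng.integers(0, h - rh + 1)              # top-left corner
    c0 = rng.integers(0, w - rw + 1)
    out = patch.copy()
    out[r0:r0 + rh, c0:c0 + rw, :] = 0.0          # erase the region
    return out

patch = rng.random((25, 25, 8))
print(len(rotate_4x(patch)), len(flip_4x(patch)))  # 4 4
```

These whole-patch operations are exactly what the DWS techniques generalize by applying them to the inner and outer regions independently.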
All the tests were run on a 6-core Intel i5 8400 CPU at 2.80 GHz and 48 GB of RAM, and an NVIDIA GeForce GTX 1060 with 6 GB. All experiments were run under Ubuntu Linux 16.04 64-bits, Docker 19.03.5, Python 3.6.7, Tensorflow 2.0.0 and CUDA toolkit 10.0. All the training instances were performed on Tensorflow with GPU support enabled, using single-precision arithmetic.

Datasets
This section describes the characteristics, including the composition of the disjoint training and testing sets, of the seven scenes used to evaluate the performance of the proposed data augmentation techniques. Three widely available hyperspectral scenes [27] from the literature (the standard dataset, from now onwards) and four large multispectral images from river basins belonging to the Galicia dataset [28] were considered. All the images of the Galicia dataset were captured at an altitude of 120 m by a UAV mounting a MicaSense RedEdge multispectral camera [29]. Their spatial resolution is 0.082 m/pixel and they cover a spectral range from 475 to 840 nm. Figures 7 and 8 display the false color composites and reference data for the scenes of the standard and Galicia datasets, respectively.
More specifically, the Salinas Valley, Pavia University and Pavia Centre scenes from the standard dataset, along with the River Oitavén, Creek Ermidas, Eiras Dam and River Mestas scenes from the Galicia dataset, were selected for the experiments. The detailed descriptions of the scenes are as follows:

1. Salinas Valley (Salinas): Mixed vegetation scene in California. It was obtained by the NASA AVIRIS sensor with a spatial resolution of 3.7 m/pixel, covering a spectral range from 400 to 2500 nm. The image is 512 × 217 pixels and has 220 spectral bands. The reference information contains sixteen classes. The scene is located at 36°39′33.8″ N, 121°39′58.7″ W.

Tables 2 and 3 show the number of samples of each class for all the scenes in all the scenarios considered in this comparison. The scenes from both datasets were split as follows: 60% of the samples for training, 20% for testing and 20% for validation for superpixel-based classification. Table 2 also displays the number of samples for pixel-based classification; 20% of the samples were used, again, as the validation set in this scenario. The number of samples was chosen to prevent an excessively high baseline accuracy when no data augmentation technique was used.

Scenes 1 to 3 were segmented using SLIC with a superpixel size of 50 and a compactness parameter of 20, whereas scenes 4 to 7 used a superpixel size of 800 and a compactness parameter of 40. The compactness determines the balance between spatial and spectral proximity, with higher values favoring spatial proximity and causing segments to take on a more square shape.
Two augmentation factors, 4× and 16×, were considered, and are displayed in the tables next to the name of each data augmentation technique. For every experiment, the following three accuracy measures [30] are reported: overall accuracy (OA), the fraction of all pixels correctly predicted; average accuracy (AA), the mean of the per-class accuracies; and Kappa (κ), which measures the agreement between pixel predictions across all classes while also accounting for agreement attributable to chance [31]. The values shown are the results of 20 Monte Carlo runs for each scenario. All the values were obtained under identical experimental conditions.
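The three measures can be computed from a confusion matrix; the following is a standard implementation for reference, not tied to the authors' code:

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA and Cohen's kappa from integer label vectors."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                     # confusion matrix: rows = truth
    n = cm.sum()
    oa = np.trace(cm) / n                 # fraction of correct predictions
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class accuracy
    # Chance agreement: product of marginal frequencies, summed per class.
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])
oa, aa, kappa = accuracy_metrics(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.833 0.833 0.75
```

For balanced class sizes, as in this toy example, OA and AA coincide; they diverge when some classes dominate the scene.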

Superpixel-Based Classification
This section contains the experimental results of the proposal for superpixel-based classification, as described in Section 2. A single scenario, training the network with 60% of the labeled superpixels available, was considered. The classification performance was measured at the pixel level, i.e., assigning the same label to all the pixels in the same superpixel.
In order to select values for the superpixel size and patch size, some tests were run using one image considered representative of each dataset: PaviaC was selected from the standard dataset and Oitaven from the Galicia dataset. The results were obtained for two scenarios, one where no augmentation was applied and a second one where the DWS-based dual-flip 16× technique was used in order to improve accuracy. Superpixel size values represent the average superpixel area used in the SLIC segmentation, and patch size values represent the side length of the square patch used in the CNN classification process. Inner region sizes are 10 pixels smaller than the corresponding outer sizes; this relationship was chosen as a trade-off between the variability introduced by the transformations applied to the data and the relevance of the inner region. Small changes in inner patch size do not significantly alter the results obtained. Figures 9 and 10 show the overall classification accuracy for the PaviaC and Oitaven images, respectively, as the superpixel area (left) and the patch size (right) vary. In general, we can observe an inverse correlation between superpixel size and the accuracy of the CNN model. Without data augmentation, accuracy is correlated with patch size, and bigger patches produce better results for the Oitaven image; the effect is smaller in the case of PaviaC, where the oscillations in the graph are caused by the dispersion of the values across experiments. Finally, the results show a very limited increase in accuracy as the patch size grows when the dual-flip augmentation technique is used. The patch size selected for the experiments was 25 × 25 pixels, with an inner region size of 15 × 15.
The complexity of the proposed CNN is low, and as such, increases in patch size have a small impact during training, allowing us to work with larger amounts of data at this stage. In contrast, the number of samples, especially when augmentation is applied, has a large impact on the speed of the training stage. For this reason, and in order to keep computation times moderate, 50 and 800 were selected as the superpixel sizes for the scenes of the standard and Galicia datasets, respectively.

Figures 11-14 show the evolution of the training metrics across the training epochs. The additional data generated by the augmentation methods cause the training process to converge earlier; as a result, significantly steeper curves in both loss and accuracy can be observed.

Table 4 shows the classification results for Salinas, PaviaU and PaviaC. These images were used for comparison purposes due to their prevalence in land-cover classification papers. Nevertheless, scenes this small see limited benefits from superpixel-level classification, as they are low-resolution and contain very small and irregular structures. Dual-rotate 16× obtained the highest OA for Salinas, which contains bigger and more regular structures. The best results for the PaviaU and PaviaC scenes were obtained by the flip 4× and rotate 4× techniques.

Table 5 shows the classification results for the large multispectral scenes of the Galicia dataset, namely, Oitaven, Ermidas, Eiras and Mestas. The proposed classification scheme obtained high accuracy across all datasets, even when no data augmentation was applied during training. Among the techniques tested, approaches based on the DWS framework introduced in this work achieved the best results: dual-rotate 16× and dual-flip 16× reached 96.20% and 96.19% OA for Oitaven, respectively.
The results for Ermidas, Eiras and Mestas share many similarities, with dual-flip 16× and dual-rotate 16× leading in terms of OA. When a lower data augmentation factor (4×) is considered, inner-flip 4× and inner-rotate 4× systematically obtain higher OA values than the other methods. Techniques based on flips tend to perform better than those based on rotations when applied under similar constraints. Figure 15 shows the resulting classification maps for the images of the dataset.

Table 5. Superpixel-based classification performance (in percent) for the Oitaven, Ermidas, Eiras and Mestas scenes using the training and testing sets detailed in Table 3. The best result for each scene is displayed in bold.

In order to summarize the results, Table 6 displays the performance differences between the baseline performance and each of the augmentation techniques. The results for the standard dataset show that dual-rotate 16× was the best performing method overall, followed by flip 4×. For the Galicia dataset, dual-flip 16× and dual-rotate 16× obtained the highest increases in OA, with 1.51% and 1.42%, respectively.

Table 6. Superpixel-based overall accuracy delta (in percent) per augmentation technique over baseline performance for all the scenes from the standard and Galicia datasets. The best average result for each dataset is displayed in bold. RO denotes random occlusion.

Pixel-Based Classification
This section contains the experimental results of the proposal for pixel-based classification. This scenario was considered in order to show the performance of the proposed augmentation techniques within a traditional pixel-based classification scheme. The scheme applied was the same one described in Section 2, albeit without the superpixel segmentation step; the patches are centered on pixels using a sliding-window approach. Tests were run for the Salinas, PaviaU and PaviaC scenes. Table 7 displays the execution times of all the augmentation techniques reviewed in this study for the PaviaC scene. Based on the data obtained, we can observe that the total execution times are significantly higher in the case of pixel-based classification. Training times have a very strong linear correlation with the augmentation factor and, thus, with the number of samples being processed. Prediction times depend only on the dimensions of the image, and there is little variation across the different executions. It is worth noting that pixel-based classification is not practical for large images due to the execution times required: the number of pixels that need to be predicted, three orders of magnitude higher than the number of training samples, causes prediction times to be significantly higher than training times.

Discussion
The literature on multispectral and hyperspectral image augmentation contains a multitude of data augmentation techniques, most of which are used in combination with pixel-based classification; there is a notable lack of proposals approaching data augmentation in combination with superpixel-based classification schemes. Examples of the use of geometric transforms in pixel-based classification schemes can be seen in [13,32,33]. Attempts to synthesize new samples by drawing data from a multivariate normal distribution initialized with the standard deviations of the bands in an HSI can be found in [34]. Reference [15] approaches the generation of synthetic samples to be later used in the training of deep networks through the use of GANs, as does [14]. These generative models are very costly to train and require fine-tuning of the hyperparameters on a per-scene basis. The data augmentation framework proposed in this work leverages the composition of simple transformations that require no parameter tuning to achieve robust increases in classification accuracy when used in a superpixel-based classification scheme. This makes the derived methods a convenient replacement for the traditional rotation and flip (mirroring) operations.
Reference [35] already introduced the idea of dividing an input patch into several regions in order to better exploit the spatial information. That paper focuses on the selection and extraction of a number of predefined regions surrounding a pixel of interest that are later fed to several CNNs. More examples of the use of the spatial correlation between pixels in a scene can be found in [10], where the authors define pixel-pair features. During the training phase, sample pairs are fed to a CNN architecture that, once trained, is used during the testing phase to compute the labels of the samples surrounding the pixel being tested; the final label is obtained by a majority vote over the outputs. A similar approach is taken in [11] with pixel-block pairs (PBPs), where the authors further build upon that idea by adding explicit spatial information to the new PBP features. These proposals pursue a goal that is similar, in essence, to what DWS achieves by using the central pixel of each superpixel to extract a patch: by selecting the central pixel, the probability of obtaining data from a region with high homogeneity is maximized, yielding an increase in classification accuracy.
To the best of our knowledge, no systematic comparison of data augmentation techniques focused on superpixel-based classification has yet been published in the field of remote sensing. Existing papers focus on pixel-based classification, and the datasets usually comprise small, low-resolution scenes. Comparing results across proposals is further complicated by the use of different models or network architectures, making it challenging to assess the relative quality of the different techniques. In this work, we provide a view of the current data augmentation landscape through a comparison that demonstrates the effectiveness of the proposed DWS approach.
The experimental results for all the datasets show that techniques based on the DWS proposal outperform the other techniques from the literature considered in this comparison for the classification of large, high-resolution images. It is important to note that, using DWS, training is performed with a single observation per superpixel, reducing the amount of data that has to be evaluated by over two orders of magnitude compared to pixel-based schemes. In most of the previous tests, augmentation techniques making use of the DWS framework showed less dispersion in the results across runs for all scenes, with noticeably smaller standard deviations than traditional techniques. Based on the evidence observed, dual-rotate 16×, with an increase in OA of 1.51%, should be the preferred augmentation technique when processing small, low-resolution scenes, and dual-flip 16×, with an increase in OA of 2.01%, the best method for large, high-resolution scenes.
As part of future work, we plan to study the viability of further improving the data augmentation techniques for images containing irregular structures when performing superpixel-based classification. The possibility of adding some parametrization to the augmentation techniques based on the characteristics of the patches extracted from the superpixels is being considered.

Conclusions
In this work, the DWS data augmentation framework for superpixel-based DL classification of large hyperspectral scenes is presented. DWS relies on patch extraction using a superpixel segmentation obtained by the application of the SLIC algorithm in order to reduce the complexity of the classification process. The extracted patches undergo patch subdivision, creating two regions over which transformations are independently applied. Four data augmentation techniques based on the DWS framework using rotate and flip transformations are proposed: inner-rotate, dual-rotate, inner-flip and dual-flip. These techniques can also be used for pixel-based classification with minimal changes.
A comprehensive comparison of the proposals to other data augmentation techniques from the literature was carried out for both superpixel and pixel-based classification scenarios in terms of classification accuracy and execution times. The results obtained show that the proposed DWS approach successfully manages to reduce overfitting and increase the generalization capabilities of the resulting models. Execution times are also reduced when compared to traditional pixel-based classification schemes. Based on the results obtained, the DWS-based dual-rotate 16× is the preferred augmentation technique when processing small, low-resolution scenes, and dual-flip 16× is the best method for large, high-resolution scenes.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: