Sample Generation with Self-Attention Generative Adversarial Adaptation Network (SaGAAN) for Hyperspectral Image Classification

Abstract: Hyperspectral image analysis plays an important role in agriculture, the mineral industry, and military applications. However, classifying high-dimensional hyperspectral data with few labeled samples is challenging. Generative adversarial networks (GANs) have been widely used for sample generation, but it is difficult to obtain high-quality samples because of unwanted noise and uncontrolled divergence. To generate high-quality hyperspectral samples, a self-attention generative adversarial adaptation network (SaGAAN) is proposed in this work. It aims to increase the number and quality of training samples to avoid the impact of over-fitting. Compared to traditional GANs, the proposed method makes two contributions: (1) it includes a domain adaptation term that constrains generated samples to be closer to the original ones; and (2) it uses the self-attention mechanism to capture the long-range dependencies across the spectral bands and further improve the quality of generated samples. To demonstrate the effectiveness of the proposed SaGAAN, we tested it on two well-known hyperspectral datasets: Pavia University and Indian Pines. The experimental results show that the proposed method improves the classification accuracy, even with a small number of initial labeled samples.


Introduction
With the fast development of remote sensing technology, hyperspectral sensors are now able to capture high spatial resolution images with hundreds of spectral bands, such as those on the recently launched satellites Zhuhai and Gaofen-5. With narrow and contiguous spectral bands, it is now possible to identify land cover targets with high accuracy. Therefore, hyperspectral images have been widely used in crop monitoring, mineral exploration, and urban planning. For these applications, the primary task is image classification. Due to the high dimensionality of hyperspectral data, it is difficult to find representative features that discriminate between classes. To explore robust features in the spectral domain, principal component analysis (PCA), locally linear embedding, and neighborhood-preserving embedding have been widely used for efficient unsupervised feature extraction [1,2]. Meanwhile, the supervised [...] dimension collapse due to the contradictory nature of two-player games. To improve the stability of GANs, a triple GAN was proposed to achieve better discriminative ability [17]. However, when generating additional labeled spectral profiles, current GANs are sensitive to noise and neglect the relationships between spectral bands. Besides, the generated samples are often alienated from the original ones, which inevitably prevents them from boosting classification results.
To solve the above problems, in this paper, we propose the self-attention generative adversarial adaptation network (SaGAAN) to generate high-quality labeled samples in the spectral domain for hyperspectral image classification. Two modifications have been made in this framework: (1) the self-attention mechanism is included to model long-range dependencies [18] and reduce unintended noise, which stabilizes the GAN model; and (2) a cross-domain loss term is added to increase the similarity between generated samples and the original ones. Therefore, SaGAAN is able to generate high-quality realistic samples by considering band dependencies and cross-domain loss. Based on the generated samples, better classification results can be achieved.
The rest of this paper is organized as follows. Section 2 describes the background of relevant studies. Section 3 gives detailed information about the proposed SaGAAN. Section 4 details the experimental results and comparisons with other methods. Finally, the conclusion is given in Section 5.

Generative Adversarial Networks (GAN)
Different from discriminative models, GAN is one of the representative models in the field of generative modeling. Instead of exploring discriminative features, a generative model aims to estimate the distribution $p_{data}$ of unknown data. In the scope of generative modeling, GANs use the framework of the deep neural network to formulate the data distribution. Traditionally, a GAN consists of two adversarial players: a generator and a discriminator. The generator aims to generate realistic samples to fool the discriminator, while the discriminator constantly upgrades itself to better distinguish fake samples from real ones.
Mathematically, the generator can be represented by $G$ with parameters $\theta_G$. Similarly, the discriminator is denoted as $D$ with parameters $\theta_D$. The standard loss function of GAN models is
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],$$
where $x$ is sample data from the unknown distribution $p_{data}$ and $z$ is noise drawn from the latent space (with distribution $p_z$) to initialize the generator. GAN has achieved great success in image generation, information restoration, and data fusion [19][20][21]. Recently, some improved GANs can perform image classification by adding class-specific terms to the discriminator (e.g., [7,14,15]). However, the power of the generator and of the additional samples derived from it remains unexplored. Therefore, it is necessary to analyze the quality of generated samples and the improvements they bring to hyperspectral image classification.
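As a small numerical illustration of the adversarial objective, the following sketch estimates the standard GAN value function from discriminator outputs. The function name `gan_value` and the toy probabilities are assumptions made for illustration; they are not part of the SaGAAN implementation.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, values in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, values in (0, 1).
    """
    # Clip to avoid log(0) for saturated discriminator outputs.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# Hypothetical outputs: a discriminator that cannot tell real from fake.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# Hypothetical outputs: a confident and correct discriminator.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to raise this value (its upper bound is 0), while the generator tries to lower it by making `d_fake` larger.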

Domain Adaptation
Although GAN has the ability to generate realistic samples to enrich the training dataset, it is difficult to stabilize the GAN during its training process, especially for high-dimensional data generation. Moreover, the generated samples are often alienated from the original ones (e.g., due to spectral shifts), which prevents high-quality sample generation. To improve the ability of sample generation, a domain adaptation term has proven to be useful. In general, there are several categories of domain adaptation methods, e.g., representation matching, transferable feature selection, and selective sampling [22].
To compensate for the effects of data shifts, domain adaptation aims to make samples transferable across different datasets. Suppose there are two domains, called the source domain and the target domain, which are two data acquisitions at different times or over different regions. To formulate the classification problem, the joint probability distributions of class labels and observations from the source and target domains are $J_s(X_s, L)$ and $J_t(X_t, L)$, respectively, where $X$ is the input data (e.g., spectral bands) and $L$ is the output label. Domain adaptation methods can transfer the classifier trained on the source domain to predict class labels in the target domain. In this scope, multidimensional histogram matching [23], principal component analysis (PCA) based data alignment, and kernel PCA (KPCA) have been used for domain adaptation [24]. Similarly, the maximum mean discrepancy (MMD) has been used to minimize the sample distances between the source and target domains [25]. Meanwhile, semisupervised domain adaptation methods have also been intensively studied. For instance, the maximum-likelihood (ML) classifier has been extended with Bayesian rules for the problem of domain adaptation. Based on the assumption of a Gaussian distribution, cross-domain information can be effectively captured. To explore domain-invariant features, deep learning algorithms such as CNNs and GANs can also be used to reduce domain shifts [26,27]. Therefore, domain adaptation is one of the most effective strategies to reduce the discrepancy between two separate domains.

Attention Models
Generative models can directly estimate the data distribution from real samples. Compared to natural image generation, hyperspectral samples have a large number of spectral bands, which makes it hard for GANs to capture the dependencies between spectral bands. The attention mechanism has the ability to capture global contextual information and model long-range dependencies. For instance, self-attention [28] calculates the response for a specific position inside a sequence by attending to all positions within the same sequence. Self-attention has proven to be useful in machine translation models [29]. In addition, the combination of self-attention and deep learning algorithms can significantly improve the precision of image classification, image generation, and spatial-temporal pattern recognition [30,31]. To formulate the conventional self-attention model, we have
$$\alpha_{i,j} = \frac{\exp(s_{i,j})}{\sum_{j'} \exp(s_{i,j'})}, \quad s_{i,j} = Q_i^{\top} K_j,$$
where $\alpha_{i,j}$ indicates the response at location $i$ when attending to location $j$ over the entire sequence. The output of the self-attention layer is
$$\mathrm{Attention}_i = \sum_{j} \alpha_{i,j} V_j,$$
where $Q = W_f x$, $K = W_g x$, and $V = W_v x$. $W \in \mathbb{R}^{\hat{C} \times C}$ are the convolution weights with kernel sizes of $1 \times 1$. The final output of the attention layer has the form $g_i = \gamma \, \mathrm{Attention}_i + x_i$.
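The self-attention computation over spectral positions can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the function name `self_attention_1d`, the toy shapes, the softmax normalization over attended positions, and the choice of a square value projection (so that the residual sum with the input is well defined) are not taken from the paper.

```python
import numpy as np

def self_attention_1d(x, w_f, w_g, w_v, gamma):
    """Self-attention over a 1D feature map.

    x:        (C, N) features with C channels at N spectral positions.
    w_f, w_g: (C_hat, C) query/key weights; a 1x1 convolution is a
              per-position matrix product.
    w_v:      (C, C) value weights (assumed square so the output matches x).
    gamma:    scalar weight of the attention branch.
    """
    q = w_f @ x                      # queries, (C_hat, N)
    k = w_g @ x                      # keys,    (C_hat, N)
    v = w_v @ x                      # values,  (C, N)
    scores = q.T @ k                 # (N, N), scores[i, j] = q_i . k_j
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # each row sums to 1
    attended = v @ alpha.T           # column i = sum_j alpha[i, j] * v_j
    return gamma * attended + x, alpha

# Toy example: 8 channels over 103 positions (103 matches the Pavia band count).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 103))
w_f = rng.normal(size=(2, 8))
w_g = rng.normal(size=(2, 8))
w_v = rng.normal(size=(8, 8))
out, alpha = self_attention_1d(x, w_f, w_g, w_v, gamma=0.0)
```

With `gamma=0.0` the layer returns its input unchanged, so the attention branch only contributes as `gamma` moves away from zero.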

Self-Attention Generative Adversarial Adaptation Network
To improve the stability of traditional GANs and increase the quality of generated samples, we propose the Self-attention Generative Adversarial Adaptation Network (SaGAAN) for hyperspectral sample generation and classification, as shown in Figure 1. SaGAAN considers both self-attention and domain adaptation to improve the quality of generated samples. Specifically, it is difficult to stabilize traditional GANs during the training process. Furthermore, the generated samples are often alienated from the original ones. Therefore, to ensure that the generated samples are similar to the original input samples, we introduce the domain adaptation technique to constrain the similarity between generated and original samples. Suppose the generated samples are $G(z)$ and the original ones are $O$. For an $N$-layer discriminator $D$, the domain adaptation term compares, at each layer $n = 1, \dots, N$, the deep features $D_n(O)$ and $D_n(G(z))$, which are taken from the discriminator's middle-layer activations. To better measure the divergence between generated samples and the reference ones, the maximum mean discrepancy (MMD) loss function applied in this study measures the distance between two probability distributions. The MMD attains its minimum of zero if the original data and generated samples follow the same distribution.
Suppose the original hyperspectral profiles are $o \in O$ with the data distribution $P_O$ to be learned. For SaGAAN, the generator $G$ learns to map a variable $z$ from the latent space to the original data space, $G(z) \in X$, with the distribution $P_G$ and conditional label $y \in Y$. Then, the discriminator evaluates whether a sample comes from the original distribution or the generated one. Different from the minimax loss or hinge loss, the MMD loss uses a kernel $k$ to map the discrepancy between two samples. Given two distributions $P_G$ and $P_O$, the squared MMD distance has the following formulation:
$$D_k^2(P_G, P_O) = \mathbb{E}_{g,g'}[k(g, g')] - 2\,\mathbb{E}_{o,g}[k(o, g)] + \mathbb{E}_{o,o'}[k(o, o')],$$
where $g, g'$ are two samples from the generator and $o, o'$ are two samples from the original dataset. The kernel $k(o, g)$ measures the similarity between two samples. When the generated samples have a distribution equal to the original one $P_O$, $D_k^2(P_G, P_O)$ is zero. Instead of using MMD as the loss function for adversarial network optimization, SaGAAN calculates the MMD term for domain adaptation. Thus, the discriminator $D$ has the ability to measure the discrepancies between two samples. The objective function for the discriminator can be formulated as
$$\max_D L_D = \mathbb{E}_{g,g'}[k_D(g, g')] - 2\,\mathbb{E}_{o,g}[k_D(o, g)] + \mathbb{E}_{o,o'}[k_D(o, o')],$$
where $k_D$ denotes the kernel evaluated on discriminator features. To maximize the loss function, the discriminator aims to reduce $\mathbb{E}_{o,g}[k_D(o, g)]$, which forces generated samples away from the original ones. Meanwhile, the discriminator minimizes the intra-class variance by increasing $\mathbb{E}_{g,g'}[k_D(g, g')]$ and $\mathbb{E}_{o,o'}[k_D(o, o')]$. Similarly, the loss function for the generator is
$$\min_G L_G = \mathbb{E}_{g,g'}[k_D(g, g')] - 2\,\mathbb{E}_{o,g}[k_D(o, g)] + \mathbb{E}_{o,o'}[k_D(o, o')].$$
The discrepancy between generated samples and the original ones can be reduced by implementing the MMD-based domain adaptation term. However, the generator usually introduces noise from the latent distribution and neglects long-range dependencies. Therefore, it is still important to consider the band dependencies for hyperspectral sample generation. For SaGAAN, the self-attention mechanism is integrated to improve the quality of generated samples.
In general, SaGAAN has two improvements compared to the traditional GAN model: the MMD-based domain adaptation for the discriminator and the self-attention mechanism for modeling long-range dependencies. The final loss function combines the adversarial objective with the MMD-based domain adaptation term. For SaGAAN, the conditional adversarial network has been adopted for class-specific hyperspectral data generation. Once the loss function is optimized, SaGAAN can produce high-quality class-specific hyperspectral samples. Different from the traditional GAN, SaGAAN can effectively capture the band dependencies over the spectral domain and reduce noise. Moreover, the generated samples are closer to the original ones with the help of domain adaptation and MMD penalization. With the help of generated samples, it becomes possible to perform hyperspectral image classification without many additional training samples.
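To make the squared MMD distance concrete, the sketch below computes a biased sample estimate with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, the function names, and the toy low-dimensional data are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between rows of a (n, d) and b (m, d)."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Guard against tiny negative values caused by rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2(g, o, sigma=1.0):
    """Biased estimate of E[k(g,g')] - 2 E[k(o,g)] + E[k(o,o')]."""
    return float(gaussian_kernel(g, g, sigma).mean()
                 - 2.0 * gaussian_kernel(o, g, sigma).mean()
                 + gaussian_kernel(o, o, sigma).mean())

# Toy data: "original" samples, a well-matched generator, and a shifted one.
rng = np.random.default_rng(0)
original = rng.normal(size=(200, 4))
close = rng.normal(size=(200, 4))
shifted = rng.normal(loc=2.0, size=(200, 4))

mmd_same = mmd2(original, original)     # identical sample sets
mmd_close = mmd2(close, original)       # same distribution, new draws
mmd_shifted = mmd2(shifted, original)   # mismatched distribution
```

The estimate is zero for identical sample sets and grows as the generated distribution drifts away from the original one, which is the property the domain adaptation term relies on.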

Hyperspectral Datasets
To demonstrate the ability of the proposed SaGAAN, two well-known hyperspectral datasets were included. These two datasets were collected by the Reflective Optics System Imaging Spectrometer (ROSIS) and the AVIRIS sensor, respectively. Due to the high dimensionality of these datasets and the lack of training samples, it is difficult to interpret them efficiently. Detailed descriptions of the two datasets are as follows.

Pavia University Dataset
The Pavia University dataset was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The image size is 610 × 340 pixels, with a ground spatial resolution of 1.3 m. There are 103 spectral bands available after removing 12 noisy bands. The spectral bands range from 430 to 860 nm. Nine types of land cover targets were labeled for identification; 10% of the labeled samples were used for training and another 10% for testing. The pseudo-color composite image and the reference map are shown in Figure 5.

Indian Pines Dataset
The Indian Pines dataset was acquired by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana. The image size is 145 × 145 pixels, with high dimensionality in the spectral domain. The sensor measured pixel response in 224 bands in the 400-2500 nm region of the visible and infrared spectrum. After removing the noisy bands caused by atmospheric absorption, 200 spectral bands are left for data analysis; 10% of the labeled samples were used for training and another 10% for testing. The pseudo-color composite image and the reference map are shown in Figure 6.
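The 10% training and 10% testing selection used for both datasets can be sketched as below. Drawing the two subsets per class (stratified) and without overlap is an assumption, as are the function name `split_labeled_pixels` and the synthetic label vector; the paper only states the two percentages.

```python
import numpy as np

def split_labeled_pixels(labels, train_frac=0.1, test_frac=0.1, seed=0):
    """Return disjoint train/test index arrays, sampled class by class."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        # Shuffle the indices of all pixels that carry label c.
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(1, int(round(train_frac * idx.size)))
        n_test = max(1, int(round(test_frac * idx.size)))
        train_idx.append(idx[:n_train])
        test_idx.append(idx[n_train:n_train + n_test])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# Synthetic, imbalanced label vector standing in for the labeled pixels.
labels = np.repeat(np.arange(1, 4), [1000, 500, 20])
train, test = split_labeled_pixels(labels)
```

Sampling per class keeps at least one training sample for rare classes, which matters for datasets where some classes have very few labeled pixels.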

Configuration of SaGAAN
To serve the purpose of hyperspectral sample generation, the SaGAAN framework is built on a 1D generator and a 1D discriminator. To capture the data distribution over spectral bands, SaGAAN converts noise from the latent space into realistic spectral profiles. The configuration of SaGAAN is listed in Table 1. Compared to traditional GANs, SaGAAN pays particular attention to long-range dependencies and domain adaptation for high-quality sample generation. The attention term is integrated inside the generator to reduce noise and account for long-range dependencies during sample generation. Meanwhile, to ensure that the generated samples are consistent with the original ones, the domain adaptation term is added inside the discriminator. Due to the nature of the deconvolution operation, we added an additional band to make sure the number of spectral bands is an odd number. Moreover, to better illustrate the effectiveness of SaGAAN, we included other sample generation approaches (the traditional GAN, self-attention GAN (SAGAN), and adaptation GAN (ADGAN)) for comparison.
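One way to see why a deconvolution (transposed convolution) stack can favor an odd number of output bands is the standard output-length formula for a 1D transposed convolution without dilation or output padding. The kernel sizes, strides, and input lengths below are hypothetical and are not the values from Table 1.

```python
def deconv_output_length(length_in, kernel, stride, padding=0):
    """Output length of a 1D transposed convolution
    (no dilation, no output padding)."""
    return (length_in - 1) * stride - 2 * padding + kernel

# Hypothetical layer: stride 2 with an odd kernel always yields an odd length.
example_a = deconv_output_length(100, kernel=3, stride=2)             # 201
example_b = deconv_output_length(50, kernel=5, stride=2, padding=1)   # 101
odd_lengths = [deconv_output_length(n, kernel=3, stride=2) for n in range(1, 60)]
```

Under these assumed settings, an even band count such as 200 cannot be produced directly, whereas 201 can, which is consistent with padding the spectra by one band.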

Effect of Domain Adaptation
Domain adaptation is one of the most important factors for high-quality sample generation. Due to the difficulty of adversarial network training, the generated samples are often alienated from the original ones. Therefore, reducing the discrepancies between generated samples and the original ones is the major challenge for successful adversarial network training. In SaGAAN, the discriminator contains an additional term to measure the feature distances between generated samples and the original spectral profiles. Specifically, the discriminator $D$, a 1D convolutional neural network (CNN), has $L$ hidden layers. For each layer, the deep feature can be represented as $D_l(x)$, $l = 1, \dots, L$, and the feature distance between generated samples and the original ones is $D_l(o_i) - D_l(g_i)$. Thus, SaGAAN is able to produce realistic samples based on this similarity measurement. To better illustrate the effect of the domain adaptation term in SaGAAN, we developed two separate adversarial networks for hyperspectral sample generation, with and without domain adaptation.
For convenience, we tested the domain adaptation term on the Pavia dataset by using the training dataset. Each network generated 0.2 million hyperspectral samples in total, which was about 22 thousand samples for each class. We mapped the generated samples into a lower-dimensional space for better illustration, as shown in Figure 2. The projection map of the samples generated with domain adaptation has clear boundaries between different classes. Without domain adaptation, the generated samples are often mixed together, which fails to guide supervised classification. In particular, for Classes 3 (Gravel) and 8 (Self-Blocking Bricks), the inter-class similarity has been significantly reduced. Meanwhile, intra-class variation, such as for Classes 1 (Asphalt) and 7 (Bitumen), has also been greatly suppressed. Therefore, domain adaptation is a major improvement in generative adversarial networks since it considers the mismatches between generated samples and the original ones.
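The layer-wise feature comparison can be illustrated with a toy stand-in for the discriminator. Everything here is hypothetical: the fixed smoothing kernels, the ReLU activations, the mean squared difference used as the distance, and the synthetic spectra are assumptions chosen only to show how per-layer features of an original and a generated sample are compared.

```python
import numpy as np

def toy_features(x, kernels):
    """Return the activations of a small stack of 1D conv + ReLU layers."""
    feats, h = [], np.asarray(x, dtype=float)
    for k in kernels:
        # 'valid' convolution shortens the signal by len(k) - 1 per layer.
        h = np.maximum(np.convolve(h, k, mode="valid"), 0.0)
        feats.append(h)
    return feats

def layer_feature_distance(feats_o, feats_g):
    """Mean squared difference between features, one value per layer."""
    return [float(np.mean((fo - fg) ** 2)) for fo, fg in zip(feats_o, feats_g)]

rng = np.random.default_rng(0)
kernels = [np.array([0.25, 0.5, 0.25]), np.array([0.5, 0.5])]
o = 0.5 + 0.3 * np.sin(np.linspace(0.0, 3.0, 103))   # synthetic "original" curve
g = o + rng.normal(scale=0.05, size=o.size)          # synthetic noisy "generated" curve

same = layer_feature_distance(toy_features(o, kernels), toy_features(o, kernels))
noisy = layer_feature_distance(toy_features(o, kernels), toy_features(g, kernels))
```

Identical inputs give zero distance at every layer, while the noisy curve gives a positive distance that a training loss could penalize.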

Effect of Self-Attention
Hyperspectral data contain hundreds of spectral bands that have long-range dependencies (e.g., vegetation has higher reflectance in the near-infrared bands than in the red band). However, traditional GANs only mimic spectral reflectance at local scales and neglect the relationships across spectral bands. Moreover, traditional GANs introduce noise that also degrades the quality of generated samples. Different from domain adaptation, the self-attention mechanism focuses on capturing long-range dependencies between spectral bands. Meanwhile, self-attention reduces unwanted noise and makes the curves of generated hyperspectral samples smoother.
To demonstrate the effectiveness of the self-attention mechanism, we compared the samples generated by SaGAAN with and without the self-attention constraint. To better understand the impact of self-attention, we chose Classes 5 and 6 in the Pavia dataset for hyperspectral data generation. The generated samples are shown in Figure 3. The first two rows represent the spectral curves of painted metal sheets and the last two rows are bare soil reflectances. For the first two rows, we can conclude that much noise has been introduced, which resulted in spikes across the spectral bands, especially for the middle column. In addition, for the bare soil, the generated curves suffer from random noise when the self-attention constraint is not used. However, the bare soil spectral profiles become much more similar to the original ones after adding the self-attention term. Moreover, long-range dependencies in the spectral curves, such as low points and high points, have been well represented by the self-attention mechanism.
Figure 3. The spectral curves generated by SaGAAN with/without self-attention mechanism: (a) the Class 5 (Painted metal sheets) spectral curves generated without self-attention; (b) the Class 5 spectral curves generated with self-attention; (c) the Class 6 (Bare Soil) spectral curves generated without self-attention; and (d) the Class 6 spectral curves generated with self-attention.

Generated Sample Analysis
From the above, we can conclude that both domain adaptation and self-attention are crucial parts of high-quality spectral profile generation. In SaGAAN, we utilize the MMD measurement to minimize the distances between the generated samples and the original ones. To calculate the MMD distance, the activations of hidden layers inside the discriminator convert the generated samples and the original ones into deep feature representations. Then, the similarity of those features can be measured with the MMD strategy. Meanwhile, the self-attention mechanism also forces the generated samples to be aware of long-range dependencies across different spectral bands. In general, domain adaptation and self-attention stabilize SaGAAN during the training process and prevent potential gradient explosion. To illustrate the effectiveness of combining domain adaptation and self-attention, the loss function values for both generator and discriminator are shown in Figure 4. In this figure, the loss function values for generator and discriminator jitter at the beginning for SaGAAN without domain adaptation and self-attention. Moreover, the loss values can reach almost 6 and then rise again at iteration 420, which shows that the generator is not stable during the training process. When domain adaptation and self-attention are involved, the loss values become much more stable through the entire training stage. To measure the quality of generated samples, we mapped all available training samples in the Pavia dataset to a two-dimensional space, as shown in Figure 4c. The training samples are not evenly distributed: Class 2 (Meadows) represents almost half of the total samples. In addition, samples are scattered in the feature space without significant class boundaries. In contrast, SaGAAN generated high-quality samples based on a small fraction (only 10%) of all available ones.
In this experiment, SaGAAN generates 0.2 million samples and each class is equally distributed with 22 thousand samples, as shown in Figure 4d. With the help of domain adaptation and self-attention, SaGAAN generated high-quality samples that contain rich intra-class variation and clear boundaries between different classes. Based on high quality generated samples, better classification results can be achieved.

Hyperspectral Image Classification and Comparison
To demonstrate the effectiveness of the generated samples, we combined generated samples with the original dataset for hyperspectral image classification. Specifically, for each dataset, we selected as many generated samples as there are original training samples. A 1D CNN framework was applied for classification. The configuration of the 1D CNN is the same as the first five layers of the discriminator listed in Table 1. Finally, we tested the classification performance with and without the additional generated samples.
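The augmentation step can be sketched as drawing, from a pool of generated samples, as many samples as there are original training samples and concatenating the two sets. Random selection without replacement, the function name `augment_training_set`, and the toy arrays are assumptions; the paper does not state how the generated samples were picked.

```python
import numpy as np

def augment_training_set(x_orig, y_orig, x_gen, y_gen, seed=0):
    """Append len(x_orig) randomly chosen generated samples to the originals.

    Assumes the generated pool is at least as large as the original set.
    """
    rng = np.random.default_rng(seed)
    pick = rng.choice(len(x_gen), size=len(x_orig), replace=False)
    x = np.concatenate([x_orig, x_gen[pick]])
    y = np.concatenate([y_orig, y_gen[pick]])
    return x, y

# Toy arrays: 50 original and 400 generated spectra with 103 bands each.
rng = np.random.default_rng(1)
x_orig, y_orig = rng.random((50, 103)), rng.integers(1, 10, size=50)
x_gen, y_gen = rng.random((400, 103)), rng.integers(1, 10, size=400)
x_train, y_train = augment_training_set(x_orig, y_orig, x_gen, y_gen)
```

The resulting training set is exactly twice the size of the original one, and the original samples stay at the front of the arrays.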

Pavia University Dataset
In the experiment, we compared the SaGAAN-based hyperspectral image classification method with three other image classification strategies. First, the original training samples were directly fed into the 1D CNN framework for training and classification. Second, the domain adaptation-based sample generation strategy was applied to generate additional samples, and the generated samples along with the original ones were fed into the 1D CNN for training and classification. Third, the self-attention based strategy was likewise applied for sample generation and 1D CNN training. During the entire experiment, each method generated 4273 additional samples for the Pavia dataset, which is the same as the size of the original training dataset.
The classification accuracies are reported in Table 2. For the CNN with the original training dataset, the classification accuracy reaches 91.48%. However, due to the shortage of training samples, the accuracy is relatively low for Class 7, at around 78%. With the sample generation strategies, the classification accuracies increase thanks to the additional training samples. For the Ada-CNN, domain adaptation was adopted in the traditional GAN framework, and the generated samples along with the original ones were fed into the CNN for classification. As a result, the classification accuracy increased to 92.08% with domain adaptation samples. The self-attention based samples also increased the overall accuracy by about 0.4%. Lastly, the high-quality samples generated by SaGAAN were utilized to increase the classification accuracy. The classification maps of these four strategies are shown in Figure 5.

Indian Pines Dataset
For the Indian Pines dataset, we compared SaGAAN with the three other classification strategies. The classification accuracies are listed in Table 3. From the results in the table, we conclude that the classification accuracy is quite low when using the traditional CNN with a limited number of training samples. For Classes 1 and 4, the classification accuracies are 60% and 55.64%, respectively. The overall accuracy is 77.44% when only the original samples are used. With domain adaptation-based sample generation, the overall accuracy increased to 80.58%, but challenges remain in Classes 1 and 9, where the number of samples is relatively low. The self-attention mechanism improved the quality of generated samples; the overall accuracy is about 80.97%. However, the classification accuracies across classes are not balanced. SaGAAN, which considers both domain adaptation and self-attention, further improved the quality of generated samples and raised the overall classification accuracy to 81.14%. In addition, the classification maps are shown in Figure 6.
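The overall and per-class accuracies discussed for both datasets can be computed from reference and predicted labels as in the following sketch. The function name `accuracy_report` and the toy label vectors are illustrative assumptions and do not reproduce the values in Tables 2 and 3.

```python
import numpy as np

def accuracy_report(y_true, y_pred):
    """Return overall accuracy and a dict of per-class accuracies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = float(np.mean(y_true == y_pred))
    per_class = {
        # Fraction of samples of class c that were predicted as c.
        int(c): float(np.mean(y_pred[y_true == c] == c))
        for c in np.unique(y_true)
    }
    return overall, per_class

# Toy labels: class 3 has a single test sample and it is misclassified.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 1]
overall, per_class = accuracy_report(y_true, y_pred)
```

A class with very few test samples can have a per-class accuracy far from the overall value, which is why both figures are worth reporting when classes are imbalanced.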

Conclusions
In this paper, we propose a self-attention generative adversarial adaptation network (SaGAAN) to generate realistic, high-quality hyperspectral samples and improve the classification results of hyperspectral images. Specifically, we include domain adaptation to increase the similarity between generated samples and the original ones. Meanwhile, to capture the long-range dependencies and reduce unwanted noise, the self-attention mechanism is also integrated into SaGAAN. The experimental results demonstrate that SaGAAN is able to generate high-quality hyperspectral samples and boost the classification accuracy. In the future, we will focus on spatial feature generation, which is also important for hyperspectral image classification.