Self-Attention-Based Conditional Variational Auto-Encoder Generative Adversarial Networks for Hyperspectral Classification

Abstract: Hyperspectral classification is an important technique for remote sensing image analysis. For current classification methods, limited training data affect the classification results. Recently, the Conditional Variational Autoencoder Generative Adversarial Network (CVAEGAN) has been used to generate virtual samples to augment the training data, which improves classification performance. To further improve classification performance, we propose a Self-Attention-Based Conditional Variational Autoencoder Generative Adversarial Network (SACVAEGAN) built on CVAEGAN. Compared with CVAEGAN, we first use random latent vectors to obtain more enhanced virtual samples, which improves the generalization performance. Then, we introduce the self-attention mechanism into our model to force the training process to pay more attention to global information, which yields better classification accuracy. Moreover, we improve model stability by incorporating the WGAN-GP loss function into our model to reduce the probability of mode collapse. Experiments on three data sets and a comparison with state-of-the-art HSI classification methods show that SACVAEGAN has a clear advantage in accuracy.


Introduction
With the development of remote sensing technology, hyperspectral images (HSI) have made great breakthroughs in earth observation. Different from three-channel color images, HSI can simultaneously collect images in hundreds of spectral bands, providing rich spectral information. Therefore, HSI are widely used in many fields [1][2][3][4][5][6], such as satellite remote sensing, agriculture observation, and mineral exploration.
While VAE has a stable training process, its generated samples tend to be blurry; GAN, in contrast, is unstable in the training stage, and the samples generated from GAN are far from natural. The CVAEGAN network combines the advantages of the two to improve the quality of the generated images.
CVAEGAN [47] is a network structure that combines VAE [49] and GAN [32]. It consists of four parts: (1) an encoder network E, used to learn the relationship between a latent vector space and the real image space; (2) the generator network G, which generates the corresponding virtual sample through the given latent vector; (3) the discriminator network D, used to judge whether a given sample is a real sample or a virtual sample; and (4) the classifier network C, used to classify a given sample.
The encoder network E maps the data x to a latent vector space with a Gaussian distribution N(0, 1) through the learned distribution. It outputs the mean value µ and the covariance ε of the latent vector corresponding to the data sample x. The latent vector z is then obtained through z = µ + r ⊙ exp(ε), with r ∼ N(0, 1). At the same time, the distance between p(z) and q(z|x) can be reduced through the KL divergence, where p represents the true latent vector distribution and q represents the latent vector distribution predicted by the model:

$$L_{KL} = \frac{1}{2}\left(\mu^{\top}\mu + \sum\left(\exp(\epsilon) - \epsilon - 1\right)\right) \tag{1}$$

The classifier C takes the data x as input and outputs the posterior probability P(c|x). It minimizes the following loss function:

$$L_C = -\mathbb{E}_{x \sim p_r}\left[\log P(c \mid x)\right] \tag{2}$$

The generator G receives the latent vector z and generates the corresponding data sample G(z). The discriminator D is a binary classifier that judges whether a given data sample is real or generated. The loss function of the discriminator is as follows:

$$L_D = -\mathbb{E}_{x \sim p_r}\left[\log D(x)\right] - \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{3}$$

To stabilize training and synthesize more realistic data samples, CVAEGAN [47] adds a feature matching method to the generator. The generator G minimizes the following loss function:

$$L_G = \frac{1}{2}\left\|f_D(x) - f_D(G(z, c))\right\|_2^2 + \frac{1}{2}\left\|f_C(x) - f_C(G(z, c))\right\|_2^2 + \frac{1}{2}\left\|f_D(x) - f_D(G(z_r, c))\right\|_2^2 + \frac{1}{2}\left\|f_C(x) - f_C(G(z_r, c))\right\|_2^2 \tag{4}$$

where f_D and f_C represent the middle-layer features of the discriminator and classifier, respectively; G(z, c) represents the sample generated from the latent vector produced by the encoder; and G(z_r, c) represents the sample generated from a randomly drawn latent vector.
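To make the sampling step and the KL term of Equation (1) concrete, the following is a minimal PyTorch sketch; the batch size and 128-D latent dimension are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of the reparameterization trick and the KL loss (Equation (1)).
# Shapes and the latent dimension are assumptions for illustration only.
import torch

def reparameterize(mu: torch.Tensor, eps_logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + r * exp(eps_logvar) with r ~ N(0, 1)."""
    r = torch.randn_like(mu)
    return mu + r * torch.exp(eps_logvar)

def kl_loss(mu: torch.Tensor, eps_logvar: torch.Tensor) -> torch.Tensor:
    """L_KL = 0.5 * (mu^T mu + sum(exp(eps) - eps - 1)), averaged over the batch."""
    per_sample = 0.5 * ((mu ** 2).sum(dim=1)
                        + (torch.exp(eps_logvar) - eps_logvar - 1).sum(dim=1))
    return per_sample.mean()

# Usage: mu and eps_logvar would come from the two heads of the encoder E.
mu = torch.randn(8, 128)          # batch of 8, assumed 128-D latent space
eps_logvar = torch.randn(8, 128)
z = reparameterize(mu, eps_logvar)
loss = kl_loss(mu, eps_logvar)
```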

Self-Attention
The self-attention [46,48] mechanism allows the network to model long-range dependencies, which improves the classification performance of the classifier. Most GAN-based methods [32] use self-attention in their encoder or generator to enhance performance. Figure 1 shows the structure of self-attention. f(x), g(x), and h(x) are all ordinary 1 × 1 convolutions that differ only in their number of output channels. The output of f(x) is transposed and multiplied by the output of g(x), and the attention map is obtained after the softmax activation function.
After multiplying the obtained attention map by the output of h(x) pixel by pixel, the final attention feature map can be obtained. The specific process is as follows. Suppose that the image features x ∈ R^{C×N} from the previous hidden layer are transformed into two feature spaces f and g to calculate attention, where f(x) = W_f x and g(x) = W_g x. β_{j,i} represents the degree to which the model attends to the i-th position when synthesizing the content of the j-th position, that is, the correlation:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad s_{ij} = f(x_i)^{\top} g(x_j) \tag{5}$$

The output of the attention layer is o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^{C×N}, where C is the number of channels and N is the number of feature positions of the previous hidden layer feature:

$$o_j = \sum_{i=1}^{N} \beta_{j,i}\, h(x_i), \quad h(x_i) = W_h x_i \tag{6}$$

Finally, the output of the attention layer is multiplied by a scale parameter and added to the input feature map, so the final output is as follows:

$$y_i = \gamma\, o_i + x_i \tag{7}$$

To allow the network to first attend to neighborhood information, γ is a scalar learned during self-attention training with an initial value of 0; the weight is then slowly assigned to other long-distance features.
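The following is a minimal PyTorch sketch of this self-attention layer, following Equations (5)-(7); the C/8 channel reduction in f and g is a common choice borrowed from SAGAN-style implementations and is an assumption here, not a value from the paper.

```python
# A sketch of the self-attention layer in Figure 1: f, g, h are 1x1 convolutions,
# beta is the attention map of Equation (5), and gamma is a learned scalar
# initialized to 0 as described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, kernel_size=1)  # f(x) = W_f x
        self.g = nn.Conv2d(channels, channels // 8, kernel_size=1)  # g(x) = W_g x
        self.h = nn.Conv2d(channels, channels, kernel_size=1)       # h(x) = W_h x
        self.gamma = nn.Parameter(torch.zeros(1))                   # starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, hgt, wid = x.shape
        n = hgt * wid                                   # N feature positions
        fx = self.f(x).view(b, -1, n)                   # (B, C', N)
        gx = self.g(x).view(b, -1, n)                   # (B, C', N)
        hx = self.h(x).view(b, c, n)                    # (B, C, N)
        # s_ij = f(x_i)^T g(x_j); softmax over i gives beta_{j,i} (Equation (5))
        beta = F.softmax(fx.transpose(1, 2) @ gx, dim=1)
        o = (hx @ beta).view(b, c, hgt, wid)            # o_j = sum_i beta_{j,i} h(x_i)
        return self.gamma * o + x                       # y = gamma * o + x (Eq. (7))
```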

Methodology
In this section, we introduce the details of the proposed SACVAEGAN method. As shown in Figure 2, the network structure consists of three modules: discriminator, VAE, and classifier. The module of the discriminator is used to determine whether the input samples are real samples or virtual samples. The VAE module is divided into two parts: encoder and generator. The encoder transfers real samples to the latent vector space, and the generator uses the latent vectors generated by the encoder and the random latent vectors to generate virtual hyperspectral samples. The classifier module classifies the input real samples and virtual samples.

Discriminator
Figure 3 shows the discriminator module. It consists of four convolution layers, each with a kernel size of 3 × 3. The self-attention mentioned in Section 2 is applied after the first and second convolution layers. After the last layer, we reshape the data into a feature vector. According to Reference [50], the label information should also be input to the discriminator to make the model more stable. Therefore, we reshape the label to a vector through a full connection layer and concatenate it with the feature vector. Then, a full connection layer is applied to reduce the dimension. At last, the sigmoid function is used to determine whether the data are real.
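As a rough illustration, the sketch below assembles this discriminator in PyTorch. It reuses the SelfAttention module sketched in the previous section; the channel widths, activations, and 128-D label embedding are assumptions rather than values reported in the paper.

```python
# A hedged sketch of the discriminator in Figure 3: four 3x3 convolutions with
# self-attention after the first and second, the label embedded through a fully
# connected layer and concatenated with the flattened features.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_bands: int, n_classes: int, patch: int = 27):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_bands, 64, 3, padding=1), nn.LeakyReLU(0.2),
            SelfAttention(64),        # module sketched in the Self-Attention section
            nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.2),
            SelfAttention(128),
            nn.Conv2d(128, 256, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 256, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.label_fc = nn.Linear(n_classes, 128)       # reshape label to a vector
        self.out = nn.Sequential(
            nn.Linear(256 * patch * patch + 128, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),            # real/fake probability
        )

    def forward(self, x: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x).flatten(1)                  # flattened feature vector
        lab = self.label_fc(y_onehot)                   # label embedding
        return self.out(torch.cat([feat, lab], dim=1))
```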
The loss function of the discriminator D is as follows:

$$L_D = L_{D_{xz}} + L_{D_{xz_{random}}} \tag{8}$$

The first term is the WGAN-GP loss between x and G(z|y), which makes the model more stable, and the second term is the loss with which the discriminator judges G(z_random|y) to be false. Among them:

$$L_{D_{xz}} = \mathbb{E}\left[D(G(z \mid y))\right] - \mathbb{E}\left[D(x_{real})\right] + \lambda_{gp}\, \mathbb{E}_{\hat{x}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\|_2 - 1\right)^2\right] \tag{9}$$

$$L_{D_{xz_{random}}} = -\mathbb{E}\left[\log\left(1 - D(G(z_{random} \mid y))\right)\right] \tag{10}$$

where z represents the latent vector generated by the encoder; z_random represents the randomly generated latent vector; x_real represents the real samples; G(z|y) represents the virtual samples generated by the generator from z and the corresponding label; G(z_random|y) represents the virtual samples generated by the generator from z_random and the corresponding label; x̂ is sampled uniformly along straight lines between real and generated samples; and y represents the label.
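A hedged sketch of how the WGAN-GP term of Equation (9) can be computed in PyTorch follows; the critic interface is simplified (the label input is omitted for brevity), and λ_gp = 10 follows the original WGAN-GP paper rather than a value reported here.

```python
# A minimal sketch of the WGAN-GP gradient penalty: the penalty is computed on
# samples interpolated between real and generated data.
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp: float = 10.0):
    b = x_real.size(0)
    # Interpolation coefficients, broadcast over all non-batch dimensions.
    alpha = torch.rand(b, *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grads = grads.view(b, -1)
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def d_loss_wgan_gp(critic, x_real, x_fake):
    # E[D(G(z|y))] - E[D(x_real)] + gradient penalty, as in Equation (9)
    return (critic(x_fake).mean() - critic(x_real).mean()
            + gradient_penalty(critic, x_real, x_fake))
```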

Variational Auto-Encoder
The module of the Variational Auto-Encoder (VAE) is shown in Figure 4. The VAE consists of two parts: the encoder and the generator. The encoder E transfers the real samples to the latent vector space, while the generator G uses the latent vector to generate virtual hyperspectral samples. As Figure 4 shows, the encoder E is divided into two spectral-spatial feature extraction networks: one is used to obtain the mean vector µ, and the other the covariance ε of the latent vector space. The two feature extraction networks share the same structure, each consisting of a spectral and a spatial feature extraction network. The spectral feature extraction network consists of four 1-D convolution layers with 5 × 1 kernels, with self-attention applied after the first and second layers. The spatial feature extraction network consists of four 2-D convolution layers with 3 × 3 kernels, also with self-attention after the first and second layers. After the spectral and spatial feature extraction networks, we concatenate the spectral-spatial features together and use a full connection layer for dimension reduction. After obtaining the mean vector µ and the covariance ε, we use the following equation to obtain the latent vector:

$$z = \mu + r \odot \exp(\epsilon), \quad r \sim N(0, 1) \tag{11}$$

The goal of the generator G is to learn the distribution of the training data and to generate virtual hyperspectral samples. The generator G consists of two full connection layers and four transposed convolution layers. After receiving the latent vector and the corresponding label, it uses the two full connection layers to reshape the vector, which is then reshaped into a 3-D data cube and sent to the transposed convolution layers. The kernel size of the transposed convolution layers is 3 × 3. Finally, we obtain a virtual hyperspectral sample.
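For concreteness, the sketch below shows one possible implementation of a single spectral-spatial feature extraction branch (one of the two µ/ε heads). The self-attention layers are omitted for brevity, and the channel widths, the use of the center-pixel spectrum for the 1-D branch, and the 128-D latent size are assumptions, not values from the paper.

```python
# A condensed sketch of one spectral-spatial branch of the encoder E:
# a 1-D spectral branch (kernel 5) and a 2-D spatial branch (kernel 3),
# concatenated and reduced by a fully connected layer.
import torch
import torch.nn as nn

class SpectralSpatialBranch(nn.Module):
    def __init__(self, n_bands: int, patch: int = 27, latent_dim: int = 128):
        super().__init__()
        self.spectral = nn.Sequential(               # four 1-D conv layers, kernel 5
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, 5, padding=2), nn.ReLU(),
        )
        self.spatial = nn.Sequential(                # four 2-D conv layers, kernel 3
            nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * n_bands + 64 * patch * patch, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n_bands, patch, patch); take the center-pixel spectrum (assumption)
        center = x[:, :, x.size(2) // 2, x.size(3) // 2]
        spec = self.spectral(center.unsqueeze(1)).flatten(1)
        spat = self.spatial(x).flatten(1)
        return self.fc(torch.cat([spec, spat], dim=1))   # mu or epsilon head
```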
The loss function of the VAE is as follows:

$$L_{VAE} = L_{KL} + L_{rec} + L_{GD} + L_{GC} \tag{12}$$

The first term is the KL divergence, which is used to reduce the difference between the distribution of the obtained latent vectors and the assumed latent vector distribution. The second term is the l_2 reconstruction loss between x and G(z|y). The third term is the sum of the pair-wise feature matching loss between x and G(z|y) in the discriminator D and the loss with which the discriminator judges G(z_random|y) to be true. The last term is the sum of the pair-wise feature matching loss between x and G(z|y) in the classifier C and the classification loss of the classifier. Among them:

$$L_{KL} = \frac{1}{2}\left(\mu^{\top}\mu + \sum\left(\exp(\epsilon) - \epsilon - 1\right)\right) \tag{13}$$

$$L_{rec} = \frac{1}{2}\left\|x_{real} - G(z \mid y)\right\|_2^2 \tag{14}$$

$$L_{GD} = \frac{1}{2}\left\|f_D(x_{real}) - f_D(G(z \mid y))\right\|_2^2 - \mathbb{E}\left[\log D(G(z_{random} \mid y))\right] \tag{15}$$

$$L_{GC} = \frac{1}{2}\left\|f_C(x_{real}) - f_C(G(z \mid y))\right\|_2^2 - \mathbb{E}\left[\log P(y \mid G(z_{random} \mid y))\right] \tag{16}$$

where µ and ε represent the mean value and covariance generated by the encoder, respectively; f_D represents the middle-layer features of the discriminator D; f_C represents the middle-layer features of the classifier; x_real represents the real samples; G(z|y) is the virtual sample generated from the latent vector produced by the encoder; G(z_random|y) is the virtual sample generated from the randomly generated latent vector; and y represents the label.
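The reconstruction and pair-wise feature matching terms reduce to simple squared-error computations. A minimal PyTorch sketch follows, assuming the middle-layer features have already been extracted from D and C by hooks or helper methods not shown here.

```python
# A minimal sketch of the reconstruction (Equation (14)) and pair-wise feature
# matching terms used in Equations (15) and (16).
import torch
import torch.nn.functional as F

def reconstruction_loss(x_real: torch.Tensor, x_recon: torch.Tensor):
    """L_rec = 0.5 * ||x - G(z|y)||_2^2, averaged over the batch."""
    return 0.5 * F.mse_loss(x_recon, x_real, reduction="mean")

def feature_matching_loss(feat_real: torch.Tensor, feat_fake: torch.Tensor):
    """0.5 * ||f(x) - f(G(z|y))||_2^2 on middle-layer features of D or C."""
    return 0.5 * F.mse_loss(feat_fake, feat_real, reduction="mean")
```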

Classifier
The module of classifier C is shown in Figure 5. The purpose of classifier C is to obtain the classification results. Classifier C also consists of spectral-spatial feature extraction networks: the spectral feature extraction network contains five 1-D convolution layers with 1 × 5 kernels, and the spatial feature extraction network contains five 2-D convolution layers with 3 × 3 kernels. Finally, we concatenate the spectral and spatial features together and send them to two full connection layers to obtain the final results. The loss function L_C is as follows:

$$L_C = L_{C_x} + \lambda_1 L_{C_{xz}} + \lambda_2 L_{C_{xz_{random}}} \tag{17}$$

where the first term is the loss of the result obtained by classifying x, the second term is the pair-wise feature matching loss between x and G(z|y), and the last term is the loss of the result obtained by classifying G(z_random|y). λ_1 and λ_2 are the weights of the L_{C_{xz}} and L_{C_{xz_{random}}} losses, respectively. Among them:

$$L_{C_x} = -\mathbb{E}\left[\log P(y \mid x_{real})\right] \tag{18}$$

$$L_{C_{xz}} = \frac{1}{2}\left\|f_C(x_{real}) - f_C(x_z)\right\|_2^2 \tag{19}$$

$$L_{C_{xz_{random}}} = -\mathbb{E}\left[\log P(y \mid x_{z_{random}})\right] \tag{20}$$

where f_C represents the middle-layer features of the classifier, x_real represents the real samples, x_z represents the virtual sample generated by feeding the latent vector z produced by the encoder into the generator, x_{z_random} represents the virtual sample generated by feeding the randomly generated latent vector z_random into the generator, and y represents the label. The procedure of the proposed SACVAEGAN is summarized in Algorithm 1.
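Before turning to Algorithm 1, the sketch below shows how the three classifier terms of Equation (17) can be combined in PyTorch. The logits and feature tensors are illustrative placeholders; the default weights λ_1 = 0.8 and λ_2 = 0.2 match the values selected in the parameter analysis later in the paper.

```python
# A hedged sketch of L_C = L_Cx + lambda1 * L_Cxz + lambda2 * L_Cxz_random.
import torch
import torch.nn.functional as F

def classifier_loss(logits_real, logits_fake_random, feat_real, feat_fake,
                    labels, lam1: float = 0.8, lam2: float = 0.2):
    l_cx = F.cross_entropy(logits_real, labels)           # classify x_real (Eq. (18))
    l_cxz = 0.5 * F.mse_loss(feat_fake, feat_real)        # feature matching (Eq. (19))
    l_cxzr = F.cross_entropy(logits_fake_random, labels)  # classify x_z_random (Eq. (20))
    return l_cx + lam1 * l_cxz + lam2 * l_cxzr
```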

Algorithm 1: Optimization procedure of SACVAEGAN.
Input: the training samples

Experiments
To evaluate the accuracy of our proposed SACVAEGAN, we conducted training and evaluation on three data sets, namely the Indian Pines data set, the PaviaU data set, and the Salinas data set. We used three performance metrics: OA, AA, and K. OA is the overall accuracy, i.e., the ratio of correctly classified samples to the total number of test samples. AA is the average accuracy, i.e., the mean of the per-class classification accuracies. K is the kappa coefficient, which measures the agreement between the classification result and the ground truth while correcting for chance agreement in the confusion matrix.
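For clarity, a small numpy sketch of how the three metrics can be computed from a confusion matrix; treating AA as the mean of row-normalized per-class accuracies follows the usual HSI convention and is an assumption about the paper's exact definition.

```python
# Compute OA, AA, and the kappa coefficient from a confusion matrix cm,
# where cm[i, j] counts samples of true class i predicted as class j.
import numpy as np

def metrics_from_confusion(cm: np.ndarray):
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)     # accuracy of each class
    aa = per_class.mean()                        # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                 # chance-corrected agreement
    return oa, aa, kappa
```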

Indian Pines
The Indian Pines data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwest Indiana. The AVIRIS sensor captures images in the 0.4 to 2.5 micron range. The data consist of 145 × 145 pixels and 224 spectral reflectance bands. After deleting the bands covering the water absorption region, the number of spectral bands was reduced to 200. There are 16 categories in total, and the distribution of samples across categories is extremely unbalanced.

PaviaU
The PaviaU data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the Pavia region of northern Italy. The data consist of 610 × 340 pixels and 115 spectral reflectance bands. After deleting 12 bands affected by noise, 103 spectral bands remained. Most of the pixels are background; only 42,776 pixels are foreground pixels belonging to nine categories, including trees, asphalt, bricks, meadows, etc.

Salinas
The Salinas data set was also collected by the AVIRIS sensor, over the Salinas Valley, California, with a spatial resolution of 3.7 m. The data consist of 512 × 217 pixels and 224 spectral reflectance bands. After deleting the bands covering the water absorption region, the number of spectral bands was reduced to 204. The scene contains 16 classes.
We divided the labeled samples in the three data sets into two parts: the training set and the testing set. For the Indian Pines data set, we chose 5% of the labeled samples for training; for the PaviaU data set, 2%; and for the Salinas data set, 1%. Tables 1-3 show the number of training and testing samples for each category on the Indian Pines, PaviaU, and Salinas data sets, respectively.

Parameter Analysis
Before training the model, some important factors that may affect the results should be analyzed. Due to the introduction of WGAN-GP, we used RMSprop as the optimizer in the VAE module and the discriminator module; the Adam optimizer was used in the classifier module. We mainly discuss three factors: the patch size, the network regularization method, and the parameters λ_1 and λ_2.

Analysis of the Size of Patches
To analyze the influence of the patch size, we calculated the overall accuracy (OA) on the three data sets using different patch sizes: 19 × 19, 23 × 23, 27 × 27, and 31 × 31, as shown in Table 4. For the Indian Pines and PaviaU data sets, the proposed method obtained the best performance with a patch size of 27 × 27, while for the Salinas data set the best results were obtained with 31 × 31. Therefore, in the experiments, we set the patch size to 27 × 27, which gives the better overall accuracy (OA) on two of the three data sets.

Analysis of the Regularization Methods

To analyze the regularization methods, we used different methods to regularize our proposed method, including batch normalization (BN) and dropout. Batch normalization normalizes the input of each layer, alleviates vanishing and exploding gradients, and accelerates convergence. Dropout enhances the generalization of the model by randomly dropping activations with a probability p during forward propagation. As shown in Table 5, the proposed method obtains the best performance when it uses both dropout and batch normalization. Therefore, we applied dropout and batch normalization (BN) in our proposed method.

Analysis of the λ 1 and λ 2 in the Classifier Loss Function
In order to analyze the influence of λ_1 and λ_2 in Equation (17), we conducted experiments with different values of λ_1 and λ_2: λ_1 from 0.5 to 1 and λ_2 from 0 to 0.5, with an interval of 0.1. We calculated the performance on the three data sets. As can be seen in Tables 6-8, when λ_1 = 0.8 and λ_2 = 0.2, our model obtains the best average classification accuracy. Therefore, we set λ_1 to 0.8 and λ_2 to 0.2 in our experiments.

Classification Results
To illustrate the effectiveness of our proposed method, SACVAEGAN was compared with several different hyperspectral image classification methods, including the SVM with a Radial Basis Function kernel (SVM-RBF), Two-CNN [31], 3D-CNN [22], DCGAN [36], DBGAN [41], and CVAEGAN [37]. SVM-RBF uses a Gaussian kernel function, and the parameter gamma of the kernel function was set to 0.125. The hyperparameters of the deep learning methods were all set to the values given in the corresponding papers. We used the classifier module for the final classification. All of the classification methods were tuned to their best settings, and the training and testing sets are shown in Tables 1-3.

Tables 9-11 show the results of the compared methods on the Indian Pines data set, the PaviaU data set, and the Salinas data set, respectively. From the tables, we can see that the deep learning methods perform better than the traditional method because they can extract better features. Moreover, the GAN-based methods outperform the other deep learning methods because they can generate additional training data, which helps the training. Our proposed method obtained the best performance. This may be due to the following reasons: First, compared with DCGAN and DBGAN, the VAE module is applied in our proposed method, which leads to higher accuracy. Second, compared with CVAEGAN, the conditional GAN is also applied in our proposed method, which can generate more high-quality training data and enhance the performance.

Figures 6-8 show the false-color images of the three HSI data sets, the corresponding ground-truth maps, and the classification maps of each method. The classification maps are consistent with the results shown in Tables 9-11. From these figures, we can see that the classification maps of our proposed method are clearer, less noisy, and closer to the ground truth, which indicates that our proposed method has a better classification capability than the other methods.

Analysis of the Size of the Training Set
To verify the robustness of SACVAEGAN to different sizes of training data, we selected different proportions of samples for training: for each of the three data sets, we selected 1% to 5% of the labeled samples. The results are shown in Figure 9. From Figure 9, we can see that our proposed method obtains better performance than the other methods, which demonstrates that it is more stable and more robust to different training-set sizes.

Investigation on the Run Time
To evaluate the efficiency of our proposed method, we show the training and testing times of the seven methods in Table 12. The experiments were run with an NVIDIA GTX 1660 GPU and an Intel i5-9300H 2.40-GHz CPU with 16 GB of RAM. It can be seen from Table 12 that the traditional method is much faster than the deep learning methods. The deep learning methods based on generative adversarial networks are slower to train than, and similar in testing speed to, the other deep learning methods. The method proposed in this paper requires a relatively long training time on the three data sets because it extracts both spectral and spatial features for training to enhance the performance.

Discussion
To analyze the influence of the self-attention mechanism, WGAN-GP loss, and additional generated samples on the classification accuracy of the proposed method, we conducted several ablation experiments to analyze the spatial features, spectral features, and overall accuracy under different conditions.

Spatial Feature Analysis
To illustrate the advantages of the spatial features of the generated samples, we plot the corresponding virtual samples under different conditions. As shown in Figure 10, the samples generated by our method are closer to the real sample distribution; the detailed features captured by the virtual samples can, in return, improve the classification performance of the model. To quantify the difference between the spatial features of the generated virtual samples and the real samples, we calculated the mean square error (MSE) between them:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2 \tag{24}$$

where x and x̂ denote the real and generated samples, respectively, and N is the number of pixels. As shown in Table 13, the samples generated by our proposed method are closer to the real samples compared with the other methods.

Spectral Feature Analysis
To illustrate the advantages of the spectral features, we plot the spectral feature maps corresponding to the virtual samples generated under different conditions in Figure 11. It can be seen that the spectral feature distribution of the virtual samples generated by our method is more consistent with that of the real samples. To quantify the difference in spectral features between the virtual samples and the real samples, we calculated the spectral information divergence (SID) between them. SID uses concepts from information theory to measure the difference between two spectra:

$$SID(p, q) = \sum_{i=1}^{B} p_i \log\frac{p_i}{q_i} + \sum_{i=1}^{B} q_i \log\frac{q_i}{p_i} \tag{25}$$

where p and q are the two spectra normalized to unit sum and B is the number of bands. The smaller the SID value, the more similar the spectra. As shown in Table 14, the proposed method obtains the best SID on the three data sets.
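A minimal numpy sketch of Equation (25) follows, assuming the spectra are non-negative; the epsilon guard against log(0) is an implementation detail, not part of the definition.

```python
# Spectral information divergence (SID) between two spectra x and y.
import numpy as np

def sid(x: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    p = x / (x.sum() + eps) + eps   # normalize spectra to probability vectors
    q = y / (y.sum() + eps) + eps
    # Symmetric sum of the two relative entropies, as in Equation (25).
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```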

Overall Accuracy Analysis
Self-attention enables the model to better extract global features; WGAN-GP makes the model more stable and improves performance; and the additional virtual samples improve the generalization performance and classification accuracy. To analyze the contribution of each component, we ran the experiments with and without each component 10 times. As shown in Table 15, each component contributes to the improvement of classification accuracy, and the proposed method achieves the best performance when all three strategies are used. Therefore, we apply all three strategies in our proposed method.
To analyze the impact of the WGAN-GP loss on model performance, we calculated the Fréchet Inception Distance (FID) of the proposed method with and without WGAN-GP. FID is a criterion for evaluating the performance of a GAN. The basic idea is to feed the training samples and the generated samples into the classifier and extract the features of its middle layer. Assuming that these features follow a multivariate Gaussian distribution, we estimate the means µ_train and µ_gen and the covariances σ_train and σ_gen of the Gaussian distributions of the training and generated samples, and then calculate the Fréchet distance between the two Gaussian distributions:

$$FID = \left\|\mu_{train} - \mu_{gen}\right\|_2^2 + tr\left(\sigma_{train} + \sigma_{gen} - 2\left(\sigma_{train}\sigma_{gen}\right)^{\frac{1}{2}}\right) \tag{26}$$

The smaller the value of FID, the better the performance of the GAN. We ran the experiments 10 times on each data set and report the average values. It can be seen from Table 16 that, when the WGAN-GP loss is added, the FID of our model is lower, which shows that adding WGAN-GP improves the performance of the SACVAEGAN model to a certain extent.
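A hedged numpy/scipy sketch of Equation (26) follows; the extraction of middle-layer features from the classifier is omitted, and taking the real part of the matrix square root is a standard numerical safeguard rather than part of the definition.

```python
# FID between Gaussian fits of real and generated feature sets, each of shape
# (n_samples, feature_dim).
import numpy as np
from scipy import linalg

def fid(feat_train: np.ndarray, feat_gen: np.ndarray) -> float:
    mu_t, mu_g = feat_train.mean(axis=0), feat_gen.mean(axis=0)
    sigma_t = np.cov(feat_train, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_t @ sigma_g, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts
    return float(np.sum((mu_t - mu_g) ** 2)
                 + np.trace(sigma_t + sigma_g - 2 * covmean))
```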

Conclusions
In this paper, a self-attention-based conditional variational autoencoder generative adversarial network (SACVAEGAN) was proposed for hyperspectral classification. We combine the conditional GAN with CVAEGAN, which can generate more high-quality training data to enhance the performance. Moreover, the self-attention mechanism is applied in SACVAEGAN to extract better features, and a novel loss function is used to make the whole training process more stable. By incorporating the self-attention mechanism and the WGAN-GP loss, SACVAEGAN achieved better classification performance than state-of-the-art methods on three commonly used hyperspectral image data sets. In the future, we will explore more GAN-based models for HSI classification.

Conflicts of Interest:
The authors declare no conflicts of interest.