Variational Generative Adversarial Network with Crossed Spatial and Spectral Interactions for Hyperspectral Image Classification

Abstract: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have been widely used in hyperspectral image classification (HSIC) tasks. However, the virtual HSI samples generated by VAEs are often ambiguous, and GANs are prone to mode collapse, both of which ultimately lead to poor generalization. Moreover, most of these models only consider the extraction of spectral or spatial features. They fail to combine the two branches interactively and ignore the correlation between them. Consequently, the variational generative adversarial network with crossed spatial and spectral interactions (CSSVGAN) is proposed in this paper, which includes a dual-branch variational Encoder to map spectral and spatial information to different latent spaces, a crossed interactive Generator to improve the quality of generated virtual samples, and a Discriminator stuck with a classifier to enhance the classification performance. Combining these three subnetworks, the proposed CSSVGAN achieves excellent classification by ensuring the diversity of generated samples and by making spectral and spatial features interact in a crossed manner. The superior experimental results on three datasets verify the effectiveness of this method.


Introduction
Hyperspectral images (HSI) contain hundreds of continuous and diverse bands rich in spectral and spatial information, which can distinguish land-cover types more efficiently than ordinary remote sensing images [1,2]. In recent years, hyperspectral image classification (HSIC) has become one of the most important tasks in the field of remote sensing, with wide application in scenarios such as urban planning, geological exploration, and agricultural monitoring [3][4][5][6].
Originally, models such as support vector machines (SVM) [7], logistic regression (LR) [8] and the k-nearest neighbors algorithm (KNN) [9] were widely used in HSI classification tasks for their intuitive outcomes. However, most of them only utilize handcrafted features, which fail to embody the distribution characteristics of different objects. To solve this problem, a series of deep discriminative models, such as convolutional neural networks (CNNs) [10][11][12], recurrent neural networks (RNNs) [13] and deep neural networks (DNNs) [14], have been proposed to optimize the classification results by fully utilizing and abstracting the limited data. Despite great progress, these methods only analyze the spectral characteristics through an end-to-end neural network without full consideration of the special properties contained in HSI. Therefore, the extraction of high-level and abstract features in HSIC remains a challenging task. Meanwhile, jointed spectral-spatial feature extraction methods [15,16] have aroused wide interest in the geoscience and remote sensing community [17]. Du proposed a jointed network to extract spectral and spatial features with dimensionality reduction [18]. Zhao et al. proposed a hybrid spectral CNN (HybridSN) to better extract double-way features [19], which combined a spectral-spatial 3D-CNN with a spatial 2D-CNN to improve the classification accuracy.
Although the methods above enhance the extraction of spectral and spatial features, they are still discriminative models in essence, which can neither calculate the prior probability nor describe the unique features of HSI data. In addition, HSI data are expensive and scarce to acquire, and labeling the samples by field investigation requires huge human resources. These characteristics make it impractical to obtain enough labeled samples for training. Therefore, deep generative models have emerged to meet this need. The variational autoencoder (VAE) [20] and the generative adversarial network (GAN) [21] are the representative generative models.
Liu [22] and Su [23] used VAEs to ensure the diversity of the generated data that were sampled from the latent space. However, the generated HSI virtual samples are often ambiguous, which cannot guarantee similarities with the real HSI data. Therefore, GANs have also been applied for HSI generation to improve the quality of generated virtual data. GANs strengthen the ability of discriminators to distinguish the true data sources from the false by introducing "Nash equilibrium" [24][25][26][27][28][29]. For example, Zhan [30] designed a 1-D GAN (HSGAN) to generate the virtual HSI pixels similar to the real ones, thus improving the performance of the classifier. Feng [31] devised two generators to generate 2D-spatial and 1D-spectral information respectively. Zhu [32] exploited 1D-GAN and 3D-GAN architectures to enhance the classification performance. However, GANs are prone to mode collapse, resulting in poor generalization ability of HSI classification.
To overcome the limitations of VAEs and GANs, the jointed VAE-GAN framework has been proposed for HSIC. Wang proposed a conditional variational autoencoder with an adversarial training process for HSIC (CVA²E) [33]. In this work, a GAN was spliced with a VAE to realize high-quality restoration of the samples and achieve diversity. Tao et al. [34] proposed the semi-supervised variational generative adversarial networks with a collaborative relationship between the generation network and the classification network to produce meaningful samples that contribute to the final classification. To sum up, in VAE-GAN frameworks, the VAE focuses on encoding the latent space, providing creativity in the generated samples, while the GAN concentrates on replicating the data, contributing to the high quality of the virtual samples.
Spectral and spatial information are two typical characteristics of HSI, both of which must be taken into account for HSIC. Nevertheless, the distributions of spectral and spatial features are not identical. Therefore, it is difficult for a single encoder in a VAE to cope with such a complex situation. Meanwhile, most of the existing generative methods use spectral and spatial features separately for HSIC, which limits the ability of the generative model to produce realistic virtual samples. In fact, the spectral and spatial features are closely correlated and cannot be treated separately. Interaction between spectral and spatial information should be established to refine the generated virtual samples for better classification performance.
In this paper, a variational generative adversarial network with crossed spatial and spectral interactions (CSSVGAN) was proposed for HSIC, which consists of a dual-branch variational Encoder, a crossed interactive Generator, and a Discriminator stuck together with a classifier. The dual-branch variational Encoder maps spectral and spatial information to different latent spaces. The crossed interactive Generator reconstructs the spatial and spectral samples from the latent spectral and spatial distribution in a crossed manner. Notably, the intersectional generation process promotes the consistency of learned spatial and spectral features and simulates the highly correlated spatial and spectral characteristics of true HSI. The Discriminator receives the samples from both generator and original training data to distinguish the authenticity of the data. To sum up, the variational Encoder ensures diversity, and the Generator guarantees authenticity. The two components place higher demands on the Discriminator to achieve better classification performance.
Compared with the existing literature, this paper is expected to make the following contributions:
• The dual-branch variational Encoder in the jointed VAE-GAN framework is developed to map spectral and spatial information into different latent spaces, which provides discriminative spectral and spatial features and ensures the diversity of generated virtual samples.
• The crossed interactive Generator is proposed to improve the quality of generated virtual samples, which exploits the consistency of learned spatial and spectral features to imitate the highly correlated spatial and spectral characteristics of HSI.
• The variational generative adversarial network with crossed spatial and spectral interactions is proposed for HSIC, where the diversity and authenticity of generated samples are enhanced simultaneously.
• Experimental results on the three public datasets demonstrate that the proposed CSSVGAN achieves better performance compared with other well-known models.
The remainder of this paper is arranged as follows. Section 2 introduces VAEs and GANs. Section 3 provides the details of the CSSVGAN framework and the crossed interactive module. Section 4 evaluates the performance of the proposed CSSVGAN through comparison with other methods. The results of the experiment are discussed in Section 5 and the conclusion is given in Section 6.

Variational Autoencoder
The variational autoencoder is a variant of the standard AE, first proposed by Kingma et al. [35]. The essence of the VAE is to construct an exclusive distribution for each sample X and then draw a sample from it, represented by Z. It brings the Kullback-Leibler (KL) divergence [36] into the sampling process as a penalty that constrains it. The reconstructed data can then be turned into generated simulation data through deep training. This principle gives the VAE a significant advantage in processing hyperspectral images, whose samples are expensive and rare. The VAE model assumes that the posterior distribution $\rho(Z|X)$, rather than $\rho(Z)$, obeys the normal distribution. It then finds the mean $\mu$ and variance $\sigma^2$ of $\rho(Z|X_k)$ corresponding to each $X_k$ through the training of neural networks (where $X_k$ represents a sample of the original data and $\rho(Z|X_k)$ represents the posterior distribution). Another particularity of the VAE is that it makes all $\rho(Z|X)$ align with the standard normal distribution $N(0, 1)$. Taking account of the complexity of HSI data, the VAE has superiority over the AE in terms of noise interference [37]. It can prevent the occurrence of zero noise, increase the diversity of samples, and further ensure the generation ability of the model.
A VAE model consists of two parts: Encoder M and Decoder N. M is an approximator for the probability function $m_\tau(z|x)$, and N generates the posterior's approximate value $n_\theta(x, z)$. $\tau$ and $\theta$ are the parameters of the deep neural networks, which are trained to optimize the following objective functions jointly.
Among them, R calculates the reconstruction loss of a given sample x in the VAE model. The framework of the VAE is described in Figure 1, where $e_i$ represents a sample of the standard normal distribution, corresponding with $X_k$ one to one.
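The sampling step described above, in which a standard normal sample is paired with each input, is commonly implemented with the reparameterization trick. The following is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `reparameterize`, the use of the log-variance as the encoder output, and the example values are all hypothetical.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Draw z = mean + sigma * e element-wise, with e ~ N(0, 1).

    `log_var` is the logarithm of the variance (an assumed parameterization),
    so sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mean = [0.5, -1.0, 2.0]         # hypothetical encoder mean for one sample X_k
log_var = [0.0, -2.0, 0.5]      # hypothetical encoder log-variance
z = reparameterize(mean, log_var, rng)
```

Because the randomness enters only through e, the latent sample stays a differentiable function of the mean and variance, which is what allows the Encoder to be trained by gradient descent.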

Generative Adversarial Network
The generative adversarial network was put forward by Goodfellow et al. [24]. It trains the generative model with a minimax game based on game theory. The GAN has achieved remarkable results in representing the distribution of latent variables thanks to its special structure, which has attracted growing attention in the field of visual image processing. A GAN model includes two subnets: the generator G, denoted as $G(z; \theta_g)$, and the discriminator D, denoted as $D(x; \theta_d)$, where $\theta_g$ and $\theta_d$ are the parameters of the deep neural networks. G shows a prominent capacity for learning the mapping of latent variables and synthesizing new similar data from this mapping, represented by G(z). The function of D is to take the original HSI or the fake image generated by G as input and then distinguish its authenticity. The architecture of the GAN is shown in Figure 2. After the game training, G and D would each maximize the log-likelihood and achieve the best generation effect by competing with each other. The expression of the above process is as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P(x)}[\log D(x)] + \mathbb{E}_{z \sim P_g(z)}[\log(1 - D(G(z)))]$$
where $P(x)$ represents the real data distribution and $P_g(z)$ means the distribution of the samples generated by G. The game reaches a global equilibrium between the two players when $P(x)$ equals $P_g(z)$. In this case, the best performance of D(x) can be expressed as:
$$D^*(x) = \frac{P(x)}{P(x) + P_g(x)}$$
However, the over-confidence of D would cause inaccurate identification results and make the generated data far away from the original HSI. To tackle the problem, endeavors have been made to improve the accuracy of HSIC by modifying the loss, such as WGAN [38], LSGAN [39], CycleGAN [40] and so on. Salimans [41] raised a deep convolutional generative adversarial network (DCGAN) to enhance the stability of the training and improve the quality of the results. Subsequently, Alec et al. [42] proposed a one-sided label smoothing idea named improved DCGAN, which multiplied the positive sample label by $\alpha$ and the negative sample label by $\beta$; that is, the coefficients of positive and negative samples in the objective function of D were no longer from 0 to 1, but from $\alpha$ to $\beta$ ($\beta$ in the real application could be set to 0.9). It aimed to solve the problems described as follows:
In this instance, the GAN can reduce the disadvantage of over-confidence and make the generated samples more authentic.
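The effect of one-sided label smoothing on the discriminator objective can be illustrated with a small numerical sketch. It assumes a plain binary cross-entropy loss and the 0.9 target mentioned above; the function names and default arguments are hypothetical and are not taken from any cited implementation.

```python
import math


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy of one predicted probability against a (soft) target."""
    p = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(real_probs, fake_probs, real_target=0.9, fake_target=0.0):
    """Average BCE over real samples (smoothed target) and fake samples (target 0)."""
    losses = [bce(real_target, p) for p in real_probs]
    losses += [bce(fake_target, p) for p in fake_probs]
    return sum(losses) / len(losses)


# With a smoothed target of 0.9, an over-confident output on real data
# is penalized more than a calibrated one.
calibrated = bce(0.9, 0.9)
over_confident = bce(0.9, 0.999)
```

In this sketch the loss on real samples is smallest when the discriminator outputs 0.9 rather than a value close to one, which is the mechanism by which smoothing discourages over-confidence.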

The Overall Framework of CSSVGAN
The overall framework of CSSVGAN is shown in Figure 3. In data preprocessing, assume that the HSI cuboid X contains n pixels and that the number of spectral bands of each pixel is $p_x$, so that X can be expressed as $X \in \mathbb{R}^{n \times p_x}$. The HSI is then divided into several patch cubes of the same size. The labeled pixels are marked as $X_1 = \{x_i^1\} \in \mathbb{R}^{s \times s \times p_x \times n_1}$, and the unlabeled pixels are marked as $X_2 = \{x_i^2\} \in \mathbb{R}^{s \times s \times p_x \times n_2}$. Among them, s, $n_1$ and $n_2$ stand for the adjacent spatial size of the HSI cuboids, the number of labeled samples and the number of unlabeled samples respectively, and $n = n_1 + n_2$.
It is noteworthy that HSI classification is performed at the pixel level. Therefore, in this paper, the CSSVGAN framework uses patch cubes of size $9 \times 9 \times p_x$ as the inputs of the Encoder, where $p_x$ denotes the number of spectral bands of each pixel. A tensor represents the variables and outputs of each layer. Firstly, the spectral latent variable $Z_1$ and the spatial latent variable $Z_2$ are obtained by feeding the above $X_1$ into the dual-branch variational Encoder. Secondly, these two latent variables are passed to the crossed interactive Generator module to obtain the virtual data $F_1$ and $F_2$. Finally, the virtual data are mixed with $X_1$ and fed into the Discriminator for adversarial training, and the classifier produces the predicted classification results $\hat{Y} = \{\hat{y}_i\}$.
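The division of the HSI into pixel-centered patch cubes can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing code of the paper: the cube is stored as nested lists indexed `[row][column][band]`, the patch size `s` is odd, and border pixels are zero-padded (the padding strategy is not specified in the text). The helper name `extract_patch` is hypothetical.

```python
def extract_patch(cube, row, col, s):
    """Return the s x s x p neighborhood centered on (row, col).

    Positions outside the image are filled with zeros (an assumed padding rule).
    """
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    half = s // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                patch_row.append(list(cube[r][c]))
            else:
                patch_row.append([0.0] * bands)
        patch.append(patch_row)
    return patch


# Toy 12 x 12 image with 4 bands; each value encodes its own position.
cube = [[[float(100 * r + 10 * c + b) for b in range(4)]
         for c in range(12)] for r in range(12)]
patch = extract_patch(cube, 0, 0, 9)   # 9 x 9 x 4 patch around the corner pixel
```

The center element of every patch is the pixel being classified, so the label of a patch is the label of its center pixel.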

The Dual-Branch Variational Encoder in CSSVGAN
In the CSSVGAN model mentioned above, the Encoder (Figure 4) is composed of two branches, a spectral feature extraction branch $E_1$ and a spatial feature extraction branch $E_2$, to generate more diverse samples. In the $E_1$ module, the size of the 3D convolution kernel is (1 × 1 × 2), the stride is (2, 2, 2) and the spectral features are marked as $Z_1$. The implementation details are described in Table 1. Likewise, in the $E_2$ module, the 3D convolution kernel, the stride and the spatial features are (5 × 5 × 1), (2, 2, 2) and $Z_2$ respectively, as described in Table 2.

Meanwhile, to keep the distribution of the samples consistent with that of the original data, the KL divergence is utilized to constrain $Z_1$ and $Z_2$ separately. Assuming that the mean and variance of $Z_i$ are expressed as $Z_{mean_i}$ and $Z_{var_i}$ (i = 1, 2), the loss function in the training process is as follows:

where $p(z_i|x)$ is the posterior distribution of the latent feature vectors in the Encoder module, and its calculation is based on the Bayesian formula as shown below. However, when the dimension of Z is too high, the calculation of $P(x)$ is not feasible. In this case, a known distribution $q(z_i|x)$ is required to approximate $p(z_i|x)$, and the quality of the approximation is measured by the KL divergence. By minimizing the KL divergence, the approximation of $p(z_i|x)$ can be obtained. $\theta$ and $\phi$ represent the parameters of the distribution functions p and q respectively.
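When the approximate posterior is a diagonal Gaussian and the target is the standard normal distribution, the KL divergence has a well-known closed form, $\frac{1}{2}\sum(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$. The sketch below evaluates that closed form for one latent vector. It is an illustration only: the function name and the log-variance parameterization are assumptions, and the exact loss used in the paper may differ in scaling.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))


# The penalty vanishes when the latent distribution is already standard normal
# and grows as the mean moves away from zero or the variance away from one.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

Applying this penalty separately to the spectral and the spatial latent variables corresponds to constraining $Z_1$ and $Z_2$ separately, as described above.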
Formula (6) in the back is provided with a constant term logN, the entropy of empirical distribution q(x). The advantage of it is that the optimization objective function is more explicit, that is, when

The Crossed Interactive Generator in CSSVGAN
In CSSVGAN, the crossed interactive Generator module plays a role in data restoration of VAE and data expansion of GAN, which includes the spectral Generator G 1 and the spatial Generator G 2 in the crossed manner. G 1 accepts the spatial latent variables Z 2 to generate spectral virtual data F 1 , and G 2 accepts the spectral latent variables Z 1 to generate spatial virtual data F 2 .
As shown in Figure 5, the spectral Generator $G_1$ uses 3D convolutions with a (1 × 1 × 2) kernel and (2, 2, 2) strides to convert the spatial latent variables $Z_2$ into generated samples. Similarly, the spatial Generator $G_2$ uses (5 × 5 × 1) convolutions with (2, 2, 2) strides to transform the spectral latent variables $Z_1$ into generated samples. In this way, the correlation between spectral and spatial features in HSI can be fully considered to further improve the quality and authenticity of the generated samples. The implementation details of $G_1$ and $G_2$ are described in Tables 3 and 4.
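The crossed data flow between the two Encoder branches and the two Generators can be summarized in a few lines. The sketch below only illustrates the wiring; the networks are passed in as arbitrary callables, and the toy stand-ins used in the example are hypothetical placeholders rather than the 3D convolutional modules of Tables 1-4.

```python
def crossed_forward(x, e1, e2, g1, g2):
    """Run one forward pass with crossed spectral/spatial interaction."""
    z1 = e1(x)    # spectral latent variable Z1 from the spectral branch E1
    z2 = e2(x)    # spatial latent variable Z2 from the spatial branch E2
    f1 = g1(z2)   # spectral Generator G1 consumes the SPATIAL latent Z2
    f2 = g2(z1)   # spatial Generator G2 consumes the SPECTRAL latent Z1
    return {"z1": z1, "z2": z2, "f1": f1, "f2": f2}


# Toy stand-ins that make the routing visible.
out = crossed_forward(
    [1.0, 2.0, 3.0],
    e1=lambda x: sum(x),              # placeholder "spectral" encoder
    e2=lambda x: max(x),              # placeholder "spatial" encoder
    g1=lambda z: ("spectral", z),     # placeholder spectral generator
    g2=lambda z: ("spatial", z),      # placeholder spatial generator
)
```

In the toy run, the spectral output carries the value produced by the spatial encoder and vice versa, which is the crossing that distinguishes this design from two independent VAE-GAN branches.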

Because the Generator and the Discriminator compete with each other until they reach the Nash equilibrium, the Generator has two objective functions, as shown below.
where n is the number of samples, i = 1, 2, $y_j$ means the label of the virtual samples, and $\bar{y}_j$ represents the label of the original data corresponding to $y_j$. The above formula makes the virtual samples generated by the crossed interactive Generator as similar as possible to the original data.
Binary Loss is a logarithmic loss function that can be applied to binary classification tasks, where y is the label (either true or false) and p(y) is the probability that each of the N sample points belongs to the real label. The total loss is zero only if $p(y_j)$ equals $y_j$ for every sample.
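The two Generator objectives described above, a mean-square error against the original data and a binary loss that treats the generated data as real, can be sketched numerically. This is an assumption-laden illustration: the function names are hypothetical, the inputs are flat lists of numbers, and the binary loss is written for a target label of one.

```python
import math


def mse_loss(generated, original):
    """Mean-square error between generated values and the original values."""
    return sum((g - o) ** 2 for g, o in zip(generated, original)) / len(generated)


def generator_binary_loss(disc_probs_on_fake, eps=1e-12):
    """Binary loss with target 1: small when the Discriminator rates fakes as real."""
    return -sum(math.log(max(p, eps)) for p in disc_probs_on_fake) / len(disc_probs_on_fake)


reconstruction = mse_loss([0.0, 0.0], [1.0, 3.0])   # (1 + 9) / 2
fooled = generator_binary_loss([0.9, 0.8])          # Discriminator mostly fooled
caught = generator_binary_loss([0.1, 0.2])          # Discriminator mostly not fooled
```

The first term pulls the virtual samples toward the original data, while the second rewards the Generator whenever the Discriminator assigns a high probability of being real to its outputs.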

The Discriminator Stuck with a Classifier in CSSVGAN
As shown in Figure 6, the Discriminator needs to identify the generated data as false and the real HSI data as true. This process can be regarded as a binary classification task using one-sided label smoothing: the real HSI data are labeled 0.9 and the false data zero. Its loss function, marked as Binary Loss D, is the same as Formula (10) above. Moreover, the classifier is stuck as an interface to the output of the Discriminator, and the classification results are calculated directly through the SoftMax layer, where C represents the total number of labels in the training data. As mentioned above, the Encoder ensures diversity and the Generator guarantees authenticity. All these contributions place higher demands on the Discriminator to achieve better classification performance. Thus, the CSSVGAN framework yields a better classification result. The implementation details of the Discriminator in CSSVGAN are described in Table 5, with 3D convolutions of (5 × 5 × 2) and strides of (2, 2, 2). Identifying C categories is a multi-class classification task, and the SoftMax method is taken as the standard for HSIC. As shown below, the CSSVGAN method should allocate each sample x to the most likely of the C classes to get the predicted classification results. The specific formula is as follows: Then the category of X can be expressed as the formula below: where S, C, X, $Y_i$ signify the SoftMax function, the total number of categories, the input of SoftMax, and the probability that the predicted object belongs to the corresponding class, respectively. $X_i$, like $X_j$, is a sample of one certain category. Therefore, the following formula can be used for the loss function of the objective constraint.
$$\mathrm{C\ Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[p(y_{i1}) \cdot \log(y_{i1}) + p(y_{i2}) \cdot \log(y_{i2}) + \cdots + p(y_{iC}) \cdot \log(y_{iC})\right],$$
where n means the total number of samples, C represents the total number of categories, and y denotes the single label (either true or false) with the same description as above.
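The SoftMax classification step and the multi-class loss can be illustrated with a short sketch. It uses the standard definitions of softmax, arg-max prediction and categorical cross-entropy; the function names and the example scores are hypothetical, and the sketch works on a single sample rather than a batch.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to one."""
    shift = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the index of the most likely class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(one_hot, probs, eps=1e-12):
    """Categorical cross-entropy of one sample against its one-hot label."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


scores = [0.1, 2.0, -1.0]                     # hypothetical classifier outputs, C = 3
probs = softmax(scores)
label = predict(scores)
loss = cross_entropy([0.0, 1.0, 0.0], probs)
```

Averaging `cross_entropy` over all n training samples gives a loss of the same form as the multi-class loss above.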

The Total Loss of CSSVGAN
As illustrated in Figure 3, the total loss of the CSSVGAN model can be divided into four parts: two KL divergence constraint losses and a mean-square error loss from the Encoder, two binary losses from the Generator, one binary loss from the Discriminator and one multi-classification loss from the classifier. The overall formula can be expressed as: where $L_1$ and $L_2$ represent the loss between $Z_1$ or $Z_2$ and the standard normal distribution respectively, as described in Section 3.2. MSE Loss 1 and MSE Loss 2 signify the mean-square errors of $y_1$ and $y_2$ in Section 3.3 respectively. MSE Loss 1_2 calculates the mean-square error between $y_1$ and $y_2$. The purpose of Binary Loss 1 and Binary Loss 2 is to treat the virtual data $F_1$ and $F_2$ (in Section 3.3) as true, with a value of one. Binary Loss D denotes that the Discriminator identifies $F_1$ and $F_2$ as false data with a value of zero. Finally, C Loss is the multi-class loss of the classifier.

Dataset Description
In this paper, three representative hyperspectral datasets recognized by the remote sensing community (i.e., Indian Pines, Pavia University and Salinas) are adopted as benchmark datasets. The details are as follows:
(1) Indian Pines (IP): The first dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Northwestern Indiana, USA. It includes 16 categories with a spatial resolution of approximately 20 m per pixel. Samples are shown in Figure 7. The spectral coverage of AVIRIS ranges from 0.4 to 2.5 µm and includes 220 bands for continuous imaging of ground objects (20 bands are influenced by noise or water absorption, so only 200 bands are left for research), resulting in a total image size of 145 × 145 × 200. However, since the dataset has a complex sample distribution, the numbers of labeled samples per category are very imbalanced. As some classes have more than 2000 samples while others have fewer than 30, it is relatively difficult to achieve a high-precision classification of the IP HSI.
(2) Pavia University (PU): The second dataset is a part of the hyperspectral image data of the city of Pavia in Italy, acquired by the German airborne reflective optics spectral imaging system (ROSIS-03) in 2003, and contains 9 categories (see Figure 8). The spatial resolution of this imager is 1.3 m, and it covers 115 continuous wavebands in the range of 0.43-0.86 µm. Among these bands, 12 were eliminated due to the influence of noise. Therefore, the image with the remaining 103 spectral bands and a size of 610 × 340 is normally used.
(3) Salinas (SA): The third dataset recorded the image of Salinas Valley in California, USA, which was also captured by AVIRIS. Unlike the IP dataset, it has a spatial resolution of 3.7 m and consists of 224 bands. However, researchers generally utilize the image of 204 bands after excluding 20 bands affected by water reflection. Thus, the size of the Salinas is 512 × 217, and Figure 9 depicts the color composite of the image as well as the ground truth map.

Evaluation Measures
In the experiments, the available data of each dataset were randomly divided into two parts, a small part for training and the rest for testing. Both the training and the testing samples were arranged by pixel, each of size $1 \times p_x$ ($p_x$ is set to 80 in this paper). Each pixel can be treated as a feature of a certain class, corresponding to a unique label, and is classified by the classifier stuck to the Discriminator. Tables 6-8 list the numbers of training and testing samples for the three datasets.
Taking the phenomenon of "foreign matter of the same spectrum in surface cover" [15,43] into consideration, the average accuracy was reported to evaluate the experimental results quantitatively. Meanwhile, the proposed method was compared with the other methods by three well-known indexes, i.e., overall accuracy (OA), average accuracy (AA) and kappa coefficient (KA) [44], which can be denoted as below:
$$OA = \mathrm{sum}(\mathrm{diag}(M)) / \mathrm{sum}(M)$$
$$AA = \mathrm{mean}(\mathrm{diag}(M) \,./\, \mathrm{sum}(M, 2))$$
$$Kappa = \frac{OA - (\mathrm{sum}(M, 1) \times \mathrm{sum}(M, 2)) / \mathrm{sum}(M)^2}{1 - (\mathrm{sum}(M, 1) \times \mathrm{sum}(M, 2)) / \mathrm{sum}(M)^2}$$
where m represents the number of land cover categories and $M \in \mathbb{R}^{m \times m}$ symbolizes the confusion matrix of the classification results. Then, $\mathrm{diag}(M) \in \mathbb{R}^{m \times 1}$ is the vector of diagonal elements of M, and $\mathrm{sum}(\cdot) \in \mathbb{R}^{1}$ is the sum of all elements of a matrix, where $(\cdot, 1)$ means summing over each column and $(\cdot, 2)$ means summing over each row. Finally, $\mathrm{mean}(\cdot) \in \mathbb{R}^{1}$ is the mean value of all elements, and ./ denotes element-wise division.
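The three indexes can be computed directly from a confusion matrix, as in the sketch below. It assumes a square matrix stored as nested lists with rows as true classes and columns as predicted classes, and that every class has at least one test sample; the function name and the example matrix are hypothetical.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix (rows = true class)."""
    m = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(m)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(m)) for j in range(m)]

    oa = sum(diagonal) / total                                   # overall accuracy
    aa = sum(d / r for d, r in zip(diagonal, row_sums)) / m      # mean per-class accuracy
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)                   # chance-corrected agreement
    return oa, aa, kappa


confusion = [
    [50, 2, 0],
    [3, 40, 5],
    [1, 4, 45],
]
oa, aa, kappa = classification_scores(confusion)
```

OA weights every test pixel equally, whereas AA weights every class equally, which is why AA is the more informative index when the class sizes are strongly imbalanced.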

Experimental Setting
In this section, to verify the effectiveness of CSSVGAN, several classical hyperspectral classification methods such as SVM [45], Multi-3DCNN [46], SS3DCNN [47] and SSRN [15], deep generative algorithms such as VAE and GAN, and jointed VAE-GAN models such as CVA²E [33] and the semi-supervised variational generative adversarial networks (SSVGAN) [34] were used for comparison.
To ensure the fairness of the comparative experiments, the best hyperparameter settings were adopted for each method based on their papers. All experiments were executed on an NVIDIA GeForce RTX 2070 SUPER GPU with a memory of 32 GB. Moreover, Adam [48] was used as the optimizer with an initial learning rate of $1 \times 10^{-3}$ for the Generator and $1 \times 10^{-4}$ for the Discriminator, and the number of training epochs was set to 200.

Experiments Results
In all experiments in this paper, the training samples were randomly selected from the labeled pixels, and the accuracies on the three datasets are reported to two decimal places in this section.

Experiments on the IP Dataset
The experiment on the IP dataset was performed to evaluate the proposed CSSVGAN model quantitatively against other methods for HSIC. For the labeled samples, 5% of each class was randomly selected for training. The quantitative evaluation of the various methods is shown in Table 9, which describes the classification accuracy of the different categories in detail, as well as the indicators OA, AA and kappa for the different methods. The best value is marked in dark gray. First of all, although SVM achieves good accuracy, there is still a certain gap from exact classification because the IP dataset contains highly textured spatial information, which SVM does not exploit. Secondly, some conventional deep learning methods (such as M3DCNN and SS3DCNN) do not perform well in some categories due to the limited number of training samples. Thirdly, the algorithms with jointed spectral-spatial feature extraction (such as SSRN) show better performance, which indicates the necessity of combining spectral and spatial information for HSIC. Moreover, the virtual samples generated by VAE tend to be fuzzy and cannot guarantee similarity with the real data, while GAN lacks sampling constraints, leading to low quality of the generated samples. In contrast with these two deep generative models, CSSVGAN overcomes their shortcomings. Finally, compared with CVA²E and SSVGAN, the two latest jointed models published in IEEE, CSSVGAN uses dual-branch feature extraction and the crossed interactive method, and its results suggest that these designs are more suitable for HSIC. They increase the diversity of samples and make the generated data more similar to the original.
Among these comparative methods, CSSVGAN acquires the best accuracy in OA, AA and kappa, improving on them by at least 2.57%, 1.24% and 3.81%, respectively. In addition, although all the methods have different degrees of misclassification, CSSVGAN achieves perfect accuracy in classes such as "Oats" and "Wheat". The classification visualizations of the comparative experiments on Indian Pines are shown in Figure 10. From Figure 10, it can be seen that CSSVGAN reduces the noisy scattered points and effectively improves the regional uniformity. That is because CSSVGAN can generate more realistic images from diverse samples.

Experiments on the PU Dataset
Unlike in the IP dataset experiments, 1% of the labeled samples were selected for training and the rest for testing. Table 10 shows the quantitative evaluation of each class in the comparative experiments. The best value is marked in dark gray, and the classification visualizations on Pavia University are shown in Figure 11.
Table 10 shows that, as a non-deep learning algorithm, SVM reaches a classification result of 86.36%, which is respectable. VAE shows good performance on the "Painted metal sheets" class but low accuracy on the "Self-blocking bricks" class, which reflects the "fuzzy" phenomenon of a single VAE network in the training of individual classes. SSRN achieves a completely correct classification of "Shadows", but it is outperformed by CSSVGAN overall. In terms of OA, CSSVGAN improved by 12.75%, 30.68%, 22.52%, 9.83%, 14.03%, 11.53%, 7.14% and 6.18% respectively, and in terms of Kappa, CSSVGAN improved by 17.07%, 42.23%, 30.03%, 13.62%, 19.25%, 15.16%, 13.19% and 8.3% respectively, compared with the other eight algorithms. In Figure 11, the proposed CSSVGAN has better boundary integrity and better classification accuracy in most of the classes because the Encoder can ensure the diversity of samples, the Generator can promote the authenticity of the generated virtual data, and the Discriminator can adjust the overall framework to obtain the optimal results.

Experiments on the SA Dataset
The experimental setting on the Salinas dataset is the same as that on PU. Table 11 shows the quantitative evaluation of each class for the various methods, with the best results marked in dark gray. The classification visualization of the comparative experiments on Salinas is shown in Figure 12. Table 11 shows that in terms of OA, AA and Kappa, CSSVGAN improved by at least 0.57%, 1.27% and 0.62% compared with the others. Moreover, it performs better on the "brocoli-green-weeds-1" and "stubble" classes, with a test accuracy of 100%. For the accuracies of the other classes, although SSRN or VAE prevails, CSSVGAN is almost equal to them. It can be seen in Figure 12 that CSSVGAN has smoother edges and the least misclassification, which further supports that the proposed CSSVGAN can generate more realistic virtual data according to the diversity of the extracted sample features.

The Ablation Experiment in CSSVGAN
Taking IP, PU and SA datasets as examples, the frameworks of ablation experiments are shown in Figure 13, including NSSNCSG, SSNCSG and SSNCDG.
As shown in Table 12, compared with NSSNCSG, the OA of CSSVGAN on the IP, PU and SA datasets increased by 1.02%, 6.90% and 4.63%, respectively.
This shows that using dual-branch spectral-spatial feature extraction is better than not using it, because the distributions of spectral and spatial features are not identical and a single Encoder cannot handle this complex situation. Consequently, using the dual-branch variational Encoder can increase the diversity of samples. Under the constraint of the KL divergence, the distribution of the latent variables is more consistent with the distribution of the real data.
Compared with SSNCSG, the OA on the IP, PU and SA datasets increases by 0.99%, 1.07% and 0.39% respectively, which means that utilizing the crossed interactive method is more effective, and further indicates that the crossed interactive double Generator can fully learn the spectral and spatial information and generate spatial and spectral virtual samples of higher quality.
Finally, a comparison is made between SSNCDG and CSSVGAN, where the latter better improves the authenticity of virtual samples through the crossed manner. All these contributions of both the Encoder and the Generator place higher requirements on the Discriminator, improving its ability to distinguish true from false data and thus yielding more accurate final classification results.

Sensitivity to the Proportion of Training Samples
To verify the effectiveness of the proposed CSSVGAN, the three datasets were taken as examples. The percentage of training samples for each class was varied from 1% to 9% at 4% intervals, and 10% was added. Figures 14-16 show the OAs of all the comparative algorithms with various percentages of training samples.
It can be seen that CSSVGAN has the best effect at each proportion of training samples on the three datasets because CSSVGAN can learn the extracted features interactively, ensure diverse samples and improve the quality of generated images.
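Selecting a fixed percentage of each class for training amounts to a stratified random split, which can be sketched as follows. The details are assumptions for illustration: the function name is hypothetical, the per-class count is rounded up, and at least one sample per class is always kept for training so that very small classes are not left empty.

```python
import math
import random


def stratified_split(labels, fraction, seed=0):
    """Randomly assign `fraction` of the indices of each class to the training set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumed rounding rule: round up, and keep at least one sample per class.
        n_train = max(1, math.ceil(fraction * len(indices)))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Toy label vector with imbalanced classes, split at the proportions used above.
labels = [0] * 200 + [1] * 40 + [2] * 20
splits = {f: stratified_split(labels, f) for f in (0.01, 0.05, 0.09, 0.10)}
```

Sampling per class rather than over the whole image keeps rare classes represented in the training set even at the 1% setting.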

Investigation of the Proportion of Loss Function
Taking the IP dataset as an example, the proportions $\sigma_i$ (i = 1, 2, ..., 5) of the loss functions and other hyperparameters of each module are adjusted to observe their impact on classification accuracy, and the results are recorded in Table 13 (the best results are marked in dark gray). The learning rate is also an important factor, which is not discussed in detail here; the experiments show that $1 \times 10^{-3}$ for the Generator and $1 \times 10^{-4}$ for the Discriminator are the best settings. Table 13 reveals that when $\sigma_1$ to $\sigma_5$ are set to 0.35, 0.35, 0.1, 0.1 and 0.1 respectively, the CSSVGAN model achieves the best performance. Under this condition, the Encoder can acquire the maximum diversity of samples, the Discriminator is able to realize the most accurate classification, and the Generator is capable of generating the images most like the original data. Moreover, the best parameter combination $\sigma_1$ to $\sigma_5$ on the SA dataset is similar to that on IP, while on the PU dataset they are set to 0.3, 0.3, 0.1, 0.1 and 0.2.

Conclusions
In this paper, a variational generative adversarial network with crossed spatial and spectral interactions (CSSVGAN) is proposed for HSIC. It mainly consists of three modules: a dual-branch variational Encoder, a crossed interactive Generator, and a Discriminator stuck with a classifier. The experimental results on the three datasets show that CSSVGAN outperforms the other methods in terms of OA, AA and Kappa because of the dual-branch and crossed interactive designs. Moreover, using the dual-branch Encoder can ensure the diversity of generated samples by mapping spectral and spatial information into different latent spaces, and utilizing the crossed interactive Generator can imitate the highly correlated spatial and spectral characteristics of HSI by exploiting the consistency of learned spectral and spatial features. All these contributions enable the proposed CSSVGAN to give the best performance on the three datasets. In the future, we will work towards lightweight generative models and explore the application of the jointed "Transformer and GAN" model for HSIC.
Author Contributions: Conceptualization, Z.L. and X.Z.; methodology, Z.L., X.Z. and L.W.; software, Z.L., X.Z., L.W. and Z.X.; validation, Z.L., F.G. and X.C.; writing-original draft preparation, L.W. and X.Z.; writing-review and editing, Z.L., Z.X. and F.G.; project administration, Z.L. and L.W.; funding acquisition, Z.L. and L.W. All authors read and agreed to the published version of the manuscript.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Publicly available datasets were analyzed in this study, which can be found here: http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes, last accessed on 29 July 2021.