Adaptive Weighting Feature Fusion Approach Based on Generative Adversarial Network for Hyperspectral Image Classification

Recently, generative adversarial network (GAN)-based methods for hyperspectral image (HSI) classification have attracted research attention due to their ability to alleviate the challenges brought by limited labeled samples. However, several studies have demonstrated that existing GAN-based HSI classification methods are hampered by redundant spectral knowledge and cannot extract discriminative characteristics, which degrades classification performance. In addition, GAN-based methods frequently suffer from mode collapse, which seriously hinders their development. In this study, we propose a semi-supervised adaptive weighting feature fusion generative adversarial network (AWF2-GAN) to alleviate these problems. We introduce unlabeled data to address the issue of having a small number of samples. First, to build valid spectral–spatial feature engineering, the discriminator learns both the dense global spectrum and neighboring separable spatial context via well-designed extractors. Second, a lightweight adaptive feature weighting component is proposed for feature fusion; it considers four predictive fusion options, that is, adding or concatenating feature maps with similar or adaptive weights. Finally, to counter mode collapse, the proposed AWF2-GAN combines supervised center loss and unsupervised mean minimization loss for optimization. Quantitative results on two HSI datasets show that our AWF2-GAN achieves superior performance over state-of-the-art GAN-based methods.


Introduction
Because hyperspectral images (HSIs) contain hundreds of narrow, contiguous spectral bands, which enrich the surface semantic information of remote sensing images [1][2][3], accurate interpretation of HSI ground materials has received significant attention from the machine learning and remote sensing communities [4]. With their abundant spectral signatures and high-resolution spatial context, hyperspectral image data provide powerful technical support for applications such as urban road monitoring [5], crop pest control [6], and environmental protection [7]. Classification is the cornerstone of these HSI applications.
Hyperspectral image classification aims at assigning a unique, identifiable label to each pixel. Many recent studies have demonstrated that supervised deep learning methods can alleviate the challenges of high-dimensional non-linear characteristics in HSIs, and improve classification performance [8][9][10][11]. However, owing to the curse of dimensionality, if fewer labeled data are available, the Hughes phenomenon [12] will be observed as the number of trainable parameters increases. At present, three challenges still prevent deep learning methods from supplying precise and effective pixel-wise HSI classification maps. First, the redundant spectral characteristics of HSI pixels make conventional optical imagery analysis unsuitable for extracting discriminative features for hyperspectral interpretation. Second, traditional deep learning methods employ a set of spectral-spatial filter banks to represent advanced features of HSIs, yet effectively encoding spatial and spectral information remains a challenging task. Third, the generalization of deep learning models is constrained by the shortage of labeled pixels, which gives rise to insufficient classification accuracy. In this study, we analyze these challenges and offer promising suggestions to mitigate them.
The first challenge, i.e., the redundant hyperspectral characteristic of HSI pixels, originates from abundant spectral signatures and the high-similarity neighboring spatial context. Traditional models employ feature engineering to explain the semantic features of spectral bands. For example, Hu et al. [13] directly use a convolutional neural network (CNN) to learn the spectral domain; Chen et al. [14] and Zhao et al. [15] adopt dimensionality reduction methods to extract the principal components for training deep CNNs. Nevertheless, these methods do not fully exploit the feature representation ability of deep learning frameworks. Other scholars have introduced local neighboring spatial information, and achieved promising results. Mei et al. [16] incorporate the spatial context through band-wise means and standard deviations in a neighboring HSI cuboid. Li et al. [17] utilize pairs of neighboring pixels to extract spatial semantic features, and predict the final land-cover category with a majority voting strategy. Such research highlights the universality of HSI interpretation in the deep learning community. However, the aforementioned approaches focus only on statistics of spatial features, and overlook spectral purity characteristics.
The second challenge is derived from the complexity of HSI distribution, and can be interpreted as a "matter of the same spectrum in surface cover" and "the same objects but different spectrum in surface cover" [18]. Multiple works adopt neighboring spatial prior knowledge to improve CNNs and enrich spectral semantic information. For example, Li et al. [19] applied three-dimensional (3D) CNNs for accurate HSI classification. The approach in [20] produced a rough segmentation map by extracting CNN-based spectral-spatial features from a 3D hyperspectral cuboid. Other recent studies have shown that there are two main ways to learn a spectral-spatial representation for HSI classification. The first way is to build an explicit engineering framework of the two constituent features. For instance, Liang et al. [21] proposed a superpixel-based sparse auto-encoder to extract manifold features for classification, while Liu et al. [22] designed a superpixel-guided layerwise embedding CNN for remote sensing image classification. In the second paradigm, spectral-spatial features are implicitly learned from 3D homogeneous areas with annotations for the target pixels [23][24][25][26]. Zhong et al. [23] learned a set of spectral-spatial residual filter banks to extract continuous HSI features. Furthermore, Wang et al. [24] attempted to interpret HSI paths through dense convolutional filter banks, and obtained promising results. Zhu et al. [25] proposed an improved deep convolutional capsule neural network (Conv-CapsNet), which considered the pixel position attributes of HSIs. In addition, Cui et al. [26] utilized a multiscale pyramid to capture different spatial contexts of neighboring information for HSI classification. All the above methods improved the representation capacity of HSIs in learning spectral-spatial combinations. However, joint spectral-spatial filter banks that introduce spatial context limit the contribution of each individual component to HSIs.
Existing studies show that weighting fusion may be able to address this: Zhang et al. [27] applied a predictive weighting decision to highlight the validity of each component.
For the third challenge, obtaining a sufficient amount of labeled HSI data is usually costly and can raise many technical issues. Deep learning models have become irreplaceable in mainstream HSI classification methods because of their representation ability. Several scholars suggest that these models require a large amount of data for generalization. For example, Chen et al. [14] and Li et al. [17] extended the training data by adding noise and pixel pairs. It is worth mentioning that, in contrast with traditional visual images [28], which contain hundreds of categories, HSIs include far fewer land-cover categories that need to be classified. Furthermore, several HSI scenes are extremely lacking in training samples. Therefore, the theory that deep learning models require massive data for training is not suitable for HSIs. The required amount of unlabeled data for HSIs remains a topic of discussion. Several studies aimed at semi-supervised learning employed a small number of labeled HSI samples and a large number of unlabeled HSI samples for training. Mnih et al. [29] propagated labels from annotated HSI pixels to non-annotated ones via multilayer neural networks. Fang et al. [30] adopted a semi-supervised double-branch convolutional network with a resampling strategy to ensure sufficient training. Furthermore, Hu et al. [31] applied shape-adaptive neighboring information to select valuable unlabeled data. Although such studies have obtained accurate classification results, these results may stem from homogeneous areas whose features are highly spatially correlated, rather than from the deep learning model itself.
To address these difficulties, two common semi-supervised frameworks (generative models and graph-based models) have been applied [32,33]. For instance, Wan et al. [32] constructed a graph neural network (GNN) with multiscale superpixels to reduce the computational complexity accompanying HSIs. However, the local spatial regions built with superpixel methods cannot reconstruct the pixel-wise class boundaries. In addition, the unlabeled data are utilized only within homogeneous areas. If the patched neighboring HSI cuboid is small, the number of valuable unlabeled pixels is limited. Recently, generative adversarial networks (GANs) for HSI classification have attracted extensive attention with regard to small samples [33][34][35][36][37][38][39]. Specifically, Zhu et al. [36] built GANs with CNNs for HSI classification. Zhong et al. [37] combined graph models with a semi-supervised GAN to alleviate the challenge of limited labeled data, and refined HSI boundaries. However, this method applied two models for the solution, and is therefore not a true end-to-end framework. Moreover, Wang et al. [40] proposed an adaptive dropblock and applied it to GANs to extract effective HSI pixels. Nevertheless, the model only took the first three principal component analysis channels of HSIs, and could not take full advantage of the spectral information.
Several reports have pointed out that high-quality generated samples are key to improving the discriminator for HSI classification. However, the high-dimensional HSI spectral bands present a variety of non-linear characteristics in a highly spatial distribution, and thus it is difficult to reconstruct real hyperspectral cuboids through spectral-spatial engineering. Radford et al. [41] suggested applying transposed convolutions and convolutions without pooling layers, instead of fully connected layers, to construct generators and discriminators in GANs. Most GAN-based methods follow this principle, such as HS-GAN [35] and MS-GAN [42]. Although transposed convolutions generate shift-invariant local spectral-spatial information, their parameter sharing makes them unable to highlight sensitive characteristics during the training process. Furthermore, regularization methods cause two land-cover categories with similar spectral distributions to be regarded as a single category. Thus, it is necessary to select appropriate spectral and spatial features during sample generation.
Inspired by [27] and [40], in this study, we build a semi-supervised GAN via an adaptive weighting feature fusion approach (AWF2-GAN) for HSI classification. Considering the limitation of labeled data, the discriminator is extended to a double-branch network to extract global spectral signatures and local spatial contexts. The generator contains fully connected layers that are trained to match the real HSI pixels. Moreover, we propose an adaptive weighting feature fusion strategy that takes the constraint conditions of spectral-spatial combinations in the discriminator into account; this is similar to employing an attention mechanism. By giving different fusion weights to each pixel, the fused spectral-spatial feature can be better expressed. Our fusion module can be configured with one of four options: adding or concatenating spectral and spatial feature maps with similar or adaptive weights. Finally, we take the center loss and mean minimization loss into account to stabilize GAN training. The main contributions of the paper are summarized as follows.
(1) A novel GAN-based framework for HSI classification named AWF2-GAN is proposed; it considers an adaptive spectral-spatial combination pattern in the discriminator, and improves the efficiency of discriminative spectral-spatial feature extraction.
(2) To explore the interdependence of spectral bands and neighboring pixels, the adaptive weighting feature fusion module provides four sets of weighting filter banks to improve performance.
(3) To alleviate the mode collapse of AWF2-GANs, we jointly optimize the framework by considering both center loss and mean minimization loss.
The remainder of this paper is organized as follows. Section 2 introduces the GANs and the center loss. Section 3 presents the details of the proposed AWF2-GAN classification framework and adaptive weighting feature fusion module. Section 4 evaluates the performance of our method compared with those of other HSI methods. Section 5 provides a discussion of the results, and a conclusion is presented in Section 6.

Generative Adversarial Networks
Recently, GANs have attracted significant attention from the visual imaging community for tasks such as image generation and translation. This stems from the fact that GANs provide superior representation capacity to reconstruct the real data distribution implicitly [43]. A GAN contains two subnets, the generator G and the discriminator D. G attempts to learn the latent mapping from the implicit distribution of the input, and synthesizes data subject to this mapping. D judges whether the input is from the real distribution or the fake one. Generally, G takes random noise vectors z as input and transforms them into synthetic images $X_{fake} = G(z)$. D takes the real images or the output of G as input, and outputs the probability distribution P over the possible sources. D and G are trained competitively: D maximizes the log-likelihood that it assigns to the correct source, whereas G is trained to minimize it. This is expressed as follows:

Under alternating optimization, the GAN is trained to balance D and G, and is guided toward the Nash equilibrium. Specifically, we freeze the weights of D and optimize G through the minimization of Equation (1). Then we fix G and optimize D through the maximization of Equation (2). Each iteration forms an adversarial training round in which the discriminator D and the generator G improve each other. When the Nash equilibrium is reached, G has captured the real data distribution and D has an enhanced capacity to distinguish real from fake data and to identify the category labels.
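For orientation, the adversarial game described above is commonly written in the standard minimax form of the original GAN formulation. The block below gives that textbook form as a sketch only. It is an assumption that the paper's Equations (1) and (2) correspond to the generator side and the discriminator side of this objective; the symbols $p_{data}$ and $p_z$ (the real-data and noise distributions) are introduced here for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{data}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

With D frozen, only the second term depends on G, so the generator step minimizes $\mathbb{E}_{z}[\log(1 - D(G(z)))]$; with G fixed, the discriminator step maximizes the full expression.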
Due to the output restrictions of the discriminator, standard GANs are not suitable for multilabel image classification. Odena et al. [44] proposed an auxiliary classifier GAN (ACGAN) to overcome this limitation, and achieved accurate prediction for HSIs. The architecture of the network is shown in Figure 1. In the ACGAN, the generator G receives as input the embedding vectors from random noise z together with their assigned true labels Y. The synthetic data generated by G and the real data with their corresponding annotations are fed into D. D then predicts the category C of the input with a softmax classifier, and outputs the probability that the input is real rather than fake. Therefore, the loss function of the ACGAN consists of the log-likelihood of the correct source, $L_S$, and the log-likelihood of the correct label, $L_C$, which can be calculated as follows:

where [0] and [1] are the inputs of D derived from the real data and the fake data, respectively. During the training of the ACGAN, D is optimized to maximize $L_C + L_S$, and G is optimized to maximize $L_C - L_S$. Finally, G provides samples of the desired category and D accurately predicts the classification map.
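The two ACGAN terms can be illustrated numerically. The following is a minimal NumPy sketch that follows the usual ACGAN definition (mean log-likelihood of the correct source and of the correct class over real and generated batches); the function name `acgan_losses`, its argument layout, and the small `eps` guard are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def acgan_losses(p_real_src, p_fake_src, p_real_cls, p_fake_cls, y_real, y_fake):
    """Return (L_S, L_C) as mean log-likelihoods (hypothetical helper).

    p_real_src, p_fake_src: D's probability that each real / generated
        sample comes from the real source, shape (batch,).
    p_real_cls, p_fake_cls: softmax class probabilities, shape (batch, n_classes).
    y_real, y_fake: integer class labels, shape (batch,).
    """
    eps = 1e-12  # guards against log(0); an implementation detail, not from the paper
    # Source term: real samples should be judged real, generated ones fake.
    l_s = np.mean(np.log(p_real_src + eps)) + np.mean(np.log(1.0 - p_fake_src + eps))
    # Class term: probability assigned to the true label of each sample.
    real_hit = p_real_cls[np.arange(len(y_real)), y_real]
    fake_hit = p_fake_cls[np.arange(len(y_fake)), y_fake]
    l_c = np.mean(np.log(real_hit + eps)) + np.mean(np.log(fake_hit + eps))
    return l_s, l_c
```

Under this convention the discriminator step would ascend `l_c + l_s` and the generator step would ascend `l_c - l_s`, matching the optimization targets stated in the text.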

Center Loss for Local Spatial Context
In contrast to deep learning methods that learn a set of spectral-spatial filter banks to measure a similarity loss function, methods that use the center loss function [45] pay more attention to local spatial feature cohesion and intra-class consistency. Cai et al. [46] first introduced the center loss into attention residual networks for HSI classification. They minimized the intra-class variation while keeping features of different classes separate. Therefore, the similarity measurement of the class centers of deep features can be determined in terms of their corresponding annotations:

$$L_{Cen} = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2$$

where m is the size of the mini-batch, $x_i$ denotes the ith deep feature, and $c_{y_i} \in \mathbb{R}^{d}$ indicates the $y_i$th class center of deep features. Then, the gradients of $L_{Cen}$ and the update equation of $c_{y_i}$ take the following form:

$$\frac{\partial L_{Cen}}{\partial x_i} = x_i - c_{y_i}, \qquad \Delta c_j = \frac{\sum_{i=1}^{m} \eta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \eta(y_i = j)}$$

where η(·) denotes the execution condition, which is the updating constraint of $c_{y_i}$, and j is the class label currently being updated. If η returns 1, the condition is satisfied; if it returns 0, the updating of $c_{y_i}$ is stopped.
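As a concrete illustration of the center loss and its center update, the sketch below uses the standard formulation from the center-loss literature: half the summed squared distance between each feature and its class center, with each center moved by the averaged difference over the samples of that class. The function names and the step size `alpha` are assumptions for illustration; the paper does not state its update rate.

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over a mini-batch.

    features: (m, d) deep features; labels: (m,) ints; centers: (n_classes, d).
    """
    diff = features - centers[labels]
    return 0.5 * np.sum(diff ** 2)

def update_centers(features, labels, centers, alpha=0.5):
    """One center update step; alpha is a hypothetical center learning rate."""
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        mask = labels == j  # plays the role of eta(y_i = j)
        if not mask.any():
            continue  # no sample of class j in this batch: center is left unchanged
        delta = np.sum(centers[j] - features[mask], axis=0) / (1.0 + mask.sum())
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Classes absent from the mini-batch are skipped, which is the behavior the execution condition η encodes.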

The Proposed AWF2-GAN Framework
To address the challenges associated with applying deep learning to HSI classification, we propose a semi-supervised AWF2-GAN framework. As shown in Figure 2, the discriminator D of the AWF2-GAN comprises three parts: spectral filter banks, spectral-spatial filter banks, and an adaptive weighting fusion module. The generator G is built with fully connected filter banks that explore the real distribution and reconstruct the synthetic cuboids implicitly.
Here, we use the Pavia University dataset to illustrate the framework. We suppose an HSI cuboid X contains n pixels, $X \in \mathbb{R}^{l_x \times n}$ ($l_x$ denotes the number of spectral bands of each pixel). Then, the neighboring cuboids are extracted from X in the form of image patches: the labeled group $X^{1} = \{x^{1}_{i}\} \in \mathbb{R}^{l_x \times s \times s \times n_l}$ and the unlabeled group $X^{2} = \{x^{2}_{i}\} \in \mathbb{R}^{l_x \times s \times s \times n_u}$ are the real inputs of D, where s, $n_l$, and $n_u$ represent the spatial size of the neighboring HSI cuboids, the number of labeled samples, and the number of unlabeled samples, respectively. Since each pixel belongs to one of the HSI cuboid groups in the set $\{X^{1}_{i}, X^{2}_{i}\}$, $n = n_l + n_u$. Meanwhile, the synthetic group $Z = \{Z_i\}$ is the fake input of D for training, and is acquired from the generator G by feeding labels y and random noise vectors z into the fully connected filter banks. D is a dual-branch fusion network that employs the spectral filter banks to learn the advanced features of spectra, and utilizes the spectral-spatial filter banks for the local spatial semantic representation. It is worth mentioning that the HSI cuboid X has W × H pixels, and retains $l_x$ redundant spectral bands. In contrast to image-based classification frameworks, patch-based methods select training samples together with their neighboring pixels to explore effective local spatial characteristics. Therefore, constructing 3D spectral-spatial cuboids with s × s image patches increases the potential of models to generalize. Even with a small number of HSI samples, such a sampling strategy still retains sufficient trainable parameters. In the following, we utilize 9 × 9 × $l_x$ neighboring cuboids as inputs in each filter bank, and take tensor volumes to represent outputs, embeddings, and variables in each layer of the AWF2-GAN.
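The patch-based sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a channels-last image of shape (H, W, l_x) and reflect padding at the image border; the paper does not state how border pixels are handled, and the function name `extract_cuboids` is hypothetical.

```python
import numpy as np

def extract_cuboids(image, coords, s=9):
    """Cut s x s x l_x neighboring cuboids centered on the given pixels.

    image: (H, W, l_x) hyperspectral cube (channels-last, an assumption).
    coords: iterable of (row, col) center pixels.
    Returns an array of shape (n, s, s, l_x).
    """
    r = s // 2
    # Reflect padding lets border pixels have a full neighborhood (assumed policy).
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, the window starting at (i, j) is centered on original pixel (i, j).
    return np.stack([padded[i:i + s, j:j + s, :] for i, j in coords])
```

Labeled and unlabeled groups would be built by calling the function once with the labeled pixel coordinates and once with the unlabeled ones.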
The extracted spectral signatures $X_{spc}$ and spatial contexts $X_{spa}$ are sent into an adaptive weighting feature fusion module, which outputs discriminative fusion features. This lightweight module involves feature summing and feature concatenation, and offers four flexible options for feature fusion.
Finally, the combined feature is passed through the softmax operation, and the classification map $\hat{Y} = \{\hat{y}_i\}$ is predicted.

Adaptive Weighting Feature Fusion Module
In the proposed AWF2-GAN, the generator G and the discriminator D are composed of different sets of filter banks. In D, spectral and spatial filter banks learn the spectral distribution and spatial correlation of the central pixels in neighboring cuboids of HSIs. In this section, we describe how the extracted spectral and spatial features are fused. We begin with the basic fusion methods, i.e., element-wise addition and channel concatenation, to check their effectiveness. Furthermore, we propose a novel weighting fusion method with an adaptive feature weighting strategy, which learns the relationship between the spectral and spatial features with a lightweight neural network.

Basic Feature Fusion Modules
Baseline fusion models are employed to combine the spectral and spatial features of HSIs via feature addition and channel concatenation. The straightforward methods generally perform these operations directly on the outputs $X_{spc}$ and $X_{spa}$ of the feature extractors. However, if $X_{spc}$ and $X_{spa}$ stem from complex feature engineering, their tensor dimensions are not aligned, and therefore they cannot be fused directly.
Instead, the simple baseline fusion models are built for tensor alignment, and transform the inputs $X_{spc}$ and $X_{spa}$ into output features $G_{spc}$ and $G_{spa}$, respectively. To align the feature dimensions, the intermediate layer between the inputs and outputs contains a Conv2D-BN-ReLU block. Specifically, we adopt a convolutional operation (Conv2D) with the same kernel size of 1 × 1 to align the feature tensors of both $X_{spc}$ and $X_{spa}$. Batch normalization [47] and a rectified linear unit (BN-ReLU) follow every convolutional layer of each branch. Then, we acquire the normalized results $G_{spc}$ and $G_{spa}$, and use these to standardize the training process. Finally, the coarse fusion feature $F_{ss}$ is formed from $G_{spc}$ and $G_{spa}$ via element-wise addition or concatenation along the channel dimension. Figure 3a,b illustrates the architectures of these two basic fusion models.
The construction of the baseline fusion strategy is motivated by several points. The Conv2D-BN-ReLU block consists of three parts. Conv2D employs a 2D kernel of size 1 × 1 to construct feature maps through element-wise addition across channels, which enhances the correlation between different channels. Furthermore, the block introduces orders of magnitude fewer parameters, and Conv2D shares its weighting coefficients within each feature map. This will not greatly impact feature dimension reduction. BN and ReLU are important for normalization because they standardize the output to avoid either feature becoming dominant, thus encouraging contributions from both. Furthermore, BN and ReLU can accelerate the backpropagation of the gradients so as to improve generalization on testing sets.
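The two baseline fusion variants can be sketched as follows. This is a simplified NumPy illustration rather than the authors' implementation: the 1 × 1 convolution is written as a per-pixel linear map over channels, batch normalization is applied without learned scale and shift, and the names `conv1x1_bn_relu` and `basic_fusion` are hypothetical.

```python
import numpy as np

def conv1x1_bn_relu(x, weight, eps=1e-5):
    """Conv2D(1x1) -> BN -> ReLU on a channels-last tensor.

    x: (batch, s, s, c_in); weight: (c_in, c_out).
    A 1x1 convolution is a linear map applied independently at every pixel.
    """
    y = x @ weight
    mean = y.mean(axis=(0, 1, 2), keepdims=True)
    var = y.var(axis=(0, 1, 2), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)  # BN without learned scale/shift (simplification)
    return np.maximum(y, 0.0)

def basic_fusion(x_spc, x_spa, w_spc, w_spa, mode="add"):
    """Align both branches to a common channel count, then add or concatenate."""
    g_spc = conv1x1_bn_relu(x_spc, w_spc)
    g_spa = conv1x1_bn_relu(x_spa, w_spa)
    if mode == "add":
        return g_spc + g_spa
    if mode == "concat":
        return np.concatenate([g_spc, g_spa], axis=-1)
    raise ValueError("mode must be 'add' or 'concat'")
```

Because both branches are projected to the same number of channels before fusion, the two inputs may start with different channel counts, which is the alignment problem the block is meant to solve.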

Fusion Models with Adaptive Feature Weighting
The aforementioned basic fusion models treat both features as having the same tensor shape, and assume that spectral and spatial features contribute equally to the fusion feature. In practice, the contribution of different features to each pixel in neighboring cuboids differs. This may be due to the semantic information or spectral content in highly textured areas or homogeneous regions. In particular, the proportion of spectral signatures depends on the abundance of surface materials, and that of spatial contexts stems from the material composition within the local neighborhood. Considering this, we propose an adaptive weighting mechanism to measure the importance of each feature to neighboring pixels. The novelty of this mechanism is that the weights of different features are predictable and assignable in all cases.
(1) Feature Adaptive Weighting: Our adaptive feature weighting strategy is inspired by the multiview fusion mechanism for 3D object monitoring and image interpretation. In this paper, we apply the same principle to processing the spectral and spatial features, and consider hyperspectral interpretation from the 3D perspective. Figure 4 presents the adaptive feature weighting built with multibranch neural networks. In each branch, each layer/operator contains a fully-connected (FC)-ReLU block, which maps the distributed feature representation from the coarse fusion to the label space of samples. This initializes the weights of each feature in each pixel. For different branch networks, the feature map of each element contains a different weight matrix, that is, the parameter variation is provided by FC operators that are not shared. We suppose that $T_f$ and $b_f$ denote the transform (weight) matrix and the bias of the non-linear layer that directly takes the coarse fusion feature tensors $F_{ss}$ as the input. Subsequently, $F_{ss}$ is transformed into projection vectors v through the FC-ReLU blocks of different branches. In this way, the output of the (k + 1)th layer in the ith branch can be represented as

where R(·) indicates a non-linear activation function (which is chosen as the ReLU function in this paper). We adopt two measures to integrate the feature tensors from the n branches, as shown in Figure 4. Therefore, the weighted feature $V_{comb}$ after branch fusion is formulated by a weighted sum as

or by channel concatenation as

where "," denotes channel concatenation. Finally, the adaptive probabilistic predictive feature transformed with the softmax activation function is as follows

Thus, each element of the fusion feature maps has a unique weight.
(2) Adaptive Feature Allocation: The fusion architectures that adopt adaptive feature weighting are described in Figure 4. Through these designs, with the predictive feature given by Equation (11), the hybrid predictive matrix can be allocated to the branch fusion features by multiplication to refine the fusion weights as follows:

Whether through weighted summation or channel concatenation, each feature of each element is assigned a unique score. Thus, whether in a highly textured neighborhood or in hyperspectrally homogeneous regions, the proposed adaptive feature weighting mechanism enhances discriminability, and thus improves the capacity for HSI interpretation.
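The weighting and allocation steps above can be sketched as follows. Since the exact equations are not reproduced here, several details are assumptions made loudly for illustration: each branch is a single unshared FC-ReLU layer, the "sum" variant is an unweighted sum of branch outputs, the softmax is taken over the channel axis of each pixel, and allocation is an element-wise product. The names `softmax` and `adaptive_weighting` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_weighting(f_ss, branch_params, mode="add"):
    """Predict per-element weights from the coarse fusion feature and apply them.

    f_ss: (batch, s, s, c) coarse fusion feature.
    branch_params: list of (T, b) pairs, one unshared FC-ReLU layer per branch
        (a single layer per branch is an assumption).
    Returns (weighted feature, weight tensor).
    """
    branches = [np.maximum(f_ss @ T + b, 0.0) for T, b in branch_params]
    if mode == "add":
        v_comb = np.sum(branches, axis=0)
    elif mode == "concat":
        v_comb = np.concatenate(branches, axis=-1)
    else:
        raise ValueError("mode must be 'add' or 'concat'")
    a_ff = softmax(v_comb, axis=-1)  # assumed: weights normalized over channels per pixel
    return a_ff * v_comb, a_ff       # assumed: allocation by element-wise product
```

Because the branch weights are not shared, each branch can respond differently to the same coarse feature, which is what lets the predicted weights vary per element.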

Details of the AWF2-GAN Architecture
In deep learning methods, the main means to significantly improve HSI classification is to extract discriminative spectral-spatial characteristics. Thus, learning a set of efficient spectral-spatial filter banks is common in HSI feature extraction. Nevertheless, HSIs consist of complex hyperspectral channels and limited labeled training data, which constrains the representation of spectral-spatial filter banks. Furthermore, Ref. [27] pointed out that the effects of hyperspectral contents are clearly distinct from those of spatial contexts in highly textured local neighborhoods or homogeneous regions.
We learned a set of global spectral filter banks and neighboring spectral-spatial filter banks for HSI feature extraction, and built them into the discriminator D of the AWF2-GAN. In addition, a generator G was employed in the AWF2-GAN to synthesize fake data. This added interference factors to D to improve its robustness. Furthermore, the fake data and the real data were passed to D to balance the training data through our AWF2-GAN. The architecture of D and G in the AWF2-GAN is illustrated in Figure 5. D takes the raw 3D neighboring cubes of HSIs as input data without feature preprocessing. The neighboring cubes are randomly sampled and are centered at pixels in the labeled group $X^{1}$. D contains two branches, which treat the input cubes independently of each other for feature extraction. The input of G consists of random noise vectors with corresponding labels y, from which G generates synthetic data cubes Z.
Figure 5. Adaptive weighting feature fusion discriminator (upper), consisting of a dense spectrum feature extractor and a spatially separable feature extractor. Their resulting features are fed into an adaptive weighting fusion model, which outputs a vector that indicates whether the data is fake or real and contains categorical probabilities. A generator (lower) contains consecutive spatial and spectral feature generation blocks to generate the synthetic HSI cuboid Z.

Adaptive Weighting Feature Fusion Discriminator
We built two extractors to consider the composition of spectral and spatial filter banks. One is the dense spectrum feature extractor, and the other is the spatially separable feature extractor. They are trained on 3D HSI cuboids with dimensions of 9 × 9 × $l_b$ (where $l_b$ is the number of spectral bands; we take 103 from the Pavia University dataset for illustrative purposes). The spectral signatures and spatial contexts are obtained from the above two extractors, and we feed them into the adaptive weighting fusion module to predict fusion features. Then we map the fusion features to the sample labels of each neighboring cuboid (or central pixel) through the FC layer with a softmax activation function. The feature extractors are described below.
(1) Dense Spectrum Feature Extractor: The main task of the spectral extractor is to capture the global spectral differences of the input neighboring cuboids, and purify the salient features of the central pixel within redundant spectral bands. Here, the obtained feature considers the overall bands of hyperspectral pixels centered at neighboring cuboids. Details of dense spectrum feature extraction are depicted above the discriminator in Figure 5.
To obtain differences between spectra, the dense spectrum feature extractor is designed as a DenseNet. We suppose that the network includes l layers/operators, each of which is calculated as an FC-ReLU block with weight matrix H and bias vector b. Then the input feature of the lth layer is a dense connection mapping from the 0th to the (l − 1)th layer. Thus, the lth operation of the extractor, $\psi(X^{l}; \xi)$, and the output feature $X_{spc}$ take the form

This dense feature extractor combines the advantages of spectral channels to maintain an abundance of feature maps. In addition, each FC-ReLU block contains various neural units to delay the gradient descent, which results in a deeper monitoring effect.
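A densely connected stack of FC-ReLU blocks can be sketched as below. The sketch assumes DenseNet-style connectivity, where each layer receives the channel-wise concatenation of the input and all earlier layer outputs; the exact form of the paper's dense mapping is not reproduced here, so this is one plausible reading. He normal initialization with zero biases follows the initialization the paper states for its fully connected layers; the function names and the layer widths used in any call are hypothetical.

```python
import numpy as np

def init_dense_params(l_x, widths, rng):
    """He normal weights and zero biases for a densely connected FC stack."""
    params, in_dim = [], l_x
    for w in widths:
        std = np.sqrt(2.0 / in_dim)  # He normal: variance 2 / fan_in
        params.append((rng.normal(0.0, std, size=(in_dim, w)), np.zeros(w)))
        in_dim += w  # the next layer also sees this layer's output
    return params

def dense_fc_extractor(x, params):
    """x: (batch, s, s, l_x). Each FC-ReLU block acts on the spectral axis.

    Layer l takes the concatenation of x and the outputs of layers 0..l-1
    (assumed DenseNet-style connectivity).
    """
    feats = [x]
    for H, b in params:
        inp = np.concatenate(feats, axis=-1)
        feats.append(np.maximum(inp @ H + b, 0.0))
    return feats[-1]  # output of the last block, playing the role of X_spc
```

With four widths decreasing to 128, the final output for a 9 × 9 cuboid would have shape 9 × 9 × 128, consistent with the dimensions the paper reports for the spectral feature cuboid.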
(2) Spatially Separable Feature Extractor: We also explore the local spatial representation, i.e., we encode the spatial relationship into the input neighboring HSI cuboids. It can be regarded as an auxiliary discriminative factor to the global dense spectral feature. The spatially separable feature extractor has two core components: an adaptive dropblock, and a set of continuous separable feature convolutional filter banks (Sep-Conv blocks). The architecture of this extractor is illustrated below the discriminator in Figure 5.
Specifically, the adaptive dropblock is utilized to regularize the neighboring texture information. This has been verified in [40] to alleviate overparameterization and standardize neurons.
Meanwhile, Sep-Conv considers the local spatial similarity between spectral channels, and begins with a depthwise convolution with kernels of size k × k. The number of output channels M of the kernels is consistent with the number of input image channels N. Then, pointwise convolutional filter banks of size 1 × 1 are applied to integrate the local spatial semantics. Again, we combine these semantics with a dense connection. If the Sep-Conv blocks have L layers with filter banks h and biases b, the spatial feature extractor architecture $\phi(X^{L-1}; \delta^{L})$ can be represented as follows:

where * is the convolution operation, and δ denotes the trainable parameters from the various separable convolution operations. δ can be calculated as

The specific configurations of D are provided in Table 1. In order to guarantee the generalization of spectral features and avoid the curse of dimensionality, four FC-ReLU blocks are employed as subcomponents. All the fully connected layers are initialized with the He normal initialization method [48] and zero bias vectors. Furthermore, we reduce the number of neurons of each subcomponent from 1024 to 128 in turn, gather significant features from the deep spectral mapping, and output spectral feature cuboids with dimensions of 9 × 9 × 128. It is worth mentioning that this shallow structure is sufficient for the neighboring cuboid sizes utilized in our experiments, which are no larger than 9 × 9. An ablation study to assess the classification performance with respect to the subcomponent architecture is given in Section 5.2.

Table 2 lists the generic layer specifications of the spatially separable feature extractor. All the convolution layers in the network use a kernel size of 3 × 3, a stride of 1, and a padding mode of "same," and obtain shift-invariant features from patches. ReLU, as the activation function, is executed at each Sep-Conv block.
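A single Sep-Conv block (depthwise k × k convolution, then 1 × 1 pointwise convolution, then ReLU) can be sketched in NumPy as below. This is an illustrative sketch with stride 1 and zero "same" padding as stated for the network; the dense connections between blocks and the adaptive dropblock are omitted, the operation is implemented as cross-correlation (the usual deep learning convention), and the function names are hypothetical.

```python
import numpy as np

def depthwise_conv_same(x, kernels):
    """Depthwise convolution with stride 1 and zero 'same' padding.

    x: (s, s, c) input patch; kernels: (k, k, c), one k x k filter per channel,
    so the number of output channels equals the number of input channels.
    """
    k = kernels.shape[0]
    r = k // 2
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))  # zero padding
    out = np.zeros(x.shape, dtype=float)
    for di in range(k):
        for dj in range(k):
            # Shifted view of the input times the matching kernel tap, per channel.
            out += padded[di:di + x.shape[0], dj:dj + x.shape[1], :] * kernels[di, dj, :]
    return out

def sep_conv_block(x, depth_kernels, point_weights, bias):
    """Depthwise k x k conv, then 1 x 1 pointwise conv mixing channels, then ReLU."""
    y = depthwise_conv_same(x, depth_kernels)
    return np.maximum(y @ point_weights + bias, 0.0)
```

For c input channels and c_out output channels, the block uses k·k·c + c·c_out weights instead of the k·k·c·c_out weights of a standard convolution, which is the source of its parameter savings.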
Although deep feature mapping can be achieved by increasing the number of kernels, we found that more kernels do not necessarily improve the classification accuracy. A detailed ablation study on the influence of the number of kernels in Sep-Conv blocks on accuracy is provided in Section 5.2. We also discuss the selection of optimal depths of Sep-Conv blocks in Section 5.3.
Table 2. Specific configurations of the proposed discriminator D of the AWF2-GAN.

of feature weighting schemes are used to generate the adaptive weight matrix $A_{ff}$. The resulting prediction $\hat{y}^1_i[1 : n_y]$ is passed through the softmax layer, which outputs a vector of the probabilities that an HSI cuboid belongs to each of the $n_y$ categories. Again, $\hat{y}^1_i[0]$ indicates the genuineness of the training cuboid. To meet the input conditions of softmax, average pooling and flatten operations are generally executed on the adaptive weight matrix $A_{ff}$.
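The pooling, flattening, and softmax steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's layer definition: the dense projection (`weights`, `biases`) between the pooled features and the 1 + n_y outputs, and the sigmoid used for the genuineness entry, are hypothetical choices made only so that the example runs end to end.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x C nested-list feature map over its spatial positions."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_head(fused, weights, biases):
    """Return [genuineness, p(class 1), ..., p(class n_y)] for one fused feature map.

    weights: (1 + n_y) x C matrix, biases: length 1 + n_y (both hypothetical).
    """
    pooled = global_average_pool(fused)  # pooling + flatten gives a length-C vector
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    genuineness = 1.0 / (1.0 + math.exp(-logits[0]))  # assumption: sigmoid on entry 0
    return [genuineness] + softmax(logits[1:])        # softmax over the n_y classes


# Toy usage: a 2 x 2 x 2 fused map, n_y = 3 classes, all-zero projection.
fused = [[[1.0, 3.0], [1.0, 3.0]], [[1.0, 3.0], [1.0, 3.0]]]
prediction = discriminator_head(fused, [[0.0, 0.0]] * 4, [0.0] * 4)
```

Entry 0 of `prediction` plays the role of the genuineness score, and entries 1 to n_y sum to one.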

Generator with Fully Connected Components
In GANs, generated samples are employed to improve the classification performance of the discriminator. Ref. [37] suggested that a bad generator can serve the discriminator as a regularizer that enhances the latent representation of hyperspectral cuboids, which contradicts what was proposed in [39]. In fact, the structure of the generator has different physical meanings at various stages of GAN training. Furthermore, rough noise hinders sample generation and results in mode collapse.
To this end, we propose spatial and spectral feature generation blocks built from groups of FC-ReLU blocks, as shown in Figure 5. The reason for this choice is that a fully connected layer provides more trainable parameters and allocates a unique trainable weight to each element of the feature map. During the first few iterations (nb_epoch), G acts as a regularizer of D, forcing D to approach the real distribution of HSI cuboids. As the AWF2-GAN is optimized, the fitting ability of the FC-ReLU blocks provides more promising results for G. Data generation is divided into two steps: spatial feature generation and spectral feature generation. The specific configuration of each layer of the AWF2-GAN is given in Table 1.
We suppose a set of Gaussian noise vectors z with their corresponding labels y. First, z is sent into the spatial feature generator to generate a feature tensor with a spatial neighborhood. Next, we use three FC-ReLU blocks to build the spectral feature generator, which gradually approaches the real sample distribution in different neurons and generates the pseudo hyperspectral cube Z.
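The two-step generation described above can be sketched with plain fully connected layers. The sketch below is only illustrative: the layer widths, the single-block spatial stage, and the toy 3 × 3 × 4 output are assumptions chosen to keep the example small (the experiments use 9 × 9 patches), and label conditioning on y is omitted. Weights use He normal initialization with zero biases, the scheme the paper states for its fully connected layers.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """He normal initialization: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def fc_relu(x, weights, biases):
    """One FC-ReLU block: affine map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def build_fc_stack(layer_sizes, rng):
    """Create (weights, biases) pairs for consecutive FC-ReLU blocks."""
    return [(he_normal(n_in, n_out, rng), [0.0] * n_out)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def generate_cuboid(z, spatial_layers, spectral_layers, height, width, bands):
    """Map a noise vector to a pseudo cuboid of shape height x width x bands."""
    x = z
    for weights, biases in spatial_layers + spectral_layers:
        x = fc_relu(x, weights, biases)
    if len(x) != height * width * bands:
        raise ValueError("last layer width must equal height * width * bands")
    # Reshape the flat output into nested lists indexed as [row][col][band].
    return [[x[(i * width + j) * bands:(i * width + j + 1) * bands]
             for j in range(width)]
            for i in range(height)]


rng = random.Random(0)
noise_dim, height, width, bands = 8, 3, 3, 4  # hypothetical toy sizes
# Spatial stage: one block (assumption); spectral stage: three blocks.
spatial_layers = build_fc_stack([noise_dim, height * width * 2], rng)
spectral_layers = build_fc_stack([height * width * 2, 16, 16, height * width * bands], rng)
z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
cube = generate_cuboid(z, spatial_layers, spectral_layers, height, width, bands)
```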

Training Loss Functions
The semi-supervised GAN focuses on addressing the limited number of labeled HSI samples. Thus, the unlabeled set and the synthetic set produced by generator G are important regularizers that help the discriminator D improve HSI classification. The objective loss function of the semi-supervised GAN takes the form
where $\Omega_D$ and $\Omega_G$ are the parameter sets of D and G, respectively. $L_{SUP}$, $L_{D1}$, and $L_{D2}$ are the supervised entry of D, the unsupervised entry of D, and the unsupervised entry of G, respectively. Given the labeled HSI cuboid $X^1 = x^1_i \in \mathbb{R}^{9 \times 9 \times 103}$ with its corresponding annotation $Y^1 = y^1_i \in \mathbb{R}^{1 \times (1+n_y)}$, the prediction of D can be formed as $\hat{Y}^1 = D(X^1; \Omega_D)$. Therefore, each entry of $L_{SEMI}$ is formulated as
where $\Omega_D = \{\xi_D, \delta_D\}$ indicates the parameters of D, which can be updated according to Equations (14) and (17). $\Omega_G = \{\xi_G\}$ denotes the parameters of G. Because G consists of FC-ReLU blocks, the parameters in $\Omega_G$ can be updated by Equation (14). $\hat{Y}^1[0]$ is the authenticity of the pixel $x_i \in X^1$, and $\hat{Y}^1[1 : n]$ denotes the output vector of softmax, which gives the probability of each category that $y^1_i$ may belong to. To stabilize the training of GANs, we introduced the mean minimization loss into the unsupervised entry. This decreases the value and variance of the high-dimensional features from the second-to-last layer of D and inhibits overfitting. The mean minimization loss takes the form
where N is the total number of samples in a batch, $x_i$ is a training sample, and $f(x_i; \Omega)$ indicates the high-dimensional output of the network, which in this paper is the output before the fully connected layer. For the supervised item, to constrain the intra-class feature distribution, we employed the center loss to guide the features to shift toward their class centers. It can be applied as a regularization term before the softmax operation, and can be formulated as
where $F(x_i; y^1_i[1 : n_y])$ denotes the central distribution before the softmax operation.
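To make the two regularizers concrete, the sketch below computes a center loss and a batch feature-mean penalty on toy data. Both functions are assumptions rather than the paper's exact formulas, which are not reproduced here: `center_loss` follows the commonly used form (half the summed squared distance between each feature vector and the center of its class), and `mean_feature_penalty` is one plausible reading of the mean minimization loss (the batch average of the mean feature activation).

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance between each feature and its class center.

    features: list of N feature vectors taken before the softmax layer
    labels:   list of N integer class indices
    centers:  list of class-center vectors, indexed by class
    """
    total = 0.0
    for feature, label in zip(features, labels):
        total += sum((fv - cv) ** 2 for fv, cv in zip(feature, centers[label]))
    return 0.5 * total


def mean_feature_penalty(features):
    """Batch average of the mean activation of each high-dimensional feature vector.

    Assumed form of a mean-minimization term; minimizing it pushes the
    magnitude of the penultimate features down.
    """
    n = len(features)
    return sum(sum(feature) / len(feature) for feature in features) / n


# Toy batch: two samples, two classes, centers at the origin.
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
centers = [[0.0, 0.0], [0.0, 0.0]]
loss_center = center_loss(features, labels, centers)
loss_mean = mean_feature_penalty(features)
```

In a full training loop these terms would be added to the supervised and unsupervised entries with weighting coefficients, which are left out of this sketch.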
$L_{D1} + L_{D2}$ is also part of the GAN loss for training the generator of the AWF2-GAN, whose corresponding loss function $L_G$ can be formulated as
The training of the AWF2-GAN involves two alternating steps at each iteration, carried out with RMS or adjacent optimization schemes. First, the gradients of D, $-\nabla_{\Omega_D} L_{SEMI}$, are employed to update $\Omega_D$ to learn discriminative characteristics of HSIs. Second, the gradients $-\nabla_{\Omega_G} L_G$ are applied to update $\Omega_G$ to improve the adversarial training of the AWF2-GAN.

Experimental Results
In this section, two challenging hyperspectral datasets are adopted for classification with the proposed AWF2-GAN. To verify the effectiveness of the AWF2-GAN, several advanced GAN-based HSI algorithms, namely HS-GAN [35], 3D-GAN [36], SS-GAN [37], and AD-GAN [40], are employed for comparison. Furthermore, to demonstrate the feasibility of the spectral-spatial feature fusion architecture, we also compared AWF2-GANs with various fusion options: F2-Concat (with basic concatenation), F2-Add (with basic addition), AWF2-Concat (with adaptive weighting concatenation), and AWF2-Add (with adaptive weighting addition).

Experiment Setup
In this paper, two hyperspectral image datasets were employed to evaluate the performance of the proposed model: Indian Pines and Pavia University.
(1) Indian Pines (IN) contains a scene of the Indian Pines test site in northwestern Indiana, USA, acquired by the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor. It includes 16 categories, and the image is 145 × 145 pixels with a 20 m ground sample distance. Samples are shown in Figure 6. Since 20 bands were discarded due to atmospheric absorption, 200 spectral bands in the range of 400-2500 nm remain.
(2) Pavia University (UP) includes imagery of Northern Italy obtained by the ROSIS (Reflective Optics System Imaging Spectrometer) sensor. The image is 610 × 340 pixels with 9 urban land-cover classes and a spatial resolution of 1.3 m per pixel (see Figure 7). After discarding 12 noisy bands, the remaining 103 spectral bands in the range of 430-860 nm are employed for evaluation.
To demonstrate the effectiveness of the AWF2-GAN algorithm, we use five representative HSI classification methods for comparison. Owing to its high accuracy and robust performance in classical HSI classification, an SVM-based HSI classifier was adopted for the comparison. EMAP (Extended Multi-Attribute Profiles) with SVM [49] was employed for spectral-spatial classification, and considered four attributes: (1) a, the area of the regions; (2) d, the length of the diagonal of the box bounding the region; (3) i, the moment of inertia; (4) s, the standard deviation. These extended attribute profiles (EAPs) are obtained by applying thickening and thinning operations to extract spatial information on the first three components of HSI, which were computed by PCA, and retain 20 principal components. For the penalty λ and gamma γ parameters of the SVM, grid search and 10-fold cross-validation are employed for fine-tuning. In this experiment, the search ranges were exponentially growing sequences of λ and γ ($\lambda = 10^{-5}, 10^{-4}, \ldots, 10^{5}$; $\gamma = 10^{-5}, 10^{-4}, \ldots, 10^{5}$).
For fair comparison, all GAN-based methods used their optimal parameters. For HS-GAN, the kernel size was set to 1 × 3, and the number of training epochs was set to 200. For 3D-GAN, the convolutional kernel sizes were set according to the literature [36]. For SS-GAN, the spatial size of the 3D input was set to 9 × 9, and the other parameters followed the suggestions in the literature [37]. For AD-GAN, the adaptive dropblock was executed once in the discriminator.
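The SVM hyperparameter search described above can be sketched as a plain grid search over the two exponential sequences. In this sketch `toy_cv_score` is a hypothetical stand-in for the mean 10-fold cross-validation accuracy of an SVM trained with a given (λ, γ) pair; in practice that function would train and validate the classifier on the EMAP features.

```python
import itertools
import math


def exponential_grid(low_exp=-5, high_exp=5):
    """Return 10^low_exp, ..., 10^high_exp (11 values for the default range)."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def grid_search(score_fn, penalties, gammas):
    """Evaluate every (penalty, gamma) pair and keep the best-scoring one."""
    best_pair, best_score = None, -math.inf
    for lam, gamma in itertools.product(penalties, gammas):
        score = score_fn(lam, gamma)
        if score > best_score:
            best_pair, best_score = (lam, gamma), score
    return best_pair, best_score


def toy_cv_score(lam, gamma):
    """Hypothetical cross-validation score that peaks at lam = 1e2, gamma = 1e-3."""
    return -abs(math.log10(lam) - 2) - abs(math.log10(gamma) + 3)


values = exponential_grid()
best_pair, best_score = grid_search(toy_cv_score, values, values)
```

With 11 candidate values per parameter, the search evaluates 121 pairs, each of which costs ten model fits under 10-fold cross-validation.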
We also compared the proposed AWF2-GANs with the four feature fusion options. All the parameters of AWF2-GANs are initialized with the He normal initialization method [48]. For convergence speed and accuracy, the framework is optimized using an RMSProp optimizer with the hyperparameter γ = 0.9 and a learning rate of lr = 0.0005. To account for the mode collapse of GANs, lr is decayed by a factor of 0.9 every 50 steps, and the total number of iterations is nb_epoch = 200. The batch size is set to 16, with 9 × 9 input patches for local spatial contexts. All experiments are implemented with the TensorFlow deep learning framework and CUDA 9.0 on a machine with an Intel Xeon Gold 6154 CPU, 256 GB of RAM, and an NVIDIA TITAN V 12 GB GPU.
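The optimizer settings above can be sketched as a step-wise learning-rate schedule combined with the standard RMSProp update. Two points are assumptions: the decay is applied in a staircase fashion (once every 50 steps rather than continuously), and γ = 0.9 is taken to be the decay coefficient of the running average of squared gradients. The function names are hypothetical.

```python
import math


def decayed_lr(step, base_lr=0.0005, decay_rate=0.9, decay_steps=50):
    """Staircase decay: multiply the learning rate by decay_rate every decay_steps."""
    return base_lr * decay_rate ** (step // decay_steps)


def rmsprop_step(params, grads, cache, lr, gamma=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and the new running average."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = gamma * c + (1.0 - gamma) * g * g  # running average of squared gradients
        new_cache.append(c)
        new_params.append(p - lr * g / (math.sqrt(c) + eps))
    return new_params, new_cache


# Learning rates at the start of each 50-step stage over 200 steps.
schedule = [decayed_lr(step) for step in (0, 50, 100, 150)]

# A single update on one parameter with gradient 2.0.
params, cache = rmsprop_step([1.0], [2.0], [0.0], lr=decayed_lr(0))
```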

Experiments on the IN Dataset
The IN dataset contains a complex sample distribution in which the trainable labeled samples of the categories are unbalanced. Specifically, some categories have no more than 50 labeled samples, such as "Alfalfa," "Grass-pasture-mowed," and "Oats." Conversely, "Soybean-mintill" has more than 2000 labeled samples. Furthermore, some spectral bands of different classes are similar; for example, "Corn-mintill" and "Soybean-notill" have similar spectra in the range from band 100 to band 150. This is caused by the phenomenon of "foreign matter of the same spectrum in surface cover" [18], a viewpoint also reported in the literature [6,23,39]. Therefore, the IN dataset can be used to evaluate the stability of the AWF2-GANs. For fair comparison, all compared algorithms use their optimal parameters as suggested in the literature. Besides, to address the mode collapse of GAN-based methods, the Monte Carlo sampling strategy is employed to marginalize noise during training.
The first test was a quantitative experiment to evaluate the proposed model and other state-of-the-art methods. In this test, we randomly selected 525 labeled samples to constitute the labeled group $X^1$ mentioned in Section 3.1; this is a small set that accounts for only approximately 5% of the total labeled samples. If a class had fewer than 40 samples, only 3 of its samples were randomly selected for training; if a class had more than 40 but no more than 100 samples, only 5 were randomly selected. For the unlabeled group $X^2$, the sampling ratio was equal to that of $X^1$. To assess the performance of the various methods, we adopted three evaluation indices: the overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ). Table 3 reports the individual classification performance of the various methods, and Figure 8 presents the produced classification maps.
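The three evaluation indices can all be computed from a confusion matrix using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and κ corrects OA for the agreement expected by chance. The following sketch uses a small two-class matrix whose values are made up purely for illustration.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    overall_accuracy = correct / total
    # Per-class accuracy (recall); classes without test samples are skipped.
    per_class = [confusion[i][i] / row_sums[i] for i in range(n_classes) if row_sums[i] > 0]
    average_accuracy = sum(per_class) / len(per_class)
    # Chance agreement from the marginal distributions of truth and prediction.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall_accuracy - expected) / (1.0 - expected)
    return overall_accuracy, average_accuracy, kappa


# Toy two-class example (illustrative numbers only).
oa, aa, kappa = classification_metrics([[5, 1], [2, 2]])
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes such as "Oats" in the unbalanced IN dataset.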
Firstly, although the SVM showed considerable accuracy through EMAPs, which contain highly textured spatial information, it still had the worst performance. Secondly, GAN-based classifiers (e.g., HS-GAN, 3D-GAN, SS-GAN, and AD-GAN) provided higher classification accuracy than the SVM. This illustrates that GAN-based deep learning algorithms have a framework suitable for HSI classification. Thirdly, F2-GANs with basic fusion options (F2-Concat and F2-Add) achieved superior accuracies compared to 3D-GAN, indicating that spectral-spatial combinations can improve classification performance. Lastly, AWF2-GANs with adaptive feature weighting options (AWF2-Concat and AWF2-Add) achieved the best results, thereby demonstrating that well-designed spectral-spatial networks combining weighted fusion features are suitable for HSI classification. Furthermore, the proposed approach achieved perfect accuracy on the "Alfalfa," "Grass-pasture," "Grass-pasture-mowed," "Hay-windrowed," "Oats," and "Wheat" categories, three of which contain few labeled training samples. As shown in Figure 8, AWF2-GANs obtained more uniform regions for ground objects than the other algorithms. Moreover, AWF2-Add preserved the most accurate boundary of the "Soybean-mintill" class.
In the second test using the Indian Pines dataset, we verified the sensitivity of the models to different percentages of labeled training samples. We randomly selected 1%, 3%, 5%, 7%, and 10% of labeled samples per class. Table 4 details the overall accuracy of the classification methods with these training samples. As in our first experiment, the generalization performance of SVM (EMAPs) for HSI improved as the percentage of randomly selected training samples increased. However, the SVM displayed the slowest convergence of all methods. Secondly, GAN-based methods achieved higher classification accuracies than traditional methods.
Thirdly, F2-GANs had higher classification accuracies than previous GAN-based algorithms in all tested cases, which indicates that spectral-spatial combinations extracted more effective and discriminative features than spectral-spatial representations. Besides, AWF2-GANs with adaptive feature weighting options provided the best performance in comparison to the other methods in all tested cases, which indicates that the method is reliable and robust for HSI classification.

Experiments Using the UP Dataset
Images from Pavia University (UP) consist of evenly distributed surface materials of 9 urban land-cover classes with highly textured characteristics. In particular, the "Painted metal sheets" and "Gravel" classes contain abundant texture information, and it is difficult to predict the complete boundaries of these classes with traditional methods. Furthermore, in contrast to the IN dataset, UP images contain more uniform regions, such as those in the "Meadows," "Bare Soil," and "Bitumen" classes. Again, the phenomenon of a "different body with the same spectrum" is observed in UP images, for example between the "Meadows" and "Bare Soil" classes. Therefore, we compare the performance of the different algorithms on these high-resolution images. Firstly, 350 samples were randomly selected as training data, which is approximately 0.8% of the labeled pixels. Table 5 displays the quantitative experimental evaluations with four metrics: the individual accuracy, overall accuracy, average accuracy, and κ coefficient. Figure 9 shows the classification results.
Table 5 leads us to conclusions similar to those of our quantitative experiments on the IN dataset. AWF2-GANs with various feature fusion options all have a higher OA than the other methods. In addition, AWF2-Add has the highest individual classification accuracy for most classes. For instance, it achieved accurate classification for the "Asphalt," "Painted metal sheets," and "Bitumen" classes. Moreover, 3D-GAN, SS-GAN, AD-GAN, and AWF2-GANs provided more homogeneous classification maps than the SVM and HS-GAN. Conversely, the HS-GAN classification map contained a large number of noisy pixels compared to the other methods. This demonstrates that an HSI classifier can be significantly improved by taking advantage of spectral-spatial characteristics.
In contrast to the 3D-GAN, SS-GAN, and AD-GAN, the classification maps of AWF2-GANs with various feature fusion options had the clearest boundaries and the most uniform regions. This illustrates that spectral-spatial characteristics extracted from spectral-spatial combinations led to greater generalization efficiency than spectral-spatial filter banks.
Table 5. The overall accuracy (OA), average accuracy (AA), kappa coefficient (κ), and individual class accuracies for the UP dataset with 350 labeled and unlabeled samples for training. The best results are highlighted in bold typeface.

To further verify the robustness and practicability of the proposed AWF2-GANs, we employed different numbers of training samples from the Pavia University dataset. The labeled training set was generated by randomly selecting 0.1%, 0.2%, 0.4%, 0.8%, and 1% of samples per class, and the unlabeled training set was sampled at the same ratio as the labeled one. Table 6 presents the OA (%) matrix for each test case. The proposed approach achieved higher classification accuracies with a limited percentage of labeled samples than the other methods. For instance, AWF2-Add yielded a 97.31% overall accuracy when using only 0.4% of labeled training samples per class. In contrast to other spectral-spatial classifiers, adaptive weighting fusion features can capture discriminative spectral purity and spatial neighboring contexts, and greatly improve classification accuracy.
The training and testing times were investigated to assess the efficiency of the various classification methods. In particular, the SVM with EMAPs, as the classical algorithm for combining spectral-spatial features, was tuned with 10-fold cross-validation to identify the optimal parameters. For a fair comparison, we set the same maximum batch size of 16 in each GAN-based method. SS-GAN and AD-GAN, as patch-based classification methods, were trained on 9 × 9 patches centered at the neighboring cuboids of the chosen training pixels. For AWF2-GANs, each feature extractor was constructed with 128 output channels; the other parameters are the same as mentioned above. For this test, 5% of samples per class were randomly selected from the IN dataset, and 1% of samples per class were randomly selected from the UP dataset. If a class had fewer than 50 samples, 5 samples of that class were selected for training. Table 7 lists the training and testing times of the studied algorithms. The training times of AWF2-GANs were 5 to 7 times shorter than those of SS-GAN and AD-GAN.
GAN-based methods required more training time because adversarial learning needs to converge. Despite this, F2-GANs and AWF2-GANs took approximately 1 s per batch for the IN images. All remaining labeled samples were employed in the testing phase. The results revealed that testing with the SS-GANs took 2 to 3 times longer than with the AWF2-GANs. This could be because pixel-wise classification GANs with a set of spectral-spatial filter banks require more time when processing image patches. Although the adjacent patches contain a lot of redundant information and entail a large amount of computation, the branch extractors treat the spectral and spatial characteristics of HSIs separately. As a result, the spectral-spatial combination based on the branch fusion strategy reduces the computational complexity, and its testing phase is faster than that of spectral-spatial filter banks.

Kernel Setting and Units Selection for Feature Extraction
The effectiveness of the kernel setting and of the selection of the number of neurons was evaluated using AWF2-Add as the network backbone. For the kernel setting, the kernel size and the number of kernels are important factors impacting the extraction of local spatial features. To explore the discriminative spatially separable features along the channel dimension, the number of intermediate spatially separable convolutional kernels was varied between 16 and 128 (with three intermediate layers in the spatially separable feature extractor). For the sensitive neighboring area, we considered kernel sizes of 3 × 3, 5 × 5, and 7 × 7. Furthermore, spectral purity analysis allows us to verify the spectral feature utilization. To this end, seven combinations of the number of neurons used for the dense spectrum feature extractor were selected: (128,128,128
Figure 10 illustrates the overall accuracy with 150 training epochs and 500 samples on the IN dataset, and with 350 samples on the UP dataset. In the kernel setting phase, the overall accuracy peaks at 64 and 128 spatial convolutional kernels for the IN and UP datasets, respectively. In addition, the overall accuracy does not improve when the number of kernels is increased further. Figure 10b shows that the kernel size of 3 × 3 results in the best classification performance on both datasets. For the spectral purity analysis, it can be seen that the best numbers of neurons in the three intermediate layers are (1024, 512, 512). Since the neighboring HSI cuboid has a size of 9 × 9 and the proposed extractors are densely connected structures, this configuration is sufficient to capture the discriminative feature mapping of HSIs. Therefore, the generalization ability of AWF2-GANs can be effectively expressed even when using a smaller convolutional kernel size or fewer neural units during training. Furthermore, small kernel sizes could mitigate overfitting during training.

Depths of the Feature Extractors
The generalization ability and stability of HSI classification networks are also subject to the capacities of the spectral and spatial feature extractors, i.e., their subcomponent depths. In the feature extraction phase, the dense spectrum and the spatially separable feature extractors play important roles, with various FC-ReLU and Sep-Conv subcomponent blocks used to obtain valid features. For AWF2-GANs, the depths of both extractors were validated from 3 to 5 subcomponents on both datasets, as shown in Figure 11. To maintain the stability of the GAN framework, the depth of the generator was fixed to 4 FC-ReLU blocks for training. As illustrated in Figure 11, setting the depths to "4 & 4" led to the best results on both datasets. Furthermore, the overall accuracies obtained with shallower depths differed little. Therefore, the branch feature fusion strategy yields high robustness and practicability compared with the obvious spectral-spatial representation reviewed in [37].

Influence of Unlabeled Real HSI Cuboids for AWF2-GANs
To evaluate the influence of unlabeled real HSI cuboids, we tested the proposed AWF2-GANs with four feature fusion options using different numbers of unlabeled HSI samples. We used both the IN and UP datasets for this evaluation. We randomly selected 0, 300, 1000, and 5000 unlabeled samples, together with 300 labeled samples, for training. To verify the influence of adding unlabeled samples, we also took semi-supervised GANs such as HS-GAN and SS-GAN into account. Table 8 shows that adding too many real unlabeled HSI cuboids for training is less beneficial than adding a small number, and can even jeopardize HSI classification performance. This is caused by the differences in sample distribution between labeled and unlabeled HSI pixels. Furthermore, it can be seen that using 300 unlabeled samples, equal to the number of labeled samples, resulted in improved accuracy compared to other methods, which is consistent with the conclusion reported in [35]. Therefore, in contrast to the negative effects of unlabeled samples reported in [37], the consistent HSI classification performance of AWF2-GANs when the number of unlabeled samples equals the number of labeled samples demonstrates that adding unlabeled samples mitigates the small-sample effects seen in other deep learning models. It is worth mentioning that F2-Con. (with the basic concatenation feature fusion option) always yielded the highest accuracy with no unlabeled samples, because its channel concatenation retains abundant spectral radiation and detail characteristics, which improves its feature effectiveness. When an equivocal neighboring distribution is introduced, the detail characteristics are guided in a wrong direction, which decreases accuracy.

Conclusions
Hyperspectral image classification faces challenges due to redundant spectral information, weak spectral-spatial representation, and limited labeled samples. It has been demonstrated that generative adversarial networks have a strong ability to expand sample sets and generalize models for classification and feature representation. In this paper, we proposed a patch-based, semi-supervised, GAN-based classification framework with various feature fusion strategies for HSI classification to overcome the above-mentioned issues. Experimental results revealed that the feature fusion spectral-spatial combinations are more effective than the fixed spectral-spatial extractions, which took three times longer in testing. The feature fusion model contained four fusion options to adapt complementary and interconnected information for classification. The AWF2-GANs were designed to integrate the global spectral signatures and separable spatial contexts via various fusion options, and provided generated data to address small-sample issues. Furthermore, the joint loss function with the center loss captured the intra-class sensitivity from local neighboring areas and gave an efficient spatial regularization result. Quantitative experiments on two hyperspectral image datasets demonstrated that the proposed AWF2-GANs can achieve promising classification accuracy and robust performance.