Generative Adversarial Networks Based on Collaborative Learning and Attention Mechanism for Hyperspectral Image Classiﬁcation

Abstract: Classifying hyperspectral images (HSIs) with limited samples is a challenging issue. The generative adversarial network (GAN) is a promising technique to mitigate the small sample size problem. GAN can generate samples through the competition between a generator and a discriminator. However, it is difficult to generate high-quality samples for HSIs with complex spatial-spectral distributions, which may further degrade the performance of the discriminator. To address this problem, a symmetric convolutional GAN based on collaborative learning and attention mechanism (CA-GAN) is proposed. In CA-GAN, the generator and the discriminator not only compete but also collaborate. The shallow to deep features of real multiclass samples in the discriminator assist the sample generation in the generator. In the generator, a joint spatial-spectral hard attention module is devised by defining a dynamic activation function based on a multi-branch convolutional network. It impels the distribution of generated samples to approximate the distribution of real HSIs in both the spectral and spatial dimensions, and it discards misleading and confounding information. In the discriminator, a convolutional LSTM layer is merged to extract spatial contextual features and capture long-term spectral dependencies simultaneously. Finally, the classification performance of the discriminator is improved by enforcing competitive and collaborative learning between the discriminator and generator. Experiments on HSI datasets show that CA-GAN obtains satisfactory classification results compared with advanced methods, especially when the number of training samples is limited.


Introduction
In the past few decades, hyperspectral data have become more convenient and inexpensive to acquire and collect [1]. The hyperspectral image (HSI) is a three-dimensional (3D) data cube, where each pixel has hundreds of spectral bands, and each spectral band corresponds to a 2D image. It combines abundant spectral information and spatial information simultaneously. HSI processing has been used for many practical applications, such as military [2], agriculture [3], and astronomy [4]. HSI classification is the foundation for these applications, which is achieved by assigning a specific class to each pixel. It mainly involves two tasks: effective feature representation and advanced classifier design.
(1) A symmetric convolutional GAN is optimized in an end-to-end manner to alleviate the over-fitting issue of HSI classification. In CA-GAN, the sample generation is guided not only by using the loss function from the discriminator but also by using the real sample information extracted from the discriminator. It prompts the generator to generate high-quality samples by using both collaborative and competitive learning.
(2) To learn the complex spatial-spectral distribution of HSIs, a joint spatial-spectral hard attention module emphasizes more discriminative features and suppresses less useful ones in the generation of both spatial and spectral dimensions. It ensures that the generated samples approximate the real samples in their spatial-spectral distribution.
(3) In CA-GAN, the discriminator captures global spectral dependencies instead of local correlation captured by the convolutional kernels in the existing GAN methods. The classification performance of CA-GAN is improved by extracting spatial-spectral features effectively and leveraging high-quality spatial-spectral generated samples.
The remainder of this paper is organized as follows. Section 2 briefly describes the background of GAN. The proposed CA-GAN method is expounded in Section 3. Subsequently, Section 4 exhibits the experimental results and analysis. Finally, some conclusions are drawn in Section 5.

Generative Adversarial Networks
GAN was proposed by Goodfellow et al. [35]; it uses a minimax game to train the generative model from the game-theory perspective. Figure 1 shows the structure of GAN. It includes two networks. One is the generator G, whose goal is to transform the noise variable z into the generated sample G(z), which learns the distribution p_data(x) of the real data x. The other is the discriminator D, whose goal is to distinguish whether a sample is real or generated. Both G and D implement non-linear mappings by using network structures such as the multi-layer perceptron.

In simple terms, G wants to deceive D and maximize the probability that D makes a mistake by generating high-quality samples, and D wants to make the best possible distinction between real samples x and generated samples G(z). The optimization of GAN is realized by finding the Nash equilibrium between G and D. G and D are optimized by the value function V(D, G):

min_G max_D V(D, G) = E_x∼p_data(x)[log D(x)] + E_z∼p_z(z)[log(1 − D(G(z)))]

where p_z(z) represents the distribution of the noise z, and E(·) represents the empirical estimation of the expectation. When the inputs are real samples x, the outputs of D are indicated by D(x). Similarly, the outputs D(G(z)) of D correspond to the inputs from the generated samples G(z).
In the process of network optimization, the generator G and the discriminator D are optimized in an alternating way. Specifically, given G, we optimize D by maximizing E_x∼p_data(x)[log D(x)] + E_z∼p_z(z)[log(1 − D(G(z)))]. Then, with D fixed, G is optimized by minimizing E_z∼p_z(z)[log(1 − D(G(z)))]. After many iterations, the entire network reaches an optimal balance. Through the competition of the two networks, D achieves the best evaluation results, and G generates data that follows the real distribution.
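The value function above can be estimated numerically on a toy batch. The sketch below is not code from the paper: `gan_value` is a hypothetical helper, and it assumes the discriminator outputs sigmoid probabilities in (0, 1).

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-8):
    """Empirical estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on a batch of real samples.
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # E_x[log D(x)] + E_z[log(1 - D(G(z)))], with eps for numerical safety
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A confident discriminator (D(x) -> 1, D(G(z)) -> 0) yields a large V,
# while a fully fooled one (both outputs -> 0.5) yields V = 2*log(0.5).
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

Maximizing V over D and minimizing it over G reproduces the alternating optimization described above.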

The Proposed CA-GAN Method
The structure of CA-GAN is based on a symmetric convolutional GAN. CA-GAN consists of three parts: the generator based on a joint spatial-spectral hard attention module, the discriminator based on convolutional LSTM, and the classification of CA-GAN based on collaborative and competitive learning. The conceptual framework of CA-GAN is shown in Figure 2. As shown in Figure 2, in the first part, the noise and the class labels are used as the input of the generator. Then, the transposed convolutional layers and joint spatial-spectral hard attention modules are constructed to generate high-quality samples in both spatial and spectral dimensions. In the next part, the discriminator is constructed to capture joint spatial-spectral features by merging a convolutional long short-term memory (ConvLSTM) layer after the convolutional layers. In the final part, the collaborative learning mechanism is constructed between the generator and the discriminator, which have a symmetric structure. It impels the generator to generate high-quality samples by using the shallow to deep features of real samples extracted by the discriminator. The discriminator can collaborate with the generator to optimize the objective function of the generator. At the same time, the objective of the generator is to make the discriminator classify the generated samples as true classes, while the objective of the discriminator is to recognize them as generated. The classification performance of the discriminator is improved through competitive learning.



The Generator in CA-GAN Based on Joint Spatial-Spectral Hard Attention Module
In GAN, the classification performance of the discriminator is improved by utilizing the generated samples. Generating high-quality samples is pivotal for GAN-based HSI classification. However, it is difficult to approach the real HSI data in the spectral and spatial domains because of the high-dimensional spectral bands and varied spatial distributions in HSIs. Radford et al. [47] suggested using transposed convolution and convolution, without pooling layers and fully connected layers, to construct the generator and discriminator in GAN. Most GAN-based HSI methods adopt this kind of architecture, such as HSGAN [53] and MSGAN [58]. In the generator, the transposed convolution operation can generate local spatial and spectral information of HSIs. However, it treats all the features equally during the generation process. Actually, some features facilitate the distribution of generated samples to approximate that of real samples, which further promotes the classification performance of the discriminator. On the contrary, some poor or noisy features hinder the generation of high-quality samples. Therefore, it is necessary to select appropriate spatial and spectral features in the process of sample generation.
In the generator of CA-GAN, the objective function of the generator is to maximize the probability that the discriminator classifies the generated samples as true classes. A new joint spatial-spectral hard attention module is devised in the generator to retain meaningful features and suppress less useful ones along the spatial and spectral dimensions. It refines the features by using an adaptive spatial-spectral attention map. This attention map is calculated by a multi-branch convolutional network using a dynamic activation function and an element-wise subtraction operation. The spatial-spectral hard attention module is added before each transposed convolutional layer of the generator. It pays varied attention to spatial and spectral contextual features simultaneously. Finally, after adaptive feature selection, the features of the generated samples whose distribution approximates the real sample distribution are retained, and the confused and misleading ones are eliminated. The main structure of the joint spatial-spectral hard attention module is illustrated in Figure 3. It contains three branches: the conversion branch, the mask branch, and the original branch. The spatial-spectral attention map is obtained by an element-wise subtraction between the conversion and mask branches, followed by mapping with the dynamic activation function. Then, the features extracted from the original branch are refined by multiplying by the spatial-spectral attention map.
In HSIs, the training samples are 3D cubes and can be represented as X_train = {x_1, ..., x_m, ..., x_M} in an R^(n×n×d) feature space, where M is the number of training samples, n × n indicates the size of the spatial neighborhood windows, and d is the number of spectral bands. The labels of the training samples are denoted as Y = {y_1, ..., y_m, ..., y_M}, y_m ∈ {1, 2, ..., K}, where K is the number of classes. In the generator of CA-GAN, a random noise z, which follows the uniform distribution U(−1, 1), is used as the input. Moreover, the class label y_m is also used as the input. After reshaping and transposed convolution operations on the input, the generated features are represented as g(z, y) ∈ {g_1(z, y), ..., g_q(z, y), ..., g_Q(z, y)}, where 1 ≤ q ≤ Q and q indexes the corresponding layer. These generated features are input to the joint spatial-spectral hard attention module.
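The generator input described above can be sketched as follows. The noise dimension and the one-hot concatenation scheme are illustrative assumptions (the text specifies z ~ U(−1, 1) and the label y_m as inputs, but not how they are combined); `generator_input` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the noise dimension and number of classes are
# illustrative, not taken from the paper's configuration.
noise_dim, num_classes = 100, 9

def generator_input(label):
    """Build a conditional generator input: uniform noise z ~ U(-1, 1)
    concatenated with a one-hot encoding of the class label y_m."""
    z = rng.uniform(-1.0, 1.0, size=noise_dim)
    y = np.zeros(num_classes)
    y[label - 1] = 1.0          # labels run 1..K in the paper's notation
    return np.concatenate([z, y])

v = generator_input(label=3)
```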
In the joint spatial-spectral hard attention module, the converted map X and the mask map θ are obtained by using the convolution and softmax layers in the conversion and mask branches, respectively. Here, the softmax layer normalizes the feature maps into the interval [0, 1]. The converted map X measures the effectiveness of features at different spatial and spectral locations in the original feature map. The mask map θ is the corresponding dynamic threshold, which implements the feature elimination in the hard attention module. In the original branch, the convolutional layer uses 1 × 1 kernels to obtain the original feature map F_ori. Then, an element-wise subtraction operation is implemented between the conversion map X and the mask map θ. The difference value (X − θ) is in the range of [−1, 1]. Subsequently, the rectified linear unit (ReLU) is used to produce the spatial-spectral attention map A_atte by mapping the difference value into the non-linear space. The activation function can be adjusted dynamically by the change of the threshold θ. After the mapping, the spatial-spectral attention map A_atte is constrained to the range [0, 1]. Finally, the output feature map O_output of this attention module is acquired by performing the Hadamard product between the spatial-spectral attention map A_atte and the original feature map F_ori. It can be formulated as follows:

X = softmax(W_c ∗ g(z, y)), θ = softmax(W_m ∗ g(z, y)), F_ori = W_o ∗ g(z, y)
A_atte = ReLU(X − θ)
O_output = A_atte ⊙ F_ori

where '⊙' indicates the Hadamard product, '∗' denotes the convolution operator, and W_c, W_m, and W_o are the weight matrices of the conversion branch, the mask branch, and the original branch, respectively.
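The three-branch computation can be illustrated with a minimal dense sketch in which the 1 × 1 convolutions collapse to scalar weights; the scalar weights and the 4-element input are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def hard_attention(g, Wc, Wm, Wo):
    """Toy sketch of the joint hard attention module on a flat feature vector.

    X      = softmax(Wc * g)   conversion branch, scores in [0, 1]
    theta  = softmax(Wm * g)   mask branch, dynamic threshold in [0, 1]
    F_ori  = Wo * g            original branch
    A_atte = ReLU(X - theta)   dynamic activation; negative entries -> 0
    O      = A_atte * F_ori    Hadamard product refines the features
    """
    X = softmax(Wc * g)
    theta = softmax(Wm * g)
    F_ori = Wo * g
    A = np.maximum(0.0, X - theta)     # hard gating: suppressed positions are exactly zero
    return A * F_ori, A

g = np.array([2.0, -1.0, 0.5, 0.0])
O, A = hard_attention(g, Wc=1.0, Wm=-1.0, Wo=1.0)
```

Positions where the threshold exceeds the score are eliminated outright (attention exactly zero), which is the "hard" behavior distinguishing this module from soft attention.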
The spatial-spectral attention map can pay various amounts of attention to different spatial and spectral features of the generated samples. When meaningful and discriminative features are generated, the output of the activation function is positive. In this case, the spatial-spectral attention map forces the conversion map X to learn a larger score and the mask map θ to learn a smaller threshold. Thus, these meaningful and discriminative features are retained and emphasized in the generator. On the contrary, when confused and misleading features are generated, the spatial-spectral attention map makes the mask map θ learn a larger threshold. In this case, the value of (X − θ) is negative. After the activation function, the negative value becomes zero. Thus, these confused and misleading features can be eliminated in the generator. The dynamic activation function is formulated as follows:

A_atte = ReLU(X − θ) = max(0, X − θ)
In CA-GAN, the generator has four transposed convolutional layers. Each transposed convolutional layer is constructed with a convolutional kernel of 5 × 5 and is followed by a batch normalization layer. Before each transposed convolutional layer, the joint spatial-spectral hard attention module is incorporated into the generator. The sizes of the generated feature maps input to each attention module are 2 × 2 × 128, 4 × 4 × 64, 7 × 7 × 32, and 14 × 14 × 16, respectively.
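As a sanity check on these sizes, the standard transposed-convolution output-size formula reproduces the 2 → 4 → 7 → 14 progression (and a final 27 × 27 output matching the discriminator's input size) under one assumed setting of stride, padding, and output padding; the actual stride and padding values are assumptions, since the text only states the 5 × 5 kernel.

```python
def tconv_out(n_in, kernel=5, stride=2, padding=2, output_padding=0):
    """Output size of a transposed convolution along one spatial axis
    (PyTorch convention):
    (n_in - 1)*stride - 2*padding + kernel + output_padding."""
    return (n_in - 1) * stride - 2 * padding + kernel + output_padding

# One hypothetical setting (stride 2, padding 2, alternating output
# padding) that yields the paper's 2 -> 4 -> 7 -> 14 -> 27 progression.
sizes = [2]
for op in (1, 0, 1, 0):
    sizes.append(tconv_out(sizes[-1], output_padding=op))
```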
By analyzing the experiment, we found that embedding the joint spatial-spectral hard attention module in the generator has a better effect than embedding it in the discriminator. The reason may be that the discriminator easily outperforms the generator in most GANs. Therefore, embedding the joint spatial-spectral hard attention module in the discriminator has little effect on improving the classification ability of the discriminator, while embedding it in the generator will improve the generator significantly and assist the generator in generating high-quality samples.

The Discriminator in CA-GAN Based on Convolutional LSTM for Joint Spatial-Spectral Feature Extraction
HSIs often include hundreds of spectral bands, which provide valuable information to identify different land-cover classes. However, it is worth noting that using only spectral information easily degrades classification performance, especially for samples of the same class with different spectra and samples of different classes with similar spectra. In the discriminator of CA-GAN, HSIs are considered as spatial-spectral sequences, and the convolutional long short-term memory (ConvLSTM) [59] model is adopted to extract joint spatial-spectral features for HSI classification. ConvLSTM is a modification of LSTM, which can deal with temporal sequences. The hyperspectral data are densely sampled from the visible to the infrared spectrum. Since the spectral bands are approximately continuous, adjacent spectral bands have high correlation. Moreover, non-adjacent spectral bands may have long-term correlation. Thus, in ConvLSTM, the LSTM model is used to extract long-term spectral dependence in the spectral domain, and the convolution operator is incorporated into the LSTM network to extract spatial features across the spatial domain.
In CA-GAN, the input of the discriminator is the training sample x_i and the generated sample G(z, y_i). The main construction of the discriminator in CA-GAN is shown in Figure 4. In the discriminator, hierarchical features of input samples are extracted by four convolutional layers. d(·) represents the features extracted by these convolutional layers, which are considered from the perspective of the spatial-spectral sequence. These features are input to ConvLSTM along the spectral channel sequentially. ConvLSTM captures the long-range dependencies among spectral bands by using the memory cell, and it extracts spatial information by using the convolution operator in the forget and input gates.
Specifically, the features d(·) are divided into several 3D cubes (d(·)_1, ..., d(·)_s, ..., d(·)_S) along the spectral channel, where S is the number of cubes, and these cubes are input to ConvLSTM in sequence. At the s-th moment, d(·)_s is input to ConvLSTM. c_{s−1} and h_{s−1} represent the memory cell and hidden state of the (s−1)-th moment, respectively. The current memory cell c_s is updated by calculating the input d(·)_s, the memory cell c_{s−1}, and the hidden state h_{s−1} through the forget and input gates f_s and i_s. The current hidden state h_s is computed via the forget gate f_s, the input gate i_s, and the output gate o_s. Then, at the (s+1)-th moment, the output o_{s+1} is calculated from the hidden state h_s of the previous moment and the input d(·)_{s+1}. The memory cell c_{s+1} and hidden state h_{s+1} of the (s+1)-th moment are updated in the same way as at the s-th moment. Finally, long-term spectral dependencies are extracted through the recursion from the previous cell to the next. At each moment, spatial information is extracted by the convolution operation of the input gate from the current moment and the forget gate from the previous hidden state. Thus, the spatial contextual correlation and long-term spectral dependencies of generated samples and real samples can be captured simultaneously in the discriminator of CA-GAN.
In the discriminator of CA-GAN, the input is the real samples and the generated samples with the same size of 27 × 27 × 20. The discriminator extracts hierarchical features by using four convolutional layers with a convolutional kernel size of 5 × 5. The sizes of the feature maps extracted by the convolutional layers are 14 × 14 × 16, 7 × 7 × 32, 4 × 4 × 64, and 2 × 2 × 128, respectively. Then, the ConvLSTM layer is merged after the convolutional layers to extract joint spatial-spectral information. In ConvLSTM, the padding operation is used during the convolution process, and the size of the convolutional kernel is 2 × 2. Next, a fully connected layer is added after the ConvLSTM layer. Finally, the classification is implemented through a softmax layer in the discriminator. The softmax classifier predicts the class y ∈ {1, 2, ..., K, K + 1} of the input samples. In this process, the objective function of the discriminator is to maximize the probability of classifying the real samples as the true K classes and the generated samples as the (K + 1)-th class.
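The gate updates described above can be sketched with a minimal NumPy ConvLSTM cell. This is a toy sketch, not the paper's implementation: the 2 × 2 kernels match the stated ConvLSTM kernel size, but the omission of biases, the weight values, and the 4 × 4 spatial size are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-size 2D convolution (cross-correlation) with zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, kh - 1 - ph), (pw, kw - 1 - pw)))
    out = np.zeros_like(x)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convlstm_step(x_s, h_prev, c_prev, W):
    """One ConvLSTM moment s: the gates use convolutions instead of the
    matrix products of a plain LSTM, so spatial structure is preserved.
    W maps gate name -> (kernel for input, kernel for hidden state)."""
    f = sigmoid(conv2d_same(x_s, W['f'][0]) + conv2d_same(h_prev, W['f'][1]))
    i = sigmoid(conv2d_same(x_s, W['i'][0]) + conv2d_same(h_prev, W['i'][1]))
    o = sigmoid(conv2d_same(x_s, W['o'][0]) + conv2d_same(h_prev, W['o'][1]))
    g = np.tanh(conv2d_same(x_s, W['g'][0]) + conv2d_same(h_prev, W['g'][1]))
    c = f * c_prev + i * g          # memory cell carries long-term spectral dependence
    h = o * np.tanh(c)              # hidden state passed to the next spectral cube
    return h, c

rng = np.random.default_rng(1)
H = Wd = 4                          # toy spatial size
W = {name: (rng.normal(size=(2, 2)) * 0.1, rng.normal(size=(2, 2)) * 0.1)
     for name in ('f', 'i', 'o', 'g')}
h, c = np.zeros((H, Wd)), np.zeros((H, Wd))
for s in range(3):                  # three spectral cubes d(.)_s in sequence
    h, c = convlstm_step(rng.normal(size=(H, Wd)), h, c, W)
```

The recursion over s is what lets the memory cell accumulate long-range spectral dependencies, while each convolution inside a gate mixes a local spatial neighborhood.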

Classification of CA-GAN Based on Collaborative and Competitive Learning
In HSIs, the generation task is notoriously difficult due to the increasing data complexity, such as high dimension and complex spatial distribution. In GAN, the quality of the generated samples is not guaranteed, which may further degrade the classification performance of the discriminator. In addition, when the samples are generated by the generator, the generator itself has no way to evaluate the generated samples directly. GAN only uses the judgment of the discriminator to learn the distribution of real samples, which acts as a loss function to provide a learning signal to the generator. The generator is improved through the competition process between the generator and the discriminator. However, it is difficult to generate complex HSI data by only using the objective function. Moreover, the classification ability of the discriminator is easily superior to the generation ability of the generator. It indicates that there is information in the discriminator that the generator can use to assist sample generation. Inspired by this idea, CA-GAN uses additional information from the discriminator to assist sample generation in the generator.
In CA-GAN, a collaborative learning mechanism is devised between the generator and the discriminator, which is achieved by adding shallow and deep features of real multiclass samples in the discriminator to the generator. It is constructed by fusing each pair of corresponding feature maps of the same size in the generator and the discriminator. In the generator, the fused generated features are input to the next layer. This mechanism brings many advantages. It breaks with the traditional optimization that relies only on competition between the generator and the discriminator. By utilizing additional information from the discriminator, the generator of CA-GAN can not only compete but also collaborate with the discriminator. Additionally, it alleviates the problem that the generator is optimized only by the objective function from the discriminator. By utilizing collaborative learning, the diversity of the generated samples can be improved, so the model is less prone to mode collapse.
The specific process of the collaborative learning mechanism is as follows. In the discriminator of CA-GAN, the generated samples and the real samples are used as the input. The features extracted by the four convolutional layers from a real sample x_i are represented as d(x_i) = {d_1(x_i), d_2(x_i), d_3(x_i), d_4(x_i)}. In the generator of CA-GAN, the features generated by the four transpose convolutional layers have the same sizes as the features extracted by the four convolutional layers in the discriminator. By summing the features from the real samples in the discriminator and the corresponding generated features of equal sizes in the generator, the new fused generated features g*(z, y_i) = {g_1*(z, y_i), g_2*(z, y_i), g_3*(z, y_i), g_4*(z, y_i)} are obtained. These features are formulated as follows:

g_u*(z, y_i) = g_u(z, y_i) ⊕ d_j(x_i),  u = 1, 2, 3, 4,

where d_j(x_i) represents the real sample features of the discriminator with the same size as the generated features g_u(z, y_i), and '⊕' represents the element-wise summation operation.
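For concreteness, the element-wise fusion above can be sketched as follows (a minimal NumPy illustration; the four feature-map shapes are made-up stand-ins for the actual layer outputs, not the paper's architecture):

```python
import numpy as np

def fuse_features(gen_feats, disc_feats):
    """Collaborative fusion: element-wise sum of each generator feature
    map g_u(z, y_i) with the discriminator feature map d_j(x_i) of the
    same size, yielding the fused features g_u*(z, y_i)."""
    fused = []
    for g_u, d_j in zip(gen_feats, disc_feats):
        assert g_u.shape == d_j.shape, "fused maps must have equal sizes"
        fused.append(g_u + d_j)  # the '⊕' operation: element-wise summation
    return fused

# Toy feature maps from four (transpose-)convolutional layers;
# the (channels, size) pairs here are illustrative assumptions.
shapes = [(64, 4), (32, 8), (16, 16), (8, 27)]
gen = [np.ones((c, s, s)) for c, s in shapes]
disc = [np.full((c, s, s), 2.0) for c, s in shapes]
fused = fuse_features(gen, disc)
print(fused[0][0, 0, 0])  # 1 + 2 = 3.0
```

The fused maps replace the generator's own feature maps as input to the next transpose convolutional layer.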
In CA-GAN, the novel adversarial and collaborative objective functions of G and D are defined as follows:

l_D = Σ_{i=1}^{N} l(D(x_i), y_i) + Σ_{i=1}^{N} l(D(G(z, y_i)), y_{K+1}),   (5)

l_G = Σ_{i=1}^{N} l(D(G(z, y_i)), y_i),   (6)

where l_D and l_G represent the objective functions of the discriminator and the generator, D(·) indicates the discriminator output, and l(·) expresses the cross entropy.
As shown in Equation (5), for the real samples, the first term Σ_{i=1}^{N} l(D(x_i), y_i) of l_D indicates that the discriminator expects to assign high probabilities to their true classes. For the generated samples, l_G and l_D are not only adversarial but also collaborative. On the one hand, l_G indicates that the generator expects the discriminator to classify the generated samples into their true classes, while l_D expects to classify these generated samples as y_{K+1}. On the other hand, the real sample features from the discriminator are used to assist the sample generation in the generator. By using collaborative learning, high-quality samples are generated. At the same time, the classification ability of the discriminator is facilitated by competitive learning. Finally, after the generator and discriminator are updated by alternating optimization, the well-trained discriminator in CA-GAN is used for HSI classification.
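A toy numerical sketch of these two objectives for a single sample may help (the logits and class count below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def cross_entropy(logits, target):
    """l(D(.), y): cross entropy of softmax(logits) against class index target."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target])

K = 3                                     # number of real classes
# The discriminator outputs K+1 logits; index K is the 'fake' class y_{K+1}.
d_real = np.array([2.0, 0.1, 0.1, 0.1])   # D(x_i) for a real sample of class 0
d_fake = np.array([0.1, 0.1, 0.1, 2.0])   # D(G(z, y_i)) for a generated sample
y_i, y_fake = 0, K

# l_D: push real samples to their true classes, generated samples to y_{K+1}.
l_D = cross_entropy(d_real, y_i) + cross_entropy(d_fake, y_fake)
# l_G: the generator wants generated samples classified as the true class y_i.
l_G = cross_entropy(d_fake, y_i)
print(l_D, l_G)
```

With these logits the discriminator already classifies both inputs as it wishes, so l_D is small, while l_G is large, which is exactly the signal that drives the generator's update.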

The Procedure of CA-GAN
The proposed CA-GAN method combines a joint spatial-spectral hard attention module, a convolutional LSTM layer, and a collaborative learning mechanism into a unified optimization procedure. The detailed process of the designed CA-GAN method is described in Table 1. Table 1. The procedure of the convolutional GAN based on collaborative learning and attention mechanism (CA-GAN).

1. INPUT: the training data X_train = {x_1, ..., x_m, ..., x_M} and the test data X_test = {x_1^test, x_2^test, ..., x_R^test} from K classes, the class labels of the training samples y ∈ {y_1, ..., y_k, ..., y_K}, the mini-batch size B, the number of training epochs E
2. Begin
3. Initialize: randomly initialize the parameters θ_d and θ_g of the discriminator and the generator
4. For E epochs do
5. Input the training samples into the discriminator to obtain the real sample features
Generate samples G(z, y_i) by using the fused generated features
13. Input the generated samples and the training samples to the discriminator
14. Compute the objective function l_D of the discriminator
15. Update the parameters θ_g of the generator G by minimizing l_G
17. Update the parameters θ_d of the discriminator D by minimizing l_D
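The alternating optimization in Table 1 can be sketched as follows (a schematic Python loop; the `update_d`/`update_g` functions are hypothetical placeholders standing in for the actual gradient steps on l_D and l_G):

```python
import random

def train_ca_gan(X_train, epochs, batch_size, update_d, update_g):
    """Alternate discriminator/generator updates over E epochs,
    mirroring the loop structure of Table 1."""
    history = []
    for epoch in range(epochs):                            # for E epochs do
        batch = random.sample(X_train, min(batch_size, len(X_train)))
        l_d = update_d(batch)   # input samples to D, compute l_D, update θ_d
        l_g = update_g(batch)   # generate fused samples, update θ_g via l_G
        history.append((l_d, l_g))
    return history

# Toy stand-ins: losses that shrink on each call, mimicking convergence.
state = {"d": 1.0, "g": 2.0}
def update_d(batch): state["d"] *= 0.9; return state["d"]
def update_g(batch): state["g"] *= 0.9; return state["g"]

hist = train_ca_gan(list(range(100)), epochs=5, batch_size=16,
                    update_d=update_d, update_g=update_g)
print(len(hist))  # 5
```

After training, only the discriminator is kept and used as the HSI classifier.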

Data Description
The three hyperspectral datasets used in the experiments are described in detail as follows.
(1) Indian Pines: This scene was obtained in 1992 from Northwest Indiana. It contains 145 × 145 pixels and 224 spectral bands. In this paper, 200 spectral bands are adopted for analysis. The Indian Pines dataset contains 16 vegetation classes. The false-color image (bands 50, 27, 17) and its ground truth are shown in Figures 5a and 6a.

(2) Pavia University: Pavia University was captured in 2002 from northern Italy. It is composed of 610 × 340 pixels and 115 spectral bands. In this paper, 103 spectral bands are analyzed after removing 12 noise bands. It includes 9 classes. Figures 5b and 6b show the false-color composite image (bands 53, 31, 8) and the ground truth of this dataset.
(3) Washington: The Washington dataset was obtained at the Washington DC mall in 1995. It includes 750 × 307 pixels, and the geometric resolution of each pixel is 2.8 m. In the experiments, 191 spectral bands are used for analysis. It includes 7 different categories. Figures 5c and 6c show the false-color composite image (bands 70, 53, 50) of the Washington dataset and the ground truth.

Experimental Setting
To demonstrate the effectiveness of the CA-GAN algorithm, seven representative HSI classification methods are used for comparison: RBF-SVM [16], SAE [20], DBN [24], PPF-CNN [34], CRNN [30], HSGAN [53], and 3D-GAN [57]. In the experiments, the input size affects the classification performance. For a fair comparison, all the comparison algorithms use their optimal parameters. For RBF-SVM, five-fold cross-validation is utilized to obtain the penalty and gamma parameters. In SAE, the radius of the spatial window is set as 7. For DBN, a spatial window of 5 × 5 is used as the input to the network. For PPF-CNN, the value of the spatial window size is set according to the literature [34]. For CRNN, the batch size is set as 128, and the other parameters are suggested in the literature [30]. For HSGAN, as suggested in [53], the convolutional kernel sizes are set as 1 × 3 and 1 × 5, and the number of training epochs is set as 200. For 3D-GAN, the spatial window of the 3D input is set as 64 × 64 × 3, and the convolutional kernel sizes are set according to the literature [57].
In CA-GAN, the main architecture and parameters are listed in Table 2, where G and D represent the generator and the discriminator. As suggested in the literature [57], the dimension of the input noise z is 100 × 1 × 1, and the number of training epochs is 600. By using a trial-and-error procedure, the learning rates of the discriminator and generator are set as 0.008 and 0.035. In the pre-processing step, PCA is used to reduce the dimensionality and retain 20 principal components of the HSIs. Then, each sample of the reduced HSI data is represented by a 27 × 27 spatial window centered on this sample. In this way, a 27 × 27 × 20 cube is extracted to represent each sample in HSIs. In this paper, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) are adopted to evaluate the classification performance of each algorithm. The final results are obtained from 30 independent training runs. The experiments are implemented in Python with the TensorFlow library on an NVIDIA 2080Ti graphics card.
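The PCA-plus-window pre-processing described above can be sketched as follows (a minimal NumPy version; the mirror padding at the image borders is our assumption, since the paper does not specify how edge pixels are handled):

```python
import numpy as np

def extract_patches(hsi, n_components=20, window=27):
    """Reduce the spectral dimension with PCA, then cut a
    window x window x n_components cube centered on each pixel."""
    h, w, bands = hsi.shape
    flat = hsi.reshape(-1, bands).astype(np.float64)
    flat -= flat.mean(axis=0)
    # PCA via SVD: project onto the top n_components principal axes.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = (flat @ vt[:n_components].T).reshape(h, w, n_components)
    r = window // 2
    padded = np.pad(reduced, ((r, r), (r, r), (0, 0)), mode="reflect")
    cubes = np.empty((h * w, window, window, n_components))
    k = 0
    for i in range(h):
        for j in range(w):
            cubes[k] = padded[i:i + window, j:j + window, :]
            k += 1
    return cubes

# Toy HSI cube: 10 x 10 pixels with 30 bands, reduced to 5 components
# and cut into 7 x 7 patches (smaller than the paper's 27 x 27 x 20 for speed).
cubes = extract_patches(np.random.rand(10, 10, 30), n_components=5, window=7)
print(cubes.shape)  # (100, 7, 7, 5)
```

With the paper's settings (20 components, a 27 × 27 window), each pixel yields a 27 × 27 × 20 cube.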

Experimental Results
(1) Classification results of the Indian Pines dataset: For the labeled samples, we randomly selected 5% from each class for training. Table 3 lists the number of training and test samples in the experiment. The quantitative evaluations of various methods are displayed in Table 4, including the classification accuracies of different classes and the OA, AA, and Kappa for different methods. Among the eight algorithms, the most accurate values are marked in gray. As shown in Table 4, deep learning-based methods are superior to RBF-SVM because they extract hierarchical non-linear features. PPF-CNN achieves better classification results than SAE and DBN by expanding the training samples. CRNN obtains better classification results than PPF-CNN by using a recurrent neural network (RNN) to capture the spectral dependence of HSIs. Compared with HSGAN, 3D-GAN improves the classification performance because it fully uses joint spatial-spectral information. Among these comparison methods, CA-GAN obtains the best classification performance in most classes by leveraging high-quality generated samples, especially in the classes having fewer samples. Additionally, among all the comparison methods, CA-GAN achieves the best classification accuracies in the OA, AA, and Kappa, which improve by at least 3.9%, 3.1%, and 3.9%, respectively.
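The OA, AA, and Kappa indexes reported in these tables follow their standard definitions; a minimal sketch (not the authors' code) is:

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, num_classes):
    """Compute overall accuracy (OA), average accuracy (AA), and the
    Kappa coefficient from reference and predicted labels."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                             # fraction correct overall
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))        # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / (n * n)  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy example: 6 labeled pixels from 3 classes, one error in class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = oa_aa_kappa(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.833 0.833 0.75
```

OA and AA coincide here because the classes have equal sizes; on imbalanced HSI classes, AA weights every class equally while OA is dominated by the large classes.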
The classification visualization of various algorithms on the Indian Pines dataset is shown in Figure 7. From Figure 7a-h, we can see that RBF-SVM, SAE, DBN, PPF-CNN, and HSGAN produce some noisy scattered points and misclassify many samples in the alfalfa, grass-pasture-mowed, oats, and buildings-grass-trees-drives classes. Compared with these methods, CRNN, 3D-GAN, and CA-GAN significantly reduce the noisy scattered points and effectively improve the regional uniformity. In comparison with the other methods, CA-GAN has better regional uniformity in the wheat and corn-mintill classes, and it shows a more accurate boundary for the grass-trees class.
(2) Classification results of the Pavia University dataset: We randomly selected 2% of the labeled data to train the network. The number of training and test samples is shown in Table 5. Table 6 shows the quantitative results of various methods; the most accurate results of the eight algorithms are marked in gray. As shown in Table 6, PPF-CNN and CA-GAN classify the painted metal sheet class completely correctly. The classification of the gravel and bitumen classes is significantly improved by CA-GAN. CA-GAN improves by at least 23.8% compared with PPF-CNN in the bitumen class. For the gravel class, CA-GAN improves by 37.6%, 29.0%, 30.8%, 31.8%, 11.8%, 15.8%, and 9.6% compared with the other seven methods by using high-quality generated samples. The classification accuracies of CA-GAN for all the classes are over 96%. Moreover, CA-GAN exhibits the best classification performance in the three evaluation indexes.
The classification visualization of various algorithms on the Pavia University dataset is shown in Figure 8. As shown in Figure 8, the bare soil class is misclassified by RBF-SVM, SAE, DBN, PPF-CNN, and HSGAN. Compared with these methods, CA-GAN shows greater regional uniformity in this class. Many samples in the bitumen class are misclassified due to their spectral similarity to the asphalt class; CA-GAN improves the classification of these two classes. Compared with the other seven algorithms, CA-GAN has better boundary integrity in the shadows class and better regional uniformity in the gravel and self-blocking bricks classes.
(3) Classification results of the Washington dataset: We randomly picked 3% of the labeled samples to train the CA-GAN. The number of training and test samples is listed in Table 7. Table 8 shows the quantitative results of various methods. From Table 8, RBF-SVM misclassifies many samples in the roofs class, and CRNN misclassifies many samples in the water class. Compared with RBF-SVM, CA-GAN improves by 10.6% for the roofs class. Compared with CRNN, CA-GAN improves by 13.4% for the water class. Compared with the other seven methods, CA-GAN obtains the highest OA, AA, and Kappa values. It improves by 5.8%, 4.9%, 5.4%, 3.8%, 3.6%, 7.0%, and 2.3% compared with the other seven methods in the OA index.
Figure 9 shows the classification visualization of various algorithms on the Washington dataset. From Figure 9, we can see that DBN and CRNN misclassify the water and shadows classes. The proposed CA-GAN method achieves better classification performance for these two classes. For the roads class, the RBF-SVM, SAE, DBN, CRNN, HSGAN, and 3D-GAN methods all show different degrees of misclassification. In contrast to these methods, PPF-CNN and CA-GAN show better regional uniformity in the roads class. Compared with PPF-CNN, CA-GAN shows better regional uniformity in the roofs class. In addition, compared with the other seven methods, CA-GAN shows better boundary integrity in the trees class.
Tables 9-11 show the training and test time of various methods on the three datasets. From Tables 9-11, RBF-SVM and DBN consume less time than the other methods in the training procedure due to their 1D inputs. HSGAN, 3D-GAN, and CA-GAN require less training time than PPF-CNN and CRNN, but more than the remaining methods, because a GAN needs a great deal of time to optimize the generator and discriminator alternately. Compared with HSGAN and 3D-GAN, CA-GAN spends a longer time due to the additional parameters of the attention module and the convLSTM layer. Among all the methods, PPF-CNN and CRNN are the most time-consuming in terms of the training time. The computing time of PPF-CNN is mainly consumed in the augmentation of training samples, especially when the training samples are numerous. CRNN is time-consuming due to its recurrent neural network. In the testing procedure, PPF-CNN and CRNN cost more time because PPF-CNN adopts a voting strategy with the surrounding samples and CRNN adopts a complex recurrent network. CA-GAN takes a similar testing time to 3D-GAN, costing 0.3 s, 0.6 s, and 0.3 s on the three datasets, respectively.
From Figure 10, the classification accuracy of the eight methods rises quickly as the percentage of training samples increases. When the training samples are large enough, the classification accuracy of all the comparison methods changes slowly and tends to be stable. 3D-GAN and CA-GAN outperform RBF-SVM, SAE, DBN, CRNN, PPF-CNN, and HSGAN on the three datasets with different percentages of training samples.
Compared with PPF-CNN, HSGAN, and 3D-GAN, CA-GAN consistently provides excellent classification performance with different percentages of training samples. When the proportion of training samples is only 1%, CA-GAN improves by at least 6.1%, 5.6%, and 5.5% on the three datasets, respectively. Thus, CA-GAN is suitable for a limited number of training samples.

Influence of Different Numbers of Principal Components in CA-GAN
To verify the effectiveness of the proposed method with different numbers of principal components, we change the number of principal components retained by PCA. Tables 12-14 record the classification results and training time of the proposed method under various numbers of PCA components, as well as of the proposed method without PCA-based pre-processing.
As shown in Tables 12-14, the classification accuracy of CA-GAN on the three datasets first increases and then decreases with the increasing dimensionality of PCA. Compared with CA-GAN with PCA-20, CA-GAN with PCA-50 improves by 0.2%, 0.2%, and 0.3% on the three datasets, respectively. Although the classification accuracy is improved to some extent, more principal components lead to higher computational complexity and a longer training time. The training time of CA-GAN with PCA-50 is much longer than that of CA-GAN with PCA-20. When the principal components of PCA are further increased, the classification performance deteriorates slightly.

Effectiveness of Each Step in CA-GAN
As shown in Table 15, compared with CA-GAN-WCAC, CA-GAN-WCA increases by 2.0%, 1.4%, and 1.5% in the OA index on the three datasets. It shows that collaborative learning can effectively improve the classification performance. Compared with CA-GAN-WCA, CA-GAN-WC improves by 1.0%, 1.3%, and 1.3% in the OA index on the three datasets. It indicates that adding the joint spatial-spectral hard attention module can facilitate the classification performance by improving the quality of the generated samples. Compared with CA-GAN-WC, CA-GAN uses ConvLSTM to promote the classification performance by extracting joint spatial-spectral features of HSIs. Compared with CA-GAN-WC, CA-GAN-WCA, and CA-GAN-WCAC, CA-GAN shows the best classification results in the AA, OA, and Kappa on the three datasets.

Conclusions
In this paper, a novel CA-GAN method has been designed to solve the small sample problem in HSI classification. In the generator, a joint spatial-spectral hard attention module is devised to discard misleading and confounding features of the generated samples and impel the distribution of generated samples to approximate the distribution of real HSIs. In the discriminator, a convolutional LSTM layer is merged to extract joint spatial-spectral information of HSIs. Additionally, a collaborative learning mechanism is designed to assist the sample generation in the generator by using the real sample information extracted by the discriminator. It enables the generator and discriminator to be optimized alternately not only through competition but also in a collaborative manner. These designs enable CA-GAN to improve the classification performance of HSIs with limited training samples by using the high-quality generated samples. The experimental results validated that CA-GAN can obtain better HSI classification results than other advanced methods. In the future, we will investigate how to determine the positions and numbers of the various modules in CA-GAN more effectively and automatically. In addition, we will try other sampling strategies to reduce the overlap between the training and testing sets of HSIs.