Generative Adversarial Networks for Zero-Shot Remote Sensing Scene Classification

Abstract: Deep learning-based methods have succeeded in remote sensing scene classification (RSSC). However, current methods require training on a large dataset, and a model does not work well on classes that do not appear in the training set. Zero-shot classification methods are designed to address the classification of unseen category images, and the generative adversarial network (GAN) is a popular tool for this. Our approach therefore aims to achieve zero-shot RSSC based on GAN. We employed the conditional Wasserstein generative adversarial network (WGAN) to generate image features. Since remote sensing images exhibit inter-class similarity and intra-class diversity, we introduced a classification loss, a semantic regression module, and a class-prototype loss to constrain the generator. The classification loss preserves inter-class discrimination. The semantic regression module ensures that the image features generated by the generator can represent the semantic features. The class-prototype loss ensures the intra-class diversity of the synthesized image features and avoids generating overly homogeneous image features. We also studied the effect of different semantic embeddings on zero-shot RSSC. Experiments on three datasets show that our method performs better than state-of-the-art methods for zero-shot RSSC in most cases.


Introduction
Over the past decades, remote sensing has entered the era of big data, and remote sensing technology has made excellent progress [1]. It has played an important role in environmental monitoring [2], urban construction planning [3], and land use classification [4]. Most current remote sensing scene classification approaches rely on supervised learning [4][5][6], which can obtain excellent results when trained on large-scale datasets [7]. Usually, this requires labeling a large amount of image data for each remote sensing image class. When we classify an image, the classifier does not work well if the category of the image is not in the training set, which is referred to as the zero-shot problem [8]. With the explosive growth of remote sensing image classes, we will encounter many new classes, and it is unrealistic to collect sufficient image samples for each of them. It therefore makes sense to identify images of unseen classes when the training set does not contain the corresponding class, for instance, a rare aircraft type.
In order to overcome these problems in remote sensing images, it is of great interest to study zero-shot learning methods for RSSC. Zero-shot classification methods for remote sensing images are designed to address the classification for unseen category images, and more and more researchers are focusing on this topic. This can alleviate manual labeling, reduce a lot of labor, and make the existing classification model more scalable.
Humans can recognize about 30,000 object categories [8] and can transfer their knowledge to identify new classes when only textual descriptions of the new classes are available. The key for zero-shot learning to identify unseen classes is that it uses both semantic and visual information from all image classes. Semantic information builds a bridge of knowledge transfer between seen and unseen categories, breaking the boundary of category mutual exclusion between training and testing sets.
The concept of zero-shot learning can be traced back to 2008, when Larochelle first introduced zero-shot learning [8]. Lampert et al. [9] first extended this approach to the field of computer vision. ZSL methods can be summarized into two types. The first is embedding-based methods, which map image features and semantic embeddings into one space and perform metric learning for ZSL. Zhang et al. [10] solved the zero-shot learning task using the semantic similarity embedding (SSE) method, which fuses multiple kinds of semantic information into the same semantic space and calculates their similarity. Frome et al. [11] used a novel visual-semantic embedding approach that exploits annotated image data and semantic information from unannotated text to recognize visual objects. Li et al. [12] presented a dual visual-semantic mapping framework for zero-shot classification (DMaP), which studies the relationship between the semantic space manifold and the transferability of visual-semantic mapping. However, these methods do not work well in the generalized zero-shot learning (GZSL) setting because they overfit to seen classes. The second is GAN-based methods, which generate fake image features for unseen classes and train a KNN [13] or softmax classifier to perform ZSL. Xian et al. [14] first used generative adversarial networks (GANs) [15] to convert semantic features to visual features, providing a new idea for zero-shot learning. Felix et al. [16] added a cycle-consistency loss term, allowing the GAN to generate more realistic features. Li et al. [17] introduced soul samples for generative zero-shot learning and used a cascade classifier to classify the unseen classes. Yu et al. [18] presented an episode-based training pattern to improve model performance for zero-shot learning. GAN-based methods generally outperform embedding-based methods.
In the remote sensing field, Li et al. [19] first proposed a zero-shot remote sensing scene classification (RSSC) method called ZSL-LP, which constructs a semantic-directed graph and then uses a label-propagation algorithm for zero-shot classification. Quan et al. [20] introduced a novel zero-shot RSSC method that relies on Sammon embedding and spectral clustering; they modified the semantic feature space class prototypes by Sammon embedding to ensure consistency with the visual feature space class prototypes. Wang et al. [21] introduced a distance-constrained semantic autoencoder for zero-shot remote sensing scene classification. Sumbul et al. [22] conducted a zero-shot study of fine-grained remote sensing image recognition; they first learned a compatibility function and then showed how to transfer knowledge to unseen classes. Although these approaches have achieved promising results, most are devoted to designing visual-semantic embedding models only with seen classes, which makes it difficult to guarantee a good extension to unseen classes. In addition, models trained only with seen data tend to misclassify unseen test instances into seen categories, creating a special imbalanced classification problem. Generative adversarial networks can alleviate the above issues to some extent [14,17,23]. As far as we know, no one has tried to apply generative adversarial networks to zero-shot RSSC. GAN-based methods designed for ordinary images cannot be applied to remote sensing images directly because remote sensing scenes are complex: remote sensing images generated by a GAN model may suffer from instability and mode collapse, and remote sensing images exhibit inter-class similarity and intra-class diversity. Overall, zero-shot RSSC deserves more exploration.
We propose a novel approach for zero-shot RSSC with the above considerations in mind. Since remote sensing image datasets do not directly provide class attribute information, we used four natural language processing models pre-trained on Wikipedia, namely Word2vec [24], Glove [25], Fasttext [26], and Bert [27], to obtain word vectors as the semantic information we need. Because of the complexity of remote sensing scenes, we employed the conditional Wasserstein generative adversarial network [28] to generate image features directly instead of images, avoiding instability and mode collapse during training. We trained a generator that can generate an arbitrary number of class image features from class semantic information, converting the zero-shot classification problem into a traditional classification problem. Following the characteristics of remote sensing images, we used a classification loss, a semantic regression module, and a class-prototype loss to constrain the generator so that it can generate image features close to the real images from the class semantic information. Our model is referred to as CSPWGAN. The classification loss is used to preserve inter-class discrimination. The semantic regression module ensures that the image features generated by the generator can represent the semantic features. The class-prototype loss constrains the generator during training to ensure intra-class diversity of the synthesized image features and avoid generating overly homogeneous image features. The contributions of this study are as follows:

1. We trained a generator that can generate class image features close to the real image features from class semantic information. We propose well-designed modules to constrain the generator, including a classification loss module, a class-prototype loss module, and a semantic regression module. To the best of our knowledge, we are the first to employ generative adversarial networks for zero-shot remote sensing scene classification (RSSC);
2. We explored the effect of different semantic embeddings on zero-shot RSSC. Specifically, we investigated various natural language processing models, i.e., Word2vec, Fasttext, Glove, and Bert, to extract semantic embeddings for each class either from the class name or from class sentence descriptions. Our conclusions may help future work in understanding and choosing semantic embeddings for zero-shot RSSC;
3. We conducted experiments on three benchmark datasets, UCM21 [29], AID30 [30], and NWPU45 [7]. The experimental results show that our method performs better than most state-of-the-art methods in zero-shot RSSC.

Methods
In this section, we first define the remote sensing image zero-shot classification task. Then we present our generative framework for zero-shot RSSC. Finally, we introduce each part of the model in detail.

Problem Definition
The dataset can be denoted as D = {(x, y, e(y))}, where x is the visual feature of a remote sensing image, y is its class label, and e(y) denotes the class semantic feature. The dataset D is split into a seen dataset and an unseen dataset: D^s = {(x_i^s, y_i^s, e(y_i^s))}_{i=1}^{N^s} represents the seen dataset, and D^u = {(x_i^u, y_i^u, e(y_i^u))}_{i=1}^{N^u} represents the unseen dataset. The class label set y includes the seen class label set y^s and the unseen class label set y^u, which are disjoint: y = y^s ∪ y^u and y^s ∩ y^u = ∅. The purpose of the ZSL task is to predict class labels in y^u, and the purpose of the GZSL task is to predict class labels in y.
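As a concrete illustration of this split (a minimal Python sketch with hypothetical class names, not the actual dataset splits used in the paper):

```python
# Illustrative ZSL/GZSL label-space split. The class names are hypothetical
# examples, not the actual dataset splits used in the paper.
seen_classes = {"beach", "forest", "harbor"}      # y^s: available at training time
unseen_classes = {"airport", "storage tanks"}     # y^u: never seen during training

all_classes = seen_classes | unseen_classes       # y = y^s ∪ y^u

# Seen and unseen label sets must be mutually exclusive: y^s ∩ y^u = ∅.
assert seen_classes & unseen_classes == set()

# ZSL predicts only over unseen labels; GZSL predicts over the full label set.
zsl_label_space = unseen_classes
gzsl_label_space = all_classes
print(sorted(gzsl_label_space))
```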

Overall Framework
In this section, we present the overall framework of our CSPWGAN model, as shown in Figure 1. It consists of two main parts: the class semantic feature representation module and the feature generation module.

Class Semantic Feature Representation Module. Typical zero-shot task datasets usually contain class attribute vectors [9]. Since there are no class attribute vectors in remote sensing image datasets, we used natural language pre-training models to generate the word vectors we need.
Feature Generation Module. We used a generative model to train a generator that can generate class image features from class semantic features, which converts the zero-shot remote sensing scene classification problem into a traditional classification problem and avoids the hubness problem [31]. In addition, using generative adversarial networks can prevent the class imbalance problem under the GZSL setting. Meanwhile, considering that remote sensing images exhibit inter-class similarity and intra-class diversity, we used a classification loss, a semantic regression module, and a class-prototype loss to constrain the generator. The classification loss preserves inter-class discrimination. The semantic regression module ensures that the image features generated by the generator can represent the semantic features. The class-prototype loss ensures intra-class diversity of the synthesized image features and avoids generating overly homogeneous image features. The key to solving this problem is obtaining a high-performance generator; the multiple losses jointly constrain the generator to achieve better results.

Class Semantic Feature Representation Module
Zero-shot learning aims to solve the problem of recognizing unseen classes that cannot be accomplished by traditional supervised learning. The key is that zero-shot learning uses not only visual features but also class semantic features. Class semantic information builds a bridge of knowledge transfer between seen and unseen categories, breaking the boundary of category mutual exclusion between training and testing sets. There are usually two ways to extract semantic features: attribute vectors and word embeddings.
Attribute vectors are manually annotated using expert knowledge, and the attribute space is shared among all object classes. They are the most common and efficient form of semantic feature construction. Attribute vectors can leverage prior human knowledge with good interpretability and accuracy. At the same time, their disadvantage is that they are highly dependent on manual annotation and are challenging to produce in the absence of prior human knowledge.
Using natural language pre-training models to generate the word vectors we need is fast and simple and does not require prior knowledge. Word2vec [24], Glove [25], Fasttext [26], and Bert [27] are the most commonly used methods. These methods can be trained on open-source corpora (e.g., Wikipedia), which significantly saves the cost of manual annotation. We can use them to convert class names or class text descriptions into word vectors. For example, we can use Word2vec to convert remote sensing image class names into 300- or 500-dimensional word vectors and use Bert to convert descriptions of remote sensing image classes into 768-dimensional word vectors.
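One common way to embed a multi-word class name such as "golf course" is to average its per-word vectors. The sketch below uses made-up 3-dimensional vectors; in practice they would come from a pretrained model such as Word2vec or Glove (100 to 500 dimensions):

```python
# Toy sketch: a class semantic vector as the mean of per-word embeddings.
# The 3-d vectors below are invented for illustration; real embeddings would
# come from a pretrained Word2vec/Glove/Fasttext model.
toy_vectors = {
    "golf":   [0.2, 0.1, 0.7],
    "course": [0.4, 0.3, 0.1],
    "beach":  [0.9, 0.0, 0.2],
}

def class_embedding(class_name, vectors):
    """Average the word vectors over the words of a (possibly multi-word) class name."""
    words = class_name.lower().split()
    dim = len(next(iter(vectors.values())))
    summed = [0.0] * dim
    for w in words:
        for i, v in enumerate(vectors[w]):
            summed[i] += v
    return [s / len(words) for s in summed]

print(class_embedding("golf course", toy_vectors))  # mean of "golf" and "course"
```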

Feature Generation Module
Since there are no samples of unseen categories in the training data, it is difficult to train a classifier for unseen categories. Our method synthesizes the visual image features of unseen classes from class semantic information and noise vectors. We then trained a classifier that can classify unseen class images based on the synthesized image features and their corresponding unseen class labels. Our approach was inspired by [14]. CSPWGAN is built on a generative adversarial network, which converts the zero-shot RSSC problem into a traditional classification problem.
We used the conditional generative adversarial net [32] as our baseline model. The network is composed of a generative model G and a discriminative model D that compete in a two-player minimax game. G generates the visual image representation x̂ from the class semantic feature e(y) and a noise vector sampled from a normal distribution, z ∼ N(0, I). The generation process can be represented as e × z → x̂: we used the semantic feature e(y) and a noise vector z as inputs of G, and the feature x̂ of class label y as the output of G. θ_G denotes the parameters of G. The discriminative model D can be represented as e × x → [0, 1], and θ_D denotes the parameters of D. D aims to accurately distinguish the real image visual feature x from the generated visual feature x̂, while G aims to cheat the discriminator D by generating features that can be mistaken for real ones. Training estimates the parameters θ_D and θ_G.
Our model only uses the seen class data for training, but it can also generate image features of unseen classes. Generative models are usually hard to train and not easy to stabilize, so we used a stable training method called WGAN [28]. The loss function of WGAN is:

L_WGAN = E[D(x, e(y))] − E[D(x̂, e(y))] − λ E[(‖∇_x̃ D(x̃, e(y))‖_2 − 1)^2],

where E[·] denotes the expectation, x̂ = G(z, e(y); θ_G), and x̃ = βx + (1 − β)x̂ with β ∼ U(0, 1). Both G and D are multilayer networks. λ represents the gradient penalty factor; we set λ = 10 in this study. To achieve good optimization, we applied a classification loss, a semantic regression module, and a class-prototype loss to train the network.
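To make the gradient penalty term concrete, the sketch below computes it for a toy linear critic D(v) = w · v, whose gradient with respect to its input is w everywhere, so the penalty has a closed form. This is purely illustrative; in the paper, D is a conditional multilayer network and the gradient is obtained by backpropagation:

```python
import math
import random

# WGAN-GP gradient penalty for a toy linear critic D(v) = w · v.
# A linear critic's input gradient is w everywhere, so the norm in the
# penalty term is just ||w||_2, independent of the interpolation point.
def gradient_penalty(x_real, x_fake, w, lam):
    beta = random.uniform(0.0, 1.0)
    # Interpolated sample: x~ = βx + (1 - β)x̂.
    x_interp = [beta * r + (1.0 - beta) * f for r, f in zip(x_real, x_fake)]
    # For D(v) = w · v the gradient at x_interp is w.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return lam * (grad_norm - 1.0) ** 2

# A critic with ||w||_2 = 1 already satisfies the 1-Lipschitz constraint:
print(gradient_penalty([1.0, 2.0], [0.5, -0.3], [1.0, 0.0], 10.0))  # 0.0
```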
Classification Loss. Because of the complexity of remote sensing images, the WGAN model cannot ensure that the generated samples are discriminative enough, and inaccurate samples result in a poor classifier. To alleviate this issue, we added a classifier that identifies whether the generated image features can be classified into the correct class; in this way, the generated features become more discriminative. The classifier uses the negative log-likelihood:

L_CLS = −E_x̂[log P(y | x̂; θ)],

where x̂ = G(z, e(y); θ_G), y is the class label of x̂, and P(y | x̂; θ) represents the probability that the classifier predicts the true label y for x̂. The classifier parameters θ are pre-trained on the seen classes dataset, which can be represented as:

θ = arg min_θ −E_(x,y)[log P(y | x; θ)].

Semantic Regression Module. We used class semantic features to guide the generator to generate image features for the corresponding class. The generator is trained using seen class semantic features and visual features in the training phase, and in the testing phase it synthesizes unseen visual features from unseen class semantic features. In this way, we converted the zero-shot classification problem into a traditional classification problem. Our model relies on a high-performance generator that generates image features similar to real image features: if the features synthesized from the semantic features of an unseen class differ significantly from the real image feature distribution, they cannot represent the real image features of that class, and a classifier trained on such synthetic features will misclassify real unseen class images. We must therefore ensure that the image features produced by the generator are similar to the real image features, which allows higher recognition accuracy on the unseen categories. Inspired by [16], we used the semantic regression module to constrain the generator.
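As a toy illustration of the two constraints above (made-up numbers, pure Python rather than the actual network code):

```python
import math

# Toy versions of the two constraints: (1) classification loss = negative
# log-likelihood of the true class under a softmax over classifier logits;
# (2) semantic regression loss = squared L2 distance between the regressed
# semantic vector R(x̂) and the true class embedding e(y). Numbers are invented.

def nll_loss(logits, true_class):
    """-log P(y | x̂; θ) with P given by a numerically stable softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_class]

def regression_loss(regressed, semantic):
    """||R(x̂) - e(y)||_2^2 for a single generated feature."""
    return sum((a - b) ** 2 for a, b in zip(regressed, semantic))

print(nll_loss([2.0, 0.5, 0.1], 0))             # small: class 0 dominates
print(regression_loss([0.3, 0.2], [0.3, 0.2]))  # 0.0: perfect semantic match
```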
This can be represented as:

L_R = E[‖R(x̂; θ_R) − e(y)‖_2^2],

where R denotes the regressor and θ_R denotes the parameters of R. θ_R is pre-trained on the seen classes using the following function:

θ_R = arg min_θ_R E[‖R(x; θ_R) − e(y)‖_2^2].

Class-Prototype Loss. In remote sensing images, the features of different sample images in the same class vary greatly, i.e., there is diversity within each class. The classification loss and the semantic regression module cannot guarantee that the image features synthesized by the generator are diverse. To ensure that the synthesized image features are more consistent with the distribution of real remote sensing image features and to avoid generating overly homogeneous features, we introduced a class-prototype loss in training to constrain the generator, inspired by [17]. First, we clustered the real image features of each class into k clusters. We obtained a class prototype by averaging all sample image features in each cluster, yielding k class prototypes for each real remote sensing image class. During training, we encouraged the image features synthesized by the generator to be close to at least one class prototype vector of the same class.
Let X_n^y denote the nth cluster of category y; each category has k clusters. In this study, we set k = 5 for simplicity. Let p_n^y denote the nth prototype vector of category y; each category has k prototype vectors. p_n^y is defined as:

p_n^y = (1 / N_n^y) Σ_{x ∈ X_n^y} x,

where N_n^y denotes the number of samples in cluster X_n^y. For the generated fake features x̂, we define the prototype vectors p̂_n^y in the same way:

p̂_n^y = (1 / N̂_n^y) Σ_{x̂ ∈ X̂_n^y} x̂.

We hope that each sample x̂ generated for category y is close to at least one prototype vector of that category. This loss can be defined as:

L_CP1 = (1/n) Σ_{i=1}^{n} min_{j ∈ {1,…,k}} ‖x̂_i − p_j^y‖_2^2,

where n represents the number of generated samples for category y and k represents the number of prototype vectors per class. We also hope that each prototype vector p̂_n^y of the generated fake features is close to at least one real prototype vector of the same class, which is formulated as:

L_CP2 = (1 / (n_y k)) Σ_y Σ_{n=1}^{k} min_{j ∈ {1,…,k}} ‖p̂_n^y − p_j^y‖_2^2,

where n_y represents the number of categories. Our class-prototype loss can then be expressed as:

L_CP = γ_1 L_CP1 + γ_2 L_CP2.

With this class-prototype loss as a constraint, our model can avoid generating single-view image features and ensures the intra-class diversity of the synthesized image features for remote sensing images. Finally, our model was trained with the following overall loss function:

L = L_WGAN + λ L_CLS + β L_R + γ L_CP,

where λ, β, and γ are hyperparameters of the model that balance the importance of each term.
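The first class-prototype term can be sketched as follows, with toy 2-dimensional features and k = 2 clusters instead of the paper's 2048-dimensional features and k = 5:

```python
# Toy sketch of the first class-prototype term: each generated feature should
# be close to at least one real prototype of its class (min over prototypes).

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Real features of one class, already grouped into k = 2 clusters.
clusters = [[[0.0, 0.0], [0.2, 0.0]], [[1.0, 1.0], [1.0, 0.8]]]
prototypes = [mean(c) for c in clusters]   # one prototype per cluster

def prototype_loss(generated, prototypes):
    """Average over generated features of the min squared distance to a prototype."""
    total = sum(min(sq_dist(g, p) for p in prototypes) for g in generated)
    return total / len(generated)

fake = [[0.1, 0.0], [1.0, 0.9]]            # each fake feature sits on a prototype
print(prototype_loss(fake, prototypes))    # 0.0
```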

Training and Testing
In the training phase, the generator was trained using class semantic features and visual features from the seen categories. We used classification loss, semantic regression module, and class-prototype loss to constrain the generator. We used classification loss to allow the generator to learn to generate discriminative image features. We used the semantic regression module to ensure that the generated image features represent semantic features. Class-prototype loss ensures that the synthetic image features are more consistent with the distribution of real remote sensing image features. The model is trained with Equation (11).
After training is complete, we used the trained generator to synthesize an arbitrary number of image features for each unseen category, obtaining a synthetic training set. In the ZSL task, we can classify the unseen images with a classifier trained on this synthetic dataset. In the GZSL task, we added the synthetic dataset to the seen class dataset and trained a classifier to recognize both seen and unseen class images. The classifier can be a softmax classifier, KNN, etc.; in this study, we used the softmax classifier.
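The test-time idea can be sketched as follows. The paper trains a softmax classifier on the synthesized features; the toy example below instead uses a 1-nearest-neighbor rule on made-up 2-dimensional features and hypothetical class names, purely to convey how real unseen images are matched against synthesized features:

```python
# Toy sketch of test-time classification: real unseen-class features are
# matched against features synthesized by the generator. The paper trains a
# softmax classifier; a 1-nearest-neighbor rule is used here for brevity.
# Features and class names are invented for illustration.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Synthesized features for two hypothetical unseen classes.
synthetic = [
    ([0.1, 0.9], "airport"),
    ([0.2, 1.0], "airport"),
    ([0.9, 0.1], "storage tanks"),
    ([1.0, 0.2], "storage tanks"),
]

def classify(feature, train_set):
    """Assign the label of the nearest synthesized feature."""
    return min(train_set, key=lambda item: sq_dist(feature, item[0]))[1]

print(classify([0.15, 0.95], synthetic))  # airport
```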

Results
In this section, we design experiments to evaluate our method. First, we introduce the remote sensing image datasets used in our experiments. Second, we describe our experimental setup. Third, we explore the effect of different semantic vectors on our method. Fourth, we compare our approach with current state-of-the-art zero-shot RSSC methods. Finally, we perform ablation studies.
Datasets

The UCM21 dataset has 21 classes, and each scene class consists of 100 images of 256 × 256 pixels. The AID30 dataset has 30 categories; each category has 200 to 400 samples, and each sample is 600 × 600 pixels. The NWPU45 dataset has 45 classes, and each scene class consists of 700 images of 256 × 256 pixels. As shown in Figure 2, images within the same class are diverse.

Evaluation Protocols
In the ZSL task, we used the class average accuracy [14] as the performance evaluation metric, i.e., the classification accuracy is first computed within each class, and the class average accuracy is then obtained by averaging over all classes.
In the GZSL task, the harmonic mean accuracy [14] is used to evaluate the classification performance. It is calculated as:

H = (2 × u × s) / (u + s),

where u denotes the accuracy of unseen classes, s denotes the accuracy of seen classes, and H denotes the harmonic mean accuracy.
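The two metrics can be sketched with toy numbers as follows:

```python
# Toy computation of the two evaluation metrics.

def class_average_accuracy(per_class_correct, per_class_total):
    """ZSL metric: mean over classes of the within-class accuracy."""
    accs = [c / t for c, t in zip(per_class_correct, per_class_total)]
    return sum(accs) / len(accs)

def harmonic_mean_accuracy(u, s):
    """GZSL metric: H = 2us / (u + s) for unseen accuracy u and seen accuracy s."""
    return 2 * u * s / (u + s)

print(class_average_accuracy([50, 80], [100, 100]))  # mean of 0.5 and 0.8
print(harmonic_mean_accuracy(0.5, 1.0))              # 2/3: H penalizes imbalance
```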

Implementation Details
We first extract image features and obtain the corresponding class semantic information, and then train the GAN network. We used a ResNet-101 model pre-trained on ImageNet to extract 2048-dimensional features of remote sensing images, and obtained the word vectors of the corresponding classes using Word2vec, Glove, Fasttext, and Bert. We implemented our model with multilayer neural networks. The generator G contains a hidden layer with 4096 nodes activated by LeakyReLU [33]; its output layer has 2048 nodes activated by ReLU [34]. The discriminator D also contains a hidden layer with 4096 nodes activated by LeakyReLU, and its output layer has no activation. The regressor R contains a hidden layer with 2048 nodes, and the dimensionality of its output layer matches the semantic information dimension. We used the Adam [35] optimizer with lr = 0.0001, β_1 = 0.5, and β_2 = 0.999. In this study, we set α = 0.001, β = 10, γ = 1, γ_1 = 0.01, and γ_2 = 0.001. The algorithms were implemented in PyTorch, and the experiments were run on four RTX 2080 GPUs.

Ablations on Different Word Vectors
We used four natural language processing models pre-trained on Wikipedia to obtain word vectors for the remote sensing datasets, and for each method we obtained word vectors of different dimensionalities as class semantic features. To find the word vectors with the best results, we compared the word vectors obtained by these four methods across different dimensionalities. The classification results on the UCM21 dataset are shown in Figure 3. The Word2vec model pre-trained on Wikipedia is available in three dimensionalities, i.e., 100, 300, and 500, which correspond to accuracies of 0.6097, 0.5698, and 0.6266, respectively.
We used Bert in two ways to obtain word vectors, one using class names and the other using class description texts, which correspond to an accuracy of 0.4536 and 0.3978, respectively.
We also explored the effect of different word vectors on GZSL tasks, using Word2vec-500, Glove-100, Fasttext-100, and Bert-768 as semantic information in Table 1. When we used the Word2vec method to extract 500-dimensional word vectors as semantic information for the remote sensing image datasets, we achieved the best results in both the ZSL and GZSL tasks.

Comparison with State-of-the-Art
To show the superiority of CSPWGAN, we compare our method with several previous methods. We chose the following baselines: SSE [10], DMaP [12], SAE [36], ZSL-LP [19], ZSC-SA [20], DASE [21], VSC [37], VSOP [38], f-CLSWGAN [14], CYCLE-WGAN [16], and RBGN [39]. SSE [10] fuses multiple kinds of semantic information into the same semantic space and calculates their similarity. DMaP [12] studies the relationship between the semantic space manifold and the transferability of visual-semantic mapping. SAE [36] learns a semantic autoencoder to solve the label prediction problem, adding a constraint from the visual mapping to the semantic features to alleviate the domain drift problem. ZSL-LP [19] constructs a semantic-directed graph and then uses a label-propagation algorithm for zero-shot classification. ZSC-SA [20] is a zero-shot RSSC method that relies on Sammon embedding and spectral clustering. VSC [37] employs a novel visual structure constraint for ZSL. VSOP [38] matches latent visual and semantic representations in a shared subspace. DASE [21] proposes a distance-constrained semantic autoencoder to handle zero-shot RSSC. RBGN [39] incorporates adversarial attacks and bidirectional generation into GZSL to improve the generalizability and robustness of the model.
For a fair comparison with prior methods, we divided each dataset into four seen/unseen ratios; each ratio was randomly split 25 times, and 25 zero-shot classification experiments were performed, with the class average accuracy taken as the result. Regarding the split ratios of the remote sensing datasets, our method is consistent with previous zero-shot classification methods for remote sensing images such as DASE, ZSC-SA, and ZSL-LP. Tables 2-4 show the classification results for the UCM21, AID30, and NWPU45 datasets. As shown in Tables 2-4, our method significantly improved the classification accuracy compared with the state-of-the-art (SOTA) approaches for zero-shot RSSC in most cases. On the UCM21 dataset, our method improved accuracy by 4.03%, 8.69%, 9.58%, and 5.99% under the seen/unseen ratios 16/5, 13/8, 10/11, and 7/14, respectively. On the AID30 dataset, our method improved accuracy by 2.37%, 2.61%, 1.05%, and 1.78% under the seen/unseen ratios 25/5, 20/10, 15/15, and 10/20, respectively. On the NWPU45 dataset, our method improved accuracy by 0.24% and 1.45% under the seen/unseen ratios 25/20 and 20/25, respectively. The standard deviation of our method is smaller than that of DASE in most cases, which further demonstrates its superiority. In conclusion, the results show that our method is better adapted to remote sensing images, and it achieves larger gains when the ratio of unseen classes is higher.

To compare the classification performance for each unseen category, we fixed the unseen classes for a given split ratio according to the DASE setting and present the confusion matrices of our approach and the DASE method. Notably, each column of the matrix represents the instances of a real class, and each row represents the instances of a predicted class.
On the UCM21 dataset, there are five unseen classes with the seen/unseen ratio set to 16/5, i.e., "freeway", "golf course", "intersection", "medium residential", and "storage tanks". Figure 4 shows the confusion matrices of the DASE method and our method on the UCM21 dataset. We obtained 27% and 22% improvements on the classes "freeway" and "storage tanks", respectively. Our method shows a decrease in classification accuracy on the other unseen classes, but its average accuracy over the five unseen classes is higher than DASE, improving from 63.6% to 68.8%. On the AID30 dataset, there are five unseen classes with the seen/unseen ratio set to 25/5, i.e., "dense residential", "desert", "forest", "industrial", and "pond". Figure 5 shows the confusion matrices of the DASE method and our method on the AID30 dataset. We achieved 31% and 80% improvements on the classes "dense residential" and "forest", respectively, and the average accuracy of our method improved from 64.8% to 72.4% on the five unseen categories. On the NWPU45 dataset, there are ten unseen classes with the seen/unseen ratio set to 35/10, i.e., "airport", "basketball court", "circular farmland", "cloud", "dense residential", "desert", "harbor", "intersection", "medium residential", and "sparse residential". Figure 6 shows the confusion matrices of the DASE method and our method on the NWPU45 dataset. We achieved 2%, 5%, 60%, 16%, and 1% improvements on the classes "airport", "circular farmland", "dense residential", "medium residential", and "sparse residential", respectively.

Ablation Studies
Our approach is based on generative adversarial networks. Following the characteristics of remote sensing images, we used the classification loss, class-prototype loss, and semantic regression loss to constrain the generator and ensure that we obtain a high-performance generator. We constructed ablation experiments to understand our model further and evaluate the effect of each term. Setting λ = 0 in (12) gives our model without the classification loss term, referred to as CSPWGAN-λ; setting β = 0 in (12) gives our model without the semantic regression term, referred to as CSPWGAN-β; setting γ = 0 in (12) gives our model without the class-prototype loss term, referred to as CSPWGAN-γ. We executed ablation experiments on the UCM21, AID30, and NWPU45 datasets; Table 5 shows the results. On the UCM21 dataset, when we removed the classification loss term, the model accuracy decreased by 2.82%; when we removed the semantic regression term, the model accuracy decreased by 2.23%; when we removed the class-prototype loss term, the model accuracy decreased by 2.99%. The results show the effectiveness of all three terms for zero-shot RSSC, and we reach the same conclusion on the AID30 and NWPU45 datasets.

Conclusions
This study presents a novel method for zero-shot remote sensing scene classification (RSSC) named CSPWGAN. We are the first to apply generative adversarial networks to zero-shot RSSC. Since remote sensing image datasets do not directly provide class attribute information, we used four natural language processing models pre-trained on Wikipedia to obtain word vectors as the class semantic information we need. We used generative adversarial networks to train a generator that can generate class image features from class semantic features, converting the zero-shot RSSC problem into a traditional classification problem. Following the characteristics of remote sensing images, we used the classification loss, semantic regression module, and class-prototype loss to constrain the generator. The classification loss preserves inter-class discrimination. The semantic regression module ensures that the image features generated by the generator can represent the semantic features. The class-prototype loss constrains the generator during training to ensure intra-class diversity of the synthesized image features and avoid generating overly homogeneous image features. We conducted experiments on three benchmark datasets, and the results demonstrate the superiority of our proposed CSPWGAN method for remote sensing images; our method works particularly well with a high ratio of unseen classes. In the experiments, we found that the class semantic information significantly affects the classification results. In future work, we will try to obtain better class semantic vectors and use better approaches to reduce the classification standard deviation and improve the classification accuracy for remote sensing images.
In future work, we can also try to use active learning [41] to address the problem of unseen remote sensing image classes in the training data. The key idea of active learning (AL) is that if a machine-learning algorithm is allowed to select the data it learns from, it can achieve higher accuracy with fewer labeled training instances. Active learning selects the most informative samples from a large pool of unlabeled data, hands them over to an expert for labeling, and then uses those samples to train the model and improve its accuracy. Many excellent methods using active learning have emerged in remote sensing and deep learning, such as [42][43][44]. Our approach uses semantic and visual information from all image classes; class semantic information builds a bridge of knowledge transfer between seen and unseen categories, breaking the boundary of category mutual exclusion between training and testing sets. Active learning and our model are thus two different ideas for handling unseen classes in the training data.

Conflicts of Interest:
The authors declare no conflict of interest.