Remote sensing has been a standard tool for producing land use and land cover maps for decades [1]. These products play an important role in various applications, such as natural disaster management [2], environmental monitoring [4], urban planning [6], precision farming [8] and vegetation monitoring [12]. With advances in sensor engineering, a deluge of high-resolution image data has been generated using various sensing technologies [13]. However, processing such a huge amount of data requires extensive training samples and advanced image-processing techniques, which are not always accessible [14]. One approach to overcome this problem is the use of classification models based on Zero-Shot Learning (ZSL). ZSL constructs classification models for unseen classes, that is, classes for which no labelled training samples are available. It uses a semantic representation of classes (e.g., attributes, word vectors) as side information and transfers knowledge from source (seen) classes to unseen classes. ZSL is becoming popular among researchers for both classification and object recognition problems, and has recently received attention from the remote sensing community.
Before ZSL, a wide range of object recognition and classification models were proposed and implemented, such as pixel-based and object-based approaches. Nevertheless, these common classification models can only classify objects that are seen during the training stage. Methods such as object-based image analysis (OBIA) classify well, but they fail to detect and classify new (unseen) objects. In addition, they have limitations such as the difficulty of finding an ideal segmentation scale, a complex workflow, and the need for careful optimisation [15]. Convolutional Neural Networks (CNNs) have also been shown to be successful for land cover classification and other related applications [20]. Zhang et al. [20] used CNNs for high-resolution image classification with different feature learning strategies. Sumbul et al. [21] presented a CNN model to classify high-resolution aerial photographs for land cover classification. Al-Najjar et al. [1] used a CNN for land use mapping from drone images. Chen et al. [22] detected various mangrove forest species with a patch-based CNN and found that the CNN was superior to a support vector machine (SVM). However, these models are only successful if the test images do not contain classes that were unseen in the training stage [14]. Other limitations of these methods include over-fitting and the need for a large number of labelled samples to train the models efficiently. Given that sufficient labelled training samples are expensive to collect, these traditional classification models remain inefficient for many real-world applications, including remote sensing image classification.
To overcome these limitations, the present study proposes a classification framework based on ZSL for unseen land cover classes from high-resolution orthophotos. ZSL uses the semantic word vector of each class name as a bridge between seen and unseen classes: a model learned from known samples outputs a vector in the semantic space, and this vector is matched to the nearest class name, which may belong to an unknown (unseen) class. This technique has recently been applied to a few remote sensing applications such as scene classification [14], street tree classification [21], vehicle recognition [22], and land cover mapping [23]. Further details are given in the following section on related studies.
The current framework consists of (i) a Word2Vec model [25] for label embedding, (ii) a CNN model for image feature learning, and (iii) a K-nearest neighbour (KNN) model that provides label predictions for unseen classes that are not included in the training data.
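The label-prediction step of such a framework can be illustrated with a minimal sketch. It assumes that an image model has already produced a vector in the label-embedding space and that class names are ranked by cosine similarity; the distance measure, the function names and the toy 4-dimensional vectors (standing in for Word2Vec embeddings) are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def predict_labels(predicted_vector, class_names, class_vectors, k=3):
    """Return the k class names whose embeddings are nearest to the prediction."""
    scores = cosine_similarity(predicted_vector, class_vectors)
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return [class_names[i] for i in order]

# Toy stand-ins for label embeddings (hypothetical values, not real Word2Vec output).
class_names = ["forest", "cropland", "water", "road"]
class_vectors = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.7, 0.6, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.2],
    [0.1, 0.0, 0.2, 0.9],
])

# Stand-in for the vector an image model would output for one image patch.
predicted_vector = np.array([0.1, 0.1, 0.8, 0.3])
top3 = predict_labels(predicted_vector, class_names, class_vectors, k=3)
print(top3)
```

Because the candidate list is simply the set of available class-name vectors, an unseen class can be predicted as long as its name has an embedding, even though no image of that class was used for training.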
In summary, although numerous ZSL works have been carried out in the field of computer vision, only a few have been implemented in the remote sensing domain. Therefore, the main contribution of this research is to integrate recent computer vision techniques, namely ZSL, Word2Vec models and CNNs, to address an existing problem of traditional classification systems in remote sensing: the lack of sufficient training data for every class. The remainder of this article is organised as follows: Section 2 summarises the related work. Section 3 describes the proposed framework and methods. Section 4 presents the experiments and results. Section 5 provides the discussion. Lastly, Section 6 presents the conclusions and future work.
Typical classification models in remote sensing can only classify objects that are seen during the training stage; they fail on objects that first appear in the testing stage. Unseen object classification is a challenging topic, and many studies have attempted to develop models to address it. For example, in recent years, ZSL has been widely implemented in computer science due to its potential to identify unseen objects without training samples, with the assistance of semantic information [14]. These models extract abstract features from the image pixels, while the semantic information is often retrieved from the class labels as vectors using models like Word2Vec. Unseen object classification is a hot topic, especially for the remote sensing field, due to its data variety and scalability [20]. To the best of our knowledge, these models are rarely applied to geoscience applications, especially for high-resolution land cover classification from aerial photos.
Therefore, this paper presents a ZSL framework based on CNN and Word2Vec for unseen land cover mapping. The CNN was found to be a robust feature extraction technique that achieved relatively high accuracies on the training dataset (0.953 F1-score, 0.941 precision, 0.882 recall), the first test dataset (0.904 F1-score, 0.869 precision, 0.949 recall) and the second test dataset (0.898 F1-score, 0.870 precision, 0.838 recall). To assess the robustness of the network, various cases and models were tested. Adding Gaussian noise to the input samples did not improve the networks. This suggests that the training strategies used in this experiment were adequate; in addition, no over-fitting effects were observed. In the feature extraction phase, the best performance was recorded when the pooling layer was not included in the architecture. Therefore, we hypothesised that the dimensionality or complexity of the dataset was not an obstacle to training the network. Nevertheless, the CNN with batch normalisation and pooling layers had accuracy comparable to the best case (without pooling). Given that feature extraction is a separate step in our ZSL, the proposed framework is flexible and can be further improved and customised for other applications by replacing the proposed CNN model with other deep-learning methods, depending on the new problem.
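The Gaussian-noise test mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the images are taken to be scaled to the range [0, 1], and the noise level `sigma`, the function name and the toy batch are hypothetical rather than the settings used in the experiments.

```python
import numpy as np

def add_gaussian_noise(images, sigma=0.05, seed=0):
    """Add zero-mean Gaussian noise to images assumed to lie in [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixel values in the valid range

# Toy batch: 2 patches of 8 x 8 pixels with 3 bands (hypothetical shape).
batch = np.full((2, 8, 8, 3), 0.5)
noisy_batch = add_gaussian_noise(batch, sigma=0.05)
```

Training on the perturbed samples and comparing the scores with those of the clean run is one way to check whether a network is sensitive to small input perturbations.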
In the second phase, Word2Vec with 300 dimensions was used as the word embedding to gather class attributes. Averaged over the different unseen classes, our ZSL approach obtained 0.798 F1-score, 0.766 recall, 0.838 precision, 0.778 top-one, 0.890 top-two and 0.942 top-three accuracy on the first test area, and 0.729 F1-score, 0.676 recall, 0.790 precision, 0.737 top-one, 0.906 top-two and 0.924 top-three accuracy on the second test area; however, a standard terminology or word model built exclusively for the remote sensing domain might further improve the results. Moreover, the difference in distance structure between the word vector space and the visual feature space of remote sensing images seriously affects the performance and efficiency of ZSL image classification [47]. Thus, embedding attributes designed specifically for remote sensing data could positively affect the model’s performance.
In a previous generalised ZSL application to land cover classification with 8 m PolSAR images, an overall accuracy of 73% was reported for unseen/seen class ratios of 1/3–3/3; however, that study also reported that the SUN database lacked exact categories with attributes for rural areas, wetland, and agricultural land [24]. In another application utilising label propagation and label refinement approaches [14], the overall accuracy was reported as 58% and 70.4% for unseen/seen ratios of 5/16 and 1/7, respectively. In another work on street tree detection from aerial images [21], CNN structures were employed and an accuracy of 14.3% for 16 tree classes was reported. While overall accuracy was the main performance metric used in previous remote sensing studies on zero-shot unseen land cover mapping, we further evaluated the robustness of the current framework under an imbalanced class distribution via six evaluation metrics, namely F1-score, precision, recall and top-k categorical accuracy for k = 1, 2, 3, with a 1/6 unseen/seen ratio. Table 13 presents a comparison of related studies in the remote sensing domain that have a similar scope to the current study.
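The six metrics named above can be made concrete with a small sketch. The score matrix, labels and function names below are toy assumptions for illustration only; they are not data or code from this study, and precision, recall and F1-score are computed for a single class treated as the positive class.

```python
import numpy as np

def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_labels[:, None]).any(axis=1)
    return hits.mean()

def precision_recall_f1(true_labels, predicted_labels, positive_class):
    """Precision, recall and F1-score for one class treated as positive."""
    tp = np.sum((predicted_labels == positive_class) & (true_labels == positive_class))
    fp = np.sum((predicted_labels == positive_class) & (true_labels != positive_class))
    fn = np.sum((predicted_labels != positive_class) & (true_labels == positive_class))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy class scores for 4 samples and 3 classes (hypothetical values).
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])
true_labels = np.array([0, 2, 2, 2])
predicted_labels = scores.argmax(axis=1)

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
top3 = top_k_accuracy(scores, true_labels, k=3)
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, positive_class=2)
```

In this toy case the top-one accuracy is 0.5 but the top-two accuracy is 0.75, which shows why top-k scores are informative when a classifier narrows an unseen class down to a few candidates without ranking it first.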
Although ZSL models achieve relatively high accuracy in computer vision, they have not been fully explored in remote sensing applications. Progress in this direction requires dedicated word embedding models trained on geospatial data, which could open new research areas for ZSL and for other fields that require word embedding.
The most influential step in the proposed ZSL framework is the feature extractor, which is based on CNN models. It has shown promising results that could be extended to other geospatial applications. Accordingly, more robust CNN architectures, such as graph CNNs and capsule-based CNNs, can help to improve the accuracy of the ZSL classification. A second model (CNN) is required to perform class signature prediction. This model can also be replaced with other machine learning methods, such as SVMs or random forests. The class label is predicted by comparing the output vector with the vectors of all classes and selecting the nearest one. Nevertheless, CNNs and deep learning models have limitations: they often require a large number of training samples, and their architectures contain an enormous number of tuneable parameters [36]. Another limitation is that the classifiers are usually employed as black boxes. To address these drawbacks, tensor-based learning models (tensor algebraic operations) [48] could provide promising solutions: they reduce the number of weight parameters (especially for high-dimensional/noisy data), allow physical interpretation and preserve the spatial and spectral coherency of input samples. This technique can generally be applied to a CNN structure by replacing fully-connected layers with tensor contractions [49].
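The parameter saving offered by a tensor contraction can be illustrated with a minimal sketch. It assumes a single activation tensor of shape (C, H, W) that is contracted along each mode with a small factor matrix; the shapes, variable names and random weights are hypothetical and do not describe any network used in this study.

```python
import numpy as np

def tensor_contraction_layer(x, w_c, w_h, w_w):
    """Contract each mode of a (C, H, W) tensor with its own factor matrix."""
    # Output shape is (c, h, w): one reduced size per input mode.
    return np.einsum("chw,ac,bh,dw->abd", x, w_c, w_h, w_w)

rng = np.random.default_rng(0)
C, H, W = 64, 8, 8   # input activation shape (assumed)
c, h, w = 16, 4, 4   # reduced output shape (assumed)

x = rng.normal(size=(C, H, W))
w_c = rng.normal(size=(c, C))
w_h = rng.normal(size=(h, H))
w_w = rng.normal(size=(w, W))

y = tensor_contraction_layer(x, w_c, w_h, w_w)

# Weights needed by the contraction versus a dense layer on the flattened tensor.
tcl_params = c * C + h * H + w * W   # one factor matrix per mode
fc_params = (C * H * W) * (c * h * w)  # flattened input size times flattened output size
print(y.shape, tcl_params, fc_params)
```

With these assumed shapes the contraction needs 1,088 weights, whereas a fully-connected layer mapping the flattened input to an output of the same size would need 1,048,576. The contraction also keeps the channel and spatial modes separate instead of flattening them.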
In future works, several subjects should additionally be considered. First, more efficient semantic data for land cover mapping should be investigated across different remote sensing data, including orthophotos, multispectral/hyperspectral images, synthetic aperture radar (SAR), and other remote sensing products. Second, the potential of different semantic models for land cover mapping, assisted by diverse machine learning methods, needs to be explored further. Third, there is no specific standard or agreed-upon ZSL benchmark in the geospatial field [27]. Thus, the potential application of ZSL to a variety of popular applications, such as change detection, land use/land cover mapping and landslide type detection, can be explored.
In this research, although some confusion remains among classes (e.g., croplands, forest lands and agricultural lands), the overall performance is still satisfactory. This confusion could be attributed to the similarity of the spectral and spatial properties of these classes. In such cases, using data with additional spectral bands (e.g., hyperspectral) could improve the result. In addition, establishing standard embedding attributes specifically for remote sensing data is expected to reduce such confusion among classes.
Overall, applying the adopted framework to the second study area showed that our ZSL scheme produced relatively similar classification results. Most of the unseen classes in the different scenarios were detected well, whereas detecting unseen classes is impossible with traditional classification methods because the training data contain no samples of those classes. We applied various CNN models, techniques and sensitivity analyses (e.g., additive Gaussian noise) to our networks to verify their robustness using two different datasets. The experiments on both datasets were consistent regarding the framework’s performance.
ZSL is an important, recently developed concept with the potential to solve classification problems that lack sufficient training data for every class. In this paper, we present a ZSL framework for land cover mapping using orthophotos. The framework is built on CNN and Word2Vec. The former is applied for the feature extraction process, whereas the latter is used to learn class attributes from the class labels.
The proposed models and the framework are tested on two subset datasets from Malaysia: the Cameron Highlands (first dataset) and the Ipoh area (second dataset). The results show that the proposed feature extraction model achieves high accuracies on the training data of the first dataset (0.953 F1-score, 0.941 precision, 0.882 recall), the first test dataset (0.904 F1-score, 0.869 precision, 0.949 recall) and the second test dataset (0.898 F1-score, 0.870 precision, 0.838 recall). Averaged over the different unseen classes, the ZSL model achieves 0.778 top-one, 0.890 top-two and 0.942 top-three accuracy, 0.798 F1-score, 0.766 recall and 0.838 precision on the test area of the first dataset, and 0.737 top-one, 0.906 top-two and 0.924 top-three accuracy, 0.729 F1-score, 0.676 recall and 0.790 precision on the second dataset. This outcome could help experts in the remote sensing field to recognise the correct class among two or three candidate classes, especially when those classes are not included in the training set.
Transforming remote sensing imagery into a new embedding and using it to predict seen and unseen classes could be a useful approach to ZSL for remote sensing data. Future developments will aim to make the proposed framework more efficient at learning to predict unseen classes by using novel Word2Vec models built specifically for remote sensing applications and various types of CNN models, including residual and graph CNNs. Moreover, equal numbers of samples for the seen and unseen classes will be considered in further assessment.