Unseen Land Cover Classification from High-Resolution Orthophotos Using Integration of Zero-Shot Learning and Convolutional Neural Networks

Abstract: Zero-shot learning (ZSL) is an approach for classifying objects unseen during the training phase and has been shown to be useful for real-world applications, especially when sufficient training data are lacking. Only a limited amount of

domain embeddings. Considering the good performance of their method, they suggested using it for person recognition. In another study, the authors in [19] combined image matching, object detection, image retrieval and ZSL to overcome the semantic matching problem. They utilised the ILSVRC 2014 dataset for a single-shot semantic matcher with a CNN architecture based on GoogleNet and YOLO/DetectNet. They concluded that their semantic matcher approach is beneficial for real-time multi-class object recognition.
Despite the success of previous ZSL models on standard datasets, most of these models performed prediction in the absence of seen labels at the testing stage [24,30]. In these ZSL models, the unseen-class problem was narrowed exclusively to the test labels, whereas problem settings in which both train and test labels occur at test time, called transductive or generalised ZSL (GZSL), are more challenging [24]. Additional details on this approach, including its advantages and drawbacks, can be found in [15].

Transductive or GZSL
ZSL assumes that the testing samples come only from the unseen classes. This assumption is invalid in the real world because samples from seen classes can also be present in the testing set [15]. GZSL, by contrast, can classify testing samples into either seen or unseen classes [15,30]. GZSL is more complicated than ZSL because knowledge must be transferred from the source domain to the target domain and, in addition, the model must differentiate between seen and unseen classes. This is challenging because the majority of testing samples tend to be assigned to a seen class rather than to their true unseen class. ZSL techniques can be used in GZSL, but with considerably lower accuracy [31]. Few studies have established a standard guideline for GZSL [32].

ZSL on Remote Sensing Data
Although notable developments of ZSL models have been published for various computer vision tasks, these models are rarely applied to remote sensing applications. The authors in [14] developed a novel zero-shot scene classification approach to recognise images from unseen scene classes. They used the Word2Vec model to map the labels of seen/unseen scene classes to semantic vectors describing the relationships between seen and unseen classes. Then, to transfer knowledge from seen to unseen classes, they used a label-propagation algorithm incorporating a semantic-directed graph and an unsupervised domain adaptation model. They applied the method on the UC Merced benchmark dataset and remote sensing data. The authors in [23] proposed the integration of ZSL into a dual-memory LSTM framework for land cover prediction. The method contains two memories that capture both long-term and short-term variation patterns, which can effectively resolve the temporal variation. In another ZSL remote sensing application [33], synthetic aperture radar (SAR) was employed for target recognition with a ratio of unseen to seen classes of 1/7. The authors in [21] performed street tree classification using a multi-source region attention network for fine-grained object recognition, tested on RGB, multispectral (MS) and LiDAR (light detection and ranging) data. The best performance of 17.7% was obtained by the proposed model trained on RGB and MS data; more complex models that used all three sources performed slightly worse owing to the limited number of training samples. Chen et al. [22] utilised generalised ZSL for vehicle detection via a coarse-to-fine framework with latent attributes using the ISPRS Potsdam 2D Semantic data. Their method showed the effectiveness of GZSL for unseen vehicles.
Unseen object classification remains challenging, especially for remote sensing applications, because remote sensing data have particularly diverse structures and huge volumes compared to the data generally used in other fields such as computer science [20,34]. Therefore, in this study, a classification framework based on ZSL is proposed to detect and classify unseen objects by utilising seen classes and the success of CNN for feature learning.

Methods
ZSL enables land cover mapping in areas that contain novel (unseen) objects, which is useful in real-world applications. This research proposes a framework for ZSL-based land cover classification using orthophotos. The framework employs three main techniques, namely Word2Vec, CNNs and KNN, which are used respectively for class embedding, feature extraction, and classification of unseen objects based on their semantic similarity.

Theory of ZSL
ZSL is a form of transfer learning, specifically heterogeneous transfer learning [15]. The main goal of ZSL is to perform a task without using any sample of that task in the training data. For example, land cover classification in areas that contain novel classes not included in the training data can be regarded as an instance of ZSL. In simple terms, ZSL allows classifying and recognising unseen objects. ZSL involves seen and unseen classes, which refer to labelled training and unlabelled testing samples, respectively. Each sample is assumed to belong to one class and is represented by a vector in the feature space, which is generally a real number space. S = {c_i^s | i = 1, ..., N_s} represents the group of seen classes, where each c_i^s is a seen class. U = {c_i^u | i = 1, ..., N_u} denotes the group of unseen classes, where each c_i^u is an unseen class. Note that S ∩ U = ∅.
𝒳 indicates the feature space, which is a real number space R^d. G = {(x_i, y_i) ∈ 𝒳 × S} represents the group of labelled training data for the seen classes; for each labelled sample (x_i, y_i), x_i is the sample in the feature space and y_i is the corresponding class label. X_test = {x_i ∈ 𝒳} denotes the group of testing samples, where each x_i is a testing sample in the feature space, and Y_test = {y_i ∈ U} indicates the corresponding class labels for X_test, which are to be predicted. ZSL aims to learn the zero-shot classifier f(·): 𝒳 → U, which categorises the testing samples X_test (i.e., predicts Y_test) into the unseen classes U. Thus, both training and zero-shot classes exist, but no samples from the zero-shot classes are used during training.
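This formulation can be illustrated with a toy sketch that also contrasts the ZSL and GZSL search spaces. The 3-d "semantic vectors", class names and nearest-neighbour rule below are illustrative assumptions, not the study's models:

```python
# Toy contrast between ZSL and GZSL prediction spaces: seen classes S,
# unseen classes U, with S ∩ U = ∅. Vectors are made up for illustration.
import numpy as np

seen = {"forest": np.array([0.9, 0.1, 0.0]),
        "urban area": np.array([0.1, 0.9, 0.2])}
unseen = {"croplands": np.array([0.6, 0.4, 0.1])}
assert not set(seen) & set(unseen)  # S ∩ U = ∅

def classify(x, candidates):
    """Nearest candidate class in the semantic space (Euclidean distance)."""
    return min(candidates, key=lambda c: np.linalg.norm(candidates[c] - x))

x = np.array([0.65, 0.35, 0.1])              # a testing sample's projection
zsl_pred = classify(x, unseen)               # ZSL: search U only
gzsl_pred = classify(x, {**seen, **unseen})  # GZSL: search S ∪ U
```

ZSL restricts the candidate set to U, whereas GZSL must pick among S ∪ U, which is why seen classes can crowd out the true unseen class.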

Theory of CNN
CNN is a special type of neural network designed for image (or array-like) data under the concept of convolution, and it has shown significant success in the field of computer vision and, recently, in the remote sensing domain. CNN was first introduced in [35] and was improved further by other researchers through advances in computing and software technologies. CNN utilises local connections, shared weights and a wide range of computing layers [36]. Consequently, CNN can efficiently extract features from the input image data without considerable human intervention. In comparison with feed-forward neural networks and other machine learning models, CNN has shown strong predictive capability provided that adequate training samples are available. CNN is also computationally efficient because it can compute convolutions in parallel on multiple GPU cores [36].
CNN consists of a series of convolutional and pooling layers, followed by a classification layer (e.g., softmax). Other layers, such as dense (fully connected), dropout and batch-normalisation layers, can optionally be added to the model to improve its generalisation and predictive capacity. However, adding these layers does not necessarily improve the model's predictions unless proper optimisation is considered. The main component of CNN is the convolutional layer, which is composed of several convolutional kernels. Each kernel is associated with a small area of the input data, also known as an image patch. If I_{x,y} is the given image, then the convolution is performed through F_k^l = I_{x,y} · K_k^l, where x, y denote the spatial locality and K_k^l represents the l-th convolutional kernel of the k-th layer. Pooling is another important component of CNN, which summarises information across adjacent spatial regions. For each patch in the feature map, pooling calculates the average (or another function, e.g., max or min) value. Global pooling can be performed to reduce the entire feature map to a single value. Pooling is expected to improve model invariance to local translation and to reduce the number of model parameters. These layers are calculated using Z^l = f_p(F_{x,y}^l), where Z^l represents the l-th output feature map, F_{x,y}^l is the l-th input feature map and f_p(·) defines the type of pooling operation [1,37].
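As a minimal sketch (not the authors' implementation), the convolution F_k^l = I_{x,y} · K_k^l and the pooling step Z^l = f_p(F_{x,y}^l) can be written in NumPy; the toy image, kernel and pooling window are assumptions:

```python
# NumPy sketch of a single-channel "valid" convolution and max pooling.
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output is a product-sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Summarise adjacent spatial regions by their maximum (f_p = max)."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = feature_map[y*size:(y+1)*size,
                                    x*size:(x+1)*size].max()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "patch"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # toy 2x2 kernel
fmap = conv2d_valid(image, kernel)                # 3x3 feature map
pooled = max_pool(fmap, size=2)                   # pooled summary
```

In a real CNN the kernels are learned during training rather than fixed as here.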

KNN Model
K-nearest neighbour (KNN) is one of the simplest classification algorithms: it stores all available cases and is often the first choice for a classification task when there is little or no prior knowledge about the distribution of the data. It is simple, makes no assumptions about the data, is relatively accurate and is easy to implement, which makes it attractive compared with other approaches. Moreover, in ZSL studies, the interpretation is generally performed based on a nearest neighbour scheme [15,24]. Therefore, KNN was used in the current study owing to the aforementioned advantages.
KNN search is performed in the embedding space to match the projection of an image feature vector against that of an unseen class. In KNN, the output is membership of a certain class: an object is categorised by a majority vote of its neighbours and assigned to the most common class among its K nearest neighbours, where K is a positive integer, often small [38,39].
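A minimal sketch of this majority-vote rule, with toy embeddings and labels as stand-ins for the study's embedding space:

```python
# Majority-vote KNN over Euclidean distances in a toy embedding space.
import numpy as np
from collections import Counter

def knn_predict(query, samples, labels, k=3):
    """Return the most common label among the k nearest samples."""
    dists = np.linalg.norm(samples - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

samples = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                    [5.0, 5.0], [5.1, 4.9]])
labels = ["forest", "forest", "forest", "urban area", "urban area"]
pred = knn_predict(np.array([0.05, 0.05]), samples, labels, k=3)
```

With k = 3, the query's three nearest neighbours are all "forest", so the vote is unanimous.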

Class Embedding
In ZSL, the label of each class is a text (word or phrase). The class labels are assumed to provide semantic side information about the classes. This side information is exploited by learning semantic (numeric) vector representations of the class labels. These semantic representations then play a key role in transferring knowledge from the seen to the unseen classes. For this purpose, we use the widely used Word2Vec model published by Google. Word2Vec is an efficient model for learning semantic vector representations of words and phrases: similar words or phrases are embedded as nearby vectors, whereas dissimilar words or phrases are embedded as distant vectors.
In this research, we use the pre-trained Word2Vec model published by Google to create vector representations for class labels. It contains 300-dimensional vector representations for around 3 million terms and phrases. Since this pre-trained model is case-sensitive, we cast the text labels to lower case. For single-word labels, we use the semantic word vector directly as the class label embedding. For multi-word labels, the embedding of a class label is obtained by calculating the mean of the individual semantic term vectors. For more information about this model and recent work on label embedding, see [40].
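The single-word and multi-word embedding rule can be sketched as follows; the 4-dimensional vectors below are made-up stand-ins for Google's 300-dimensional Word2Vec vectors:

```python
# Class-label embedding: single-word labels take the word vector directly,
# multi-word labels take the mean of their word vectors. Vectors are toy.
import numpy as np

word_vecs = {
    "forest":       np.array([0.9, 0.1, 0.0, 0.2]),
    "agricultural": np.array([0.4, 0.8, 0.1, 0.0]),
    "land":         np.array([0.2, 0.6, 0.3, 0.1]),
}

def embed_label(label, vecs):
    """Mean of the word vectors of a (lower-cased) class label."""
    words = label.lower().split()
    return np.mean([vecs[w] for w in words], axis=0)

e1 = embed_label("forest", word_vecs)             # single word: the vector itself
e2 = embed_label("Agricultural Land", word_vecs)  # multi-word: mean of two vectors
```

With the real model, the `word_vecs` lookup would be replaced by the pre-trained 300-d embeddings (e.g., loaded via a library such as gensim).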

Overall Workflow
The overall workflow of the proposed ZSL framework for land cover mapping from orthophotos is shown in Figure 1. In the training phase, the training image with training classes (e.g., agricultural land, urban areas, barren land, forest lands, road and water body) is fed into CNN-1 for feature extraction; the output of this step is the image vectors. The training classes are fed into the Word2Vec model to convert the class labels into vectors. These word vectors were trained on the Google News data prepared by Google and have 300 dimensions [25]. The output of this step is a class signature of size 300 multiplied by the number of training classes.
Then, CNN-2 is used to map the image vectors to the class signature. In the test area, the image is first fed into the trained CNN-1 to extract the image features. The ZSL classes are then fed into another Word2Vec model (which acts here as an auxiliary label bank) to extract the class signature. To avoid predicting labels for unseen classes that are not relevant to land cover applications, we used the KNN model to find labels from a label bank created for the purpose of land cover mapping; it contains only labels suitable for land cover, e.g., agricultural land, barren land, road, forest and urban area. Thereafter, CNN-2, which is trained on the training dataset, is used along with the image features to predict the class signature from the label bank [41]. Once the class signature is predicted, the class labels are determined by the KNN algorithm by measuring the similarity between the predicted class signature and the class signatures pre-computed from the label bank for all possible class labels. Lastly, the predicted class labels are used to produce the land cover map for the area.
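The matching stage of this workflow, where a predicted class signature is compared against the pre-computed signatures in the land-cover label bank (K = 1), can be sketched as follows; the 3-d signatures and the Euclidean distance choice are illustrative assumptions:

```python
# Nearest-label lookup over a land-cover label bank. Restricting the bank
# to land-cover terms keeps irrelevant unseen labels out of the prediction.
import numpy as np

label_bank = {
    "agricultural land": np.array([0.3, 0.7, 0.05]),
    "barren land":       np.array([0.1, 0.2, 0.9]),
    "forest":            np.array([0.9, 0.1, 0.0]),
}

def nearest_label(pred_signature, bank):
    """Return the bank label whose signature is closest to the prediction."""
    return min(bank, key=lambda lbl: np.linalg.norm(bank[lbl] - pred_signature))

# A hypothetical CNN-2 output close to the "forest" signature:
pred = np.array([0.85, 0.15, 0.05])
label = nearest_label(pred, label_bank)
```

In the full framework this lookup is run per pixel, and the resulting labels form the land cover map.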

Details of CNN-1 Used for Image Feature Extraction
In this research, CNN-1 is designed to extract features from input orthophotos. It is trained on image pixels (patches) with the corresponding pixel-level class labels (from the ground truth data). CNN is used rather than other machine learning models because of its capability to utilise spectral and spatial information effectively [20]. The spectral content is obtained from the three visible bands (RGB) of the orthophotos, whereas the spatial content is obtained by considering neighbouring pixels [42]. The weights in CNNs are adjusted during training to produce learned filters, the counterpart of handcrafted filters, whose outputs during the feed-forward stage are the feature maps. After conducting several experiments (explained in detail in the experimental set-up of CNN-1 section), the best model was selected: a single convolutional layer (Conv2D) with a kernel size of 3 × 3 and 64 filters. After Conv2D, we used batch normalisation with ReLU activation to accelerate convergence and introduce regularisation within the network. A dropout of 0.3 is utilised to regularise the network further. Then, a flatten layer converts the features from 2D to 1D shape, and a dense layer with 32 units and ReLU activation is added on top of the dropout layer. Finally, a softmax layer performs the classification. We train the model with the Adam optimiser (initial learning rate = 0.001) with a batch size of 2048 for 100 iterations.
The CNN-1 model aims to extract image features from a given image pixel. Figure 2 presents the architecture of the model.
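A hedged tf.keras sketch of the CNN-1 configuration described above (Conv2D 3 × 3 with 64 filters, batch normalisation with ReLU, dropout 0.3, flatten, dense 32 with ReLU, softmax, Adam at learning rate 0.001). The 5 × 5 × 3 patch size and the six-class output are assumptions, as the source does not state the patch size here:

```python
# Sketch of the CNN-1 layer stack; patch size and class count are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn1(patch_size=5, num_classes=6):
    model = models.Sequential([
        tf.keras.Input(shape=(patch_size, patch_size, 3)),  # RGB patch
        layers.Conv2D(64, (3, 3)),          # single Conv2D, 3x3, 64 filters
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(0.3),                # regularisation
        layers.Flatten(),                   # 2D -> 1D features
        layers.Dense(32, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

cnn1 = build_cnn1()
```

Training would then call `cnn1.fit(...)` with batch size 2048 for 100 iterations, as described in the text.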

Details of CNN-2 Used for Class Signature Prediction

Figure 3 shows the architecture of CNN-2, used for class signature prediction given the image vectors. This model was chosen due to the general robustness of CNNs over traditional classification techniques in land cover classification [20]. The model consists of a convolutional layer with ReLU activation and 128 filters. We included batch normalisation with ReLU to expedite convergence and introduce regularisation within the network. A dropout layer with a drop probability of 0.3 is then added to avoid over-fitting. Thereafter, we added two dense layers with ReLU activations; the first has 64 units, and the second has as many units as the number of attributes (300) in the Google News pre-trained embeddings. The kernel of this layer is initialised with the class signatures pre-computed by the Word2Vec model. The final layer is a softmax that performs the classification. The model is trained with the Adam optimiser (initial learning rate = 0.001) with a batch size of 2048 for 100 iterations. Table 1 presents the parameters of the CNNs set in this study. Several hyper-parameter settings were tested during the design and optimisation of the CNN configurations, and the best result was obtained with the aforementioned parameters; adding more layers brought no improvement (detailed experiments are provided in the experimental set-up of CNN-2).

KNN Algorithm for Finding the Nearest Class/Label
This algorithm rests on the assumption that adjacent instances should have similar labels. Given initial clustering centres, K-means clustering is performed on the testing instances. Then, the one-to-one similarity between the clustering centres and the unseen classes is calculated using linear programming, and the instances in each cluster are assigned to the related unseen class [39]. In the current research, KNN was applied separately to six scenarios with different unseen classes to find the closest class (one unseen class per scenario).

Evaluation Metrics (Precision, Recall and F-measure)
The F-measure is the weighted harmonic mean of two measures: precision (p) and recall (r). Both are class-specific accuracy measures computed from the retrieved information/data. The F-measure is computed with Equation (1) from p and r, and ranges from zero (0), the lowest value, to one (1), the highest [33,43].
The precision (p), or the confidence of an unseen land cover class, is defined as the number of true positives (pixels belonging to the real class) divided by the total number of objects categorised as the positive class (i.e., the sum of true positives and false positives, where false positives are objects/pixels erroneously categorised as belonging to the class). The recall (r), or sensitivity, represents the proportion of true positive objects/pixels that are correctly predicted, and is defined as the number of true positives divided by the total number of objects/pixels that belong to the positive class (i.e., the sum of true positives and false negatives). The computation of p and r is given in Equations (2) and (3). An ideal predictor would have p and r values of 1 [43].
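For reference, Equations (1)-(3) referred to above have the standard forms (reconstructed here from the definitions in the text, with TP, FP and FN denoting true positives, false positives and false negatives):

```latex
F = \frac{2\,p\,r}{p + r} \qquad \text{(1)}

p = \frac{TP}{TP + FP} \qquad \text{(2)}

r = \frac{TP}{TP + FN} \qquad \text{(3)}
```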

Datasets
In this research, orthophotos acquired over the Cameron Highlands, Malaysia, on January 15, 2015, by using an airborne system (RIEGL) with an RGB camera, were used as the first dataset. The average height of the system whilst collecting data was 1510 m. The spatial resolution of the data was 1 m. Two subset areas were selected for training and testing the proposed ZSL framework. The training area consisted of six types of land covers, namely, agricultural land, barren land, road, forest, urban area and water body. The test area included five types of land covers (agricultural land, barren land, road, forest and urban area) with an additional class (croplands). Figure 4 presents details about the number of pixels for each class in the first training and test areas. For the training area, agricultural land contained 2,021,470 pixels, followed by barren land with 1,142,705 pixels. Road, forest lands, urban areas and water body contained 395,206, 7,819,007, 558,300 and 142,896 pixels, respectively. In total, the number of training pixels was reported at 12,079,584 for all classes with a dimension of 4464 × 2706 × 3. The test area contained 1,423,087 pixels with dimensions of 863 × 1649 × 3. Agricultural land contained 229,060 pixels, followed by barren land with 113,040 pixels. Road, forest lands, urban areas and croplands contained 76,436, 829,964, 152,786 and 21,801 pixels, respectively. Figure 5 exhibits the maps of the first training and test areas, including the orthophotos and ground truth data.
To further assess the performance and robustness of the current ZSL framework, a second dataset was also tested. The new dataset was taken from the Ipoh area (in Peninsular Malaysia) on January 15, 2015, by using an airborne system (RIEGL) with an RGB camera; the average height of the system whilst collecting data was 1000 m. The area contained seven land cover types, including green space, road, urban area, barren land, forest, water body and agricultural land. Figure 4 presents the number of ground truth pixels for each class. The maps of the second test dataset, including the orthophotos and ground truth data, are shown in Figure 6.

Results of Feature Extraction (CNN-1)
CNN-1 was used for feature extraction, which is an important step in the proposed ZSL framework. These features were extracted from the second last layer (dense) before the softmax layer. The number of extracted features was 32 per pixel. The high-level features performed better than the features extracted from shallow layers [44]. This model was tested separately using the ground truth datasets for the training and test areas. The model accuracy was measured by the recall, precision, F1-score and top-k categorical accuracy on training and test areas.
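In Keras, extracting features from the second-last (dense) layer amounts to re-wiring the trained model so that it outputs that layer instead of the softmax. The sketch below assumes a hypothetical stand-in architecture (a per-pixel input and a dense layer named `penultimate`); the actual CNN-1 layers are given in Table 1.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical stand-in for CNN-1; the actual layers are listed in Table 1.
cnn1 = models.Sequential([
    tf.keras.Input(shape=(3,)),               # e.g. RGB values of one pixel (assumption)
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu", name="penultimate"),  # 32 features per pixel
    layers.Dense(6, activation="softmax"),    # seen-class head, dropped at this step
])

# Re-wire the model to output the second-last (dense) layer instead of the softmax.
extractor = models.Model(inputs=cnn1.input,
                         outputs=cnn1.get_layer("penultimate").output)
features = extractor.predict(np.random.rand(5, 3).astype("float32"), verbose=0)
print(features.shape)  # (5, 32): a 32-dimensional feature vector per pixel
```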

Impact of Network Architectures
The CNN-1 model was used for feature extraction, an important step in the proposed ZSL framework, with features extracted from the second-last (dense) layer before the softmax layer. For this purpose, we developed and examined various CNN architectures applied to the first and second datasets. The results are presented in Tables 2-5. Table 2 shows the impact of four architectures applied to the first and second datasets: (1) CNN without the batch normalisation layer, (2) CNN without the pooling layer, (3) CNN without batch normalisation and pooling layers, and (4) CNN with both batch normalisation and pooling layers. The CNN without the pooling layer performed best, with F1-scores of 0.953 for the training dataset, 0.904 for the first test dataset and 0.898 for the second test dataset. It was followed by the CNN with both batch normalisation and pooling layers, with F1-scores of 0.939 for the training dataset, 0.899 for the first test dataset and 0.875 for the second test dataset. Therefore, the architecture with the superior performance was used in the current research.

Table 3. Impact of the number of convolutional filters on the feature extraction (CNN-1) for the first and second datasets.

Impact of the Number of Filters
The influence of the number of filters in the convolutional layers was also examined using several configurations. As shown in Table 3, the experiment with 64 filters demonstrated the best performance, with F1-scores of 0.953 and 0.904 for the training area and first test area, and an F1-score of 0.898 for the second test area. Therefore, 64 filters were used in the CNN-1 network.

Impact of Network's Depth
To assess the impact of the network's depth, three depths (1, 2 and 3 Conv2D layers) were examined on the training and test datasets. The results demonstrated that using one Conv2D layer outperformed the other structures, with F1-scores of 0.953 and 0.904 for the training and first test datasets, and 0.898 for the second test dataset (Table 4).

Impact of Gaussian Noise
In this work, the Gaussian noise technique was employed to investigate sensitivity to the quality of the training samples. This additive noise can have a regularising effect and reduce over-fitting [44-46]. The performance analysis in Table 5 shows that the model performed best without noise (F1-scores of 0.953 for the training dataset, 0.904 for the first test dataset and 0.898 for the second test dataset). The results also indicated that the network's performance dropped only slightly when Gaussian noise was added to the input samples of the first and second datasets. This implies that the network was not very sensitive to additional noise, inferring that the quality of the training data was good.
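The noise injection itself is straightforward; a minimal numpy sketch with the same parameters (mu = 0, sigma = 0.1) might look as follows (the pixel values and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, mu=0.0, sigma=0.1):
    """Additive Gaussian noise, as used in the robustness test (mu = 0, sigma = 0.1)."""
    return x + rng.normal(mu, sigma, size=x.shape)

# Pixels scaled to [0, 1]; the perturbation is small relative to the signal.
clean = np.array([[0.2, 0.5, 0.8]])
noisy = add_gaussian_noise(clean)
```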

Results of Classification (CNN-2)
The CNN-2 model was utilised as a classifier for signature prediction, an essential stage in the proposed ZSL framework. The signature classes (vectors) were extracted from the softmax layer. The second dense layer, before the softmax layer, has 300 units, matching the number of attributes in the Word2Vec model pre-trained on Google News. The target kernel is initialised with the class signatures pre-computed by the Word2Vec model, and the softmax final layer is responsible for the classification (signature prediction). To this end, we conducted diverse experiments on the CNN architecture applied to the training and test datasets. The following subsections describe the experiments in detail.
The proposed ZSL model was used to generate land cover maps of the training and testing areas. For this purpose, six scenarios were applied to the testing area of the first dataset by considering different unseen classes separately. In other words, only one class was considered an unseen class in each scenario. Figure 7c-h displays the classification results obtained from the unseen class of road, urban areas, barren land, agricultural land, forest lands, and croplands for the first testing area, respectively (each singular unseen case was shown by a reddish box around the corresponding unseen class in the legend). As shown in Figure 7c, the road was the unseen class that was detected in the test image; the other testing classes consisted of urban areas, barren land, agricultural land, forest lands, and croplands. The next scenario was the detection of the urban area as an unseen class which is shown in Figure 7d. Likewise, the remaining unseen scenarios for the barren land, agricultural land, forests land, and croplands are shown in Figure 7e-h, respectively. Table 6 shows the accuracy of each corresponding unseen scenario. In general, the average accuracy of the proposed model for prediction of the unseen classes based on the top-K categorical accuracy was 0.778 top-one, 0.890 top-two, 0.942 top-three, as well as 0.798 F1-score, 0.766 recall, 0.838 precision, respectively.
As illustrated in Figure 7f, the highest accuracy is achieved when the unseen class is agricultural land, with 0.862 F1-score, 0.827 recall, 0.899 precision, 0.831 top-one, 0.918 top-two and 0.952 top-three. This result might be attributed to the good semantic matching of agricultural lands in the pre-trained Word2Vec model and the good detection of this class in the feature extraction phase. The lowest top-one accuracy belongs to the scenario wherein the unseen class was croplands, with a top-one accuracy of 0.665 (Figure 7h). This is because areas of croplands could be misclassified as other spectrally close classes, such as agricultural and forest lands. Likewise, croplands had the lowest accuracy on the other metrics, including F1-score, precision, recall, top-two and top-three, with values of 0.639, 0.747, 0.558, 0.837 and 0.927, respectively. All the test cases retrieved the orange colour for the croplands in the map; however, as the cropland pixels are limited and highly scattered within the map, for better visualisation we highlighted the croplands class with a yellowish rectangular box in Figure 7h.
Overall, although the average top-one accuracy for all unseen classes did not reach 80%, the top-two and top-three accuracies rose to approximately 90% and above, which is a promising result given that the same probability in traditional classification systems is almost nil due to the absence of the label in the training stage. This infers that two potential classes could be designated as correct (top-two), or three potential classes (top-three). Thus, this outcome could help experts in the geoscience field to select the most appropriate class among two or three possible classes.
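Top-k categorical accuracy, the metric behind the top-one/top-two/top-three figures above, counts a prediction as correct when the true class appears among the k highest-scoring classes. A minimal numpy sketch (toy scores, not the paper's data):

```python
import numpy as np

def top_k_accuracy(probs, true_idx, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest scores per row
    hits = [t in row for t, row in zip(true_idx, topk)]
    return float(np.mean(hits))

# Toy class scores over 4 classes for 3 pixels (hypothetical values).
probs = np.array([[0.60, 0.20, 0.10, 0.10],
                  [0.10, 0.30, 0.40, 0.20],
                  [0.25, 0.25, 0.30, 0.20]])
truth = [0, 1, 3]
print(top_k_accuracy(probs, truth, 1))   # only pixel 0 correct -> 1/3
print(top_k_accuracy(probs, truth, 2))   # pixels 0 and 1 correct -> 2/3
```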

Impact of Batch Normalisation
Batch normalisation is a technique used in training CNNs. It standardises the input to a layer for each mini-batch by adjusting and scaling the activations. It stabilises the training task, which reduces the number of training epochs [1,47].
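The standardisation step can be sketched in a few lines of numpy: each feature is normalised over the mini-batch and then scaled and shifted by the learnable parameters gamma and beta (set to their defaults here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardise each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy mini-batch of 3 samples with 2 features on very different scales.
batch = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])
out = batch_norm(batch)
# Each column now has (approximately) zero mean and unit variance.
```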
We evaluated the CNN-2 network to inspect whether it benefits from batch normalisation during training and prediction. To this end, two architectures were developed, with and without batch normalisation. Table 7 shows the comparative experiment for the first training dataset. Using the batch normalisation layer slightly enhanced the performance of the network during the training stage: the mean F1-score rose from 0.806 to 0.809, mean recall from 0.752 to 0.757, mean precision from 0.871 to 0.872, mean top-one from 0.812 to 0.814, mean top-two to 0.923 and mean top-three from 0.972 to 0.973. The same experiment was performed on the first test dataset. The results in Table 8 show that batch normalisation also improved the network, with the mean F1-score rising from 0.791 to 0.798, recall from 0.740 to 0.751, top-one from 0.773 to 0.778, top-two from 0.889 to 0.890 and top-three from 0.930 to 0.942.
The experiment on the second dataset (Table 9) showed a similar result. When using batch normalisation, the model performed slightly better than without it, with the mean F1-score rising from 0.722 to 0.729, recall from 0.668 to 0.676, precision from 0.786 to 0.790, top-one from 0.731 to 0.737, top-two from 0.901 to 0.906 and top-three from 0.920 to 0.924.
Table 9. Accuracy of the model without and with batch normalisation for the second test area (Ipoh).
Additionally, to enrich the robustness of the network, different numbers of neurons were examined over the first training and testing areas as well as the second test area (Table 10).

Impact of Gaussian Noise
The experiments of adding Gaussian noise to the training and test samples were conducted to investigate the sensitivity of the model to noise in the samples. The parameters were set as mu = 0 and sigma = 0.1. Table 11 shows that the network without additional noise outperformed the case with extra noise, with 0.832 F1-score, 0.892 precision and 0.780 recall for the training set of the first dataset, 0.825 F1-score, 0.874 precision and 0.872 recall for the first test set, and 0.737 F1-score, 0.792 precision and 0.688 recall for the second test dataset. Nonetheless, the models trained with noisy samples still achieved significant results.
Table 11. Impact of Gaussian noise (mu = 0, sigma = 0.1).


Results of Transferability
To evaluate the transferability of the proposed framework, the entire process was applied to the second test dataset. Visual interpretation illustrated that the area consisted of diverse land cover types, including green space, urban areas, road, agricultural land, forest lands, barren land and water body. The image was taken from a geographical environment similar to the first dataset; therefore, the same set of hyper-parameters was employed for the CNNs. Figure 8 demonstrates the classification results for seven scenarios of unseen classes. Table 12 presents the results of the singular unseen class for each scenario and their mean accuracies. In each scenario, the unseen class did not exist in the training set. Overall, the average categorical accuracy of unseen classes showed a promising result, scoring 0.737 top-one, 0.906 top-two, 0.924 top-three, 0.729 F1-score, 0.676 recall and 0.790 precision. This infers that the model also works well for the second dataset.


Discussion
Typical classification models in remote sensing can only classify objects that are seen during the training stage; these methods fail to classify unseen objects in the testing stage. Unseen object classification is a challenging topic, and plenty of studies have attempted to develop models to address this problem. For example, in recent years, ZSL has been widely implemented in computer science due to its potential to identify unseen objects without training samples, with assistance from semantic information [14]. These models extract abstract features from the image pixels, while the semantic information is often retrieved from the class labels as vectors using models such as Word2Vec. Unseen object classification is a hot topic, especially in the remote sensing field, due to its data variety and scalability [20,34]. To the best of our knowledge, these models are rarely applied to geoscience applications, especially high-resolution land cover classification from aerial photos. Therefore, this paper presented a ZSL framework based on CNN and Word2Vec for unseen land cover mapping. The CNN was found to be a robust feature extraction technique that achieved relatively high accuracies on the training dataset (0.953 F1-score, 0.941 precision, 0.882 recall), the first test dataset (0.904 F1-score, 0.869 precision, 0.949 recall) and the second test dataset (0.898 F1-score, 0.870 precision, 0.838 recall). For the robustness of the network, various cases and models were tested. Despite adding Gaussian noise to the input samples, no improvement was observed in the networks, implying that the training strategies used in this experiment were adequate and that no over-fitting effects were observed. In the feature extraction phase, the best performance was recorded when the pooling layer was not included in the architecture. Therefore, we hypothesised that the dimensionality or complexity of the dataset was not an obstacle to training the network.
Nevertheless, the case study including CNN with batch normalisation and pooling layers had comparable accuracy with the superior case (without pooling). Given that the feature extraction is a separate step in our ZSL, the proposed framework is flexible and can be further improved and customised for other applications. This process can be achieved by replacing the proposed CNN model with other deep-learning methods, depending on a new problem.
In the second phase, Word2Vec with 300 dimensions is used as the word embedding to gather class attributes. Our ZSL approach obtained mean accuracies of 0.798 F1-score, 0.766 recall, 0.838 precision, 0.778 top-one, 0.890 top-two and 0.942 top-three for different unseen classes on the first test area, accompanied by 0.729 F1-score, 0.676 recall, 0.790 precision, 0.737 top-one, 0.906 top-two and 0.924 top-three for the second test area; however, a standard terminology or word model (exclusively for the remote sensing domain) might further improve the results. Moreover, the mismatch in distance structure between the word vectors and the visual models of remote sensing image classification seriously impacts the operation and efficiency of ZSL image classification [47]. Thus, embedding attributes tailored to remote sensing data could positively affect the model's performance.
In a previous generalised ZSL application on land cover classification with 8 m PolSAR images, an overall accuracy of 73% with an unseen/seen class ratio of 1/3-3/3 was reported; however, the exact categories of rural areas, wetland and agricultural land were missing with attributes in the SUN database [24]. In another application utilising label propagation and label refinement approaches [14], the overall accuracy was reported as 58% and 70.4% for unseen/seen ratios of 5/16 and 1/7, respectively. In another work on street tree detection from aerial images [21], CNN structures were employed and an accuracy of 14.3% for 16 tree classes was reported. While overall accuracy was the main performance metric in previous remote sensing studies on zero-shot unseen land cover mapping, we further evaluated the robustness of the current framework, considering the imbalanced class distribution, via six evaluation metrics, namely F1-score, precision, recall and top-k categorical accuracy for k = [1,2,3], with a 1/6 unseen/seen ratio. Table 13 presents a comparison among related studies with a scope similar to the current study in the remote sensing domain. Although ZSL models achieve relatively high accuracy in computer vision, they are not fully explored in remote sensing applications. For this purpose, we need specific word embedding models that can be trained on geospatial data. This can introduce new research areas for ZSL and other fields requiring word embedding.
The most influential step in the current ZSL framework is the feature extractor, which is based on CNN models. It has shown promising results that could be extended to other geospatial applications. Accordingly, more robust CNN architectures, such as graphical CNNs and capsule-based CNNs, can help to improve the accuracy of ZSL classification. A second model (CNN) is required to perform class signature prediction; this model can also be replaced with other machine learning methods, such as SVMs or random forests. The class label is predicted by mapping the output vector to the nearest one and matching it with the vectors of all classes. Nevertheless, among the limitations of CNN and deep learning models, they often require a large number of training samples, and their architectures contain an enormous number of tunable parameters [36]. Another limitation is that the classifiers are usually employed as black boxes. To address these drawbacks, tensor-based learning models (tensor algebraic operations) [48] could provide promising solutions, including a reduction in weight parameters (especially for high-dimensional/noisy data), while allowing physical interpretation and preserving the spatial and spectral coherency of the input samples. This technique can generally be applied to a CNN structure by replacing fully-connected layers with tensor contractions [49].
In future works, several subjects should be additionally contemplated. First, more efficient semantic data related to land cover mapping should be investigated across different remote sensing products, including orthophotos, multispectral/hyperspectral images and synthetic aperture radar (SAR). Second, the potential of different semantic models for land cover mapping, assisted by diverse machine learning methods, needs to be further explored. Third, there is no specific standard or agreed-upon ZSL benchmark in the geospatial field [27,32]. Thus, the potential application of ZSL in a variety of popular applications, such as change detection, land use/land cover mapping and detecting the types of landslides, can be explored.
In this research, although some degree of confusion still exists among classes (e.g., croplands, forest lands and agricultural lands), the overall performance is satisfactory. This confusion could be attributed to the similarity of the spectral and spatial properties of these classes. In such a case, using data with additional spectral bands (e.g., hyperspectral) could improve the result. Besides, it is expected that establishing standard embedding attributes specific to remote sensing data could decrease such confusion among classes.
Overall, the applicability of the adopted framework in the second study area showed that our ZSL scheme exhibited relatively similar classification results. Most of the unseen classes in the different scenarios were detected well, given that such detection of unseen classes is impossible via traditional classification methods due to the absence of specific samples in the training data. We applied various CNN models, techniques and sensitivity analyses (e.g., additive Gaussian noise) to our networks to ensure their robustness using two different datasets. The experiments on both datasets confirmed the framework's performance.

Conclusions
ZSL is an important concept that has been recently developed because of its potential to solve classification problems that lack sufficient training data for every class. In this paper, we present a ZSL framework for land cover mapping using orthophotos. The framework is built based on CNN and Word2Vec. The former is applied for the feature extraction process, whereas the latter is used to learn class attributes from the class labels.
The proposed models and framework were tested on two subset datasets obtained for the Cameron Highlands (first dataset) and the Ipoh area (second dataset) in Malaysia. The results show that the proposed feature extraction model achieves high accuracies on the training set of the first dataset (0.953 F1-score, 0.941 precision, 0.882 recall), the first test dataset (0.904 F1-score, 0.869 precision, 0.949 recall) and the second test dataset (0.898 F1-score, 0.870 precision, 0.838 recall). The ZSL model achieves average accuracies of 0.778 top-one, 0.890 top-two, 0.942 top-three, 0.798 F1-score, 0.766 recall and 0.838 precision for different unseen classes on the test area of the first dataset, and 0.737 top-one, 0.906 top-two, 0.924 top-three, 0.729 F1-score, 0.676 recall and 0.790 precision for the second dataset. This outcome could help experts in the remote sensing field recognise the correct class among two or three possible classes, especially when those classes are not included in the training set.
Transforming remote sensing imagery to a new embedding and using this strategy to predict seen and unseen classes could be a useful approach to ZSL in remote sensing data. Further developments will be considered, making the proposed framework more efficient for learning to predict unseen classes by using novel Word2Vec models specifically for remote sensing applications and various types of CNN models, including residual and graph CNNs. Moreover, the same number of samples for the seen and unseen classes will be considered for further assessment.