Zero-Shot Image Classification Based on a Learnable Deep Metric

The supervised model based on deep learning has made great achievements in the field of image classification after training with a large number of labeled samples. However, there are many categories without or only with a few labeled training samples in practice, and some categories even have no training samples at all. The proposed zero-shot learning greatly reduces the dependence on labeled training samples for image classification models. Nevertheless, there are limitations in learning the similarity of visual features and semantic features with a predefined fixed metric (e.g., as Euclidean distance), as well as the problem of semantic gap in the mapping process. To address these problems, a new zero-shot image classification method based on an end-to-end learnable deep metric is proposed in this paper. First, the common space embedding is adopted to map the visual features and semantic features into a common space. Second, an end-to-end learnable deep metric, that is, the relation network is utilized to learn the similarity of visual features and semantic features. Finally, the invisible images are classified, according to the similarity score. Extensive experiments are carried out on four datasets and the results indicate the effectiveness of the proposed method.


Introduction
Thanks to the development of deep learning models, image classification and image recognition have made continuous progress. However, most of existing deep learning models [1][2][3][4] are supervised and they can only classify and recognize seen classes with labeled samples. The only way to recognize the novel classes is to retrain the classifiers with a large number of novel, labeled samples. To classify and recognize unseen classes, human reasoning process is utilized to simulate the unseen classes in human brains, reading the description of objects and recognizing them. Similarly, zero-shot learning is proposed and it makes the deep learning models have the ability to reason similar to humans, classify and recognize the new classes, even which have never been seen before.
Zero-shot learning (ZSL) has become a new direction derived from transfer learning and its training and testing samples are independent and disjointed. ZSL aims to replace the low-dimensional features of samples with high-dimensional semantic features so that the trained model has the ability to transfer.
In recent years, a kind of typical ZSL methods based on space embedding [5][6][7] uses the correlation between seen and unseen classes to complete the attribute transferring from seen classes to unseen classes. According to the mapping of vision and semantic, space embedding methods [8][9][10] can be divided into three categories: semantic space embedding methods, visual space embedding methods, and common space embedding methods. As a typical semantic space embedding method, the Semantic AutoEncoder (SAE) model [8] maps visual features to semantic space directly. But there is a hubness problem, that is, the recognition results of unseen classes are biased toward seen classes, due to the lack of visual features of unseen classes. As one visual space embedding method, the Deep Embedding Model (DEM) maps semantic features into the visual space [9]. Although DEM can alleviate the hubness problem to a certain extent, the inconsistency between the manifold of visual features and semantic features leads to the semantic gap. Consequently, common space embedding methods are proposed to map visual features and semantic features to the same embedding space for achieving good classification performance. For example, structured joint embedding (SJE) [10] is one common space embedding method. Therefore, to alleviate the semantic gap problem, we propose a new ZSL method based on the common space embedding in this paper.
In order to learn the relationship between visual features and semantic features easily, exiting ZSL methods [8][9][10][11][12][13][14] usually use the nearest neighbor search methods with predefined fixed measures. Ji et al. [11] have adopted attribute similarity to constrain the distance between categories in the same modality, meanwhile, hash codes have been generated according to the category similarity and attribute similarity to perform the approximate nearest neighbor (ANN) search. In [12], the labels have been sorted in depth according to the distance and then the ranking SVM has been directly used to perform zeroshot multi-label prediction. Sandouk et al. [13] have used the Euclidean distance between embedded concepts in the concept embedding space to reflect the semantic similarity; while the simple metric has the limitation of unlearnable and being predefined in advance. To overcome these limitations, Sung et al. [14] have proposed the relation network model (RN) to learn a learnable end-to-end deep metric for comparing the relation between visual features and semantic features with the relationship scores. Inspired by the relation network model, we propose a new ZSL method based on the learnable deep metric in this paper.
Therefore, we propose a new zero-shot image classification method based on a learnable deep metric (ZIC-LDM). ZIC-LDM model is composed of the common space embedding module and the relation module. The common space embedding module is adopted to map the visual features and semantic features into a common space and the relation module is to calculate the similarity score between the visual features and semantic features by using the end-to-end learnable deep metric to achieve good relationship matching. Our proposed method, ZIC-LDM, can learn the correlation between visual features and semantic features in the common space with the learnable deep metric, and it adjusts the correlation end-to-end in a data-driven way. This can greatly alleviate the semantic gap problem caused by the inconsistency between the manifold of visual features and semantic features. ZIC-LDM is applied to the traditional zero-shot image classification task and the generalized zero-shot image classification task, respectively. Experiments are conducted on widely used datasets and the experimental results indicate that ZIC-LDM has the ability to achieve better zero-shot image classification performance compared with other methods.

Zero-Shot Learning
ZSL relies on the labeled seen classes and the semantic information associated with unseen classes and seen classes. In the early stage, the ZSL methods, such as Direct Attribute Prediction (DAP) and Indirect Attribute prediction (IAP) [15], predict the testing samples by training an attribute classifier. With the development of ZSL, current ZSL methods mainly include two categories: space embedding methods and generative model methods. The space embedding methods [8][9][10] rely on an embedding space, in which the attribute migration from seen classes to unseen classes is completed. The generative model methods utilize different generative models, such as generative adversarial networks (GANs) [16,17], variational autoencoders (VAEs) [18,19], and flow-based models (Flows) [20] to directly generate visual features of unseen classes, and then transform the zero-shot learning problem into a traditional supervised learning problem. For example, [16] has proposed a triple discriminator GAN (TDGAN), which employs a GAN with three discriminators to synthesize visual features for images of unseen classes. Ref. [17] has proposed a multi-modal generative adversarial network (M2GAN) to fuse various types of class semantic prototypes, which are achieved in an adversarial framework. Machot et al. [21] have designed a ZSL algorithm by exploiting heterogeneous knowledge between sensor data and semantic space, and then they have spread this algorithm from recognizing unseen classes to unseen human action. Matsuki et al. [22] have proposed an extended word vector-based algorithm by analyzing several ZSL results of embedding semantic features in semantic space. Ohashi et al. [23] have considered that different classes might exist some same attributes, which would influence the classification. Therefore, they have proposed one method to calculate the importance of every attribute of each class.
Though the performance of ZSL has been constantly improved, the setting of the testing stage is too strict and it cannot truly reflect the scene of object recognition in the real world. Therefore, Chao et al. [24] have proposed the generalized zero-shot learning (GZSL) to be closer to the reality of testing stage, considering the amount of seen classes is far more than that of unseen classes. That is, the testing data come from unseen classes and seen classes in GZSL, while the testing data come from unseen classes in ZSL.

Meta Learning
Meta learning is to learn new tasks by using prior knowledge and the known experience. The existing meta methods can be roughly divided into three categories: learning to fine-tune based methods, RNN memory-based methods, and metric learning-based methods. Learning to fine-tune based method [25] learns an initial parameter first, and then it only needs a few samples for training to solve new problems through several gradient descent steps. The RNN memory-based methods use memory recurrent neural networks for meta learning. For example, an external memory model has been used to solve the problem of few samples [26]. The metric learning-based method [27] maps the input space (such as a picture) to a new embedding space by learning an embedding function, and then uses similarity measures (such as Cosine distance and Euclidean distance) to distinguish individual classes. The relation network model (RN) [14] classifies the images of new classes by calculating the relationship scores between the query image and several examples of each new class.

Semantic Features
Semantic features, representing various details of categories, such as characteristics of animals, human behaviors, and scene descriptions, are used to distinguish different objects in ZSL. In general, semantic features are usually divided into three categories: user-defined attribute features, word vectors, and text features. The most common category of semantic features is the user-defined attribute features, which are the specific description of a certain category. The user-defined attribute features have been used to construct semantic space for improving the accuracy of zero-shot classification [28]. Word vectors (Word2Vec) are converted from a text by natural languages processing technology, such as CBOW [29], Skip-gram [29], and GloVe [30]. Text features are transformed from the text description of a certain category through text coding models.

Similarity Measure for Zero-Shot Image Classificaiton
Most zero-shot image classification methods project features extracted by deep learning network into the embedding space. An end-to-end deep learning model could learn better embedding space and a more flexible model. However, these deep models estimate loss through different loss functions. Socher et al. [31] have employed Euclidean distance to match features and attributes in a simple way. Xie et al. [32] have considered the compatibility loss, which has advantages to learn local features. Ref. [33] has exploited the Margin-based loss and has integrated the language model into a neural network, and then increase the separability of features by the end-to-end learning. Ba et al. [34] have considered the binary cross-entropy loss, hinge loss, and Euclidean distance loss simultaneously to predict testing samples through text corpus. DEM [9] has exploited the least square loss, which could not only jointly learn the language model and the embedding space but also fuse text description and multiple modal data. RN [14] has exploited the MSE loss to calculate the similarity score between visual features and semantic attributes to classify unseen classes.

Task Define
In the zero-shot image classification task, the seen classes S = x p , y p n s p=1 are taken as the training set, where x p ∈ X S is the p-th image of the seen classes and y p ∈ Y S is the corresponding class label. The unseen classes U = (x q , y q n u q=1 serve as the testing set, where y q ∈ Y U is the corresponding class label. Seen classes and unseen classes constitute the whole dataset, but they don't intersect: In our paper, we adopt the user-defined attribute features as the semantic features v. v c and v d respectively represent the semantic features of seen classes and unseen classes with the number of classes c and d. In the testing stage, for testing sample x q and semantic features v d , the purpose of zero-shot image classification is to predict the corresponding y q for x q .

Model
In this paper, a zero-shot image classification framework, based on learnable deep metric ZIC-LDM is proposed. Figure 1 shows the framework of ZIC-LDM, which is mainly composed of two modules: the relation module and the common space embedding module. In the common space embedding module, the visual features of a given image x are extracted as f ϕ (x) by using the residual network ResNet101. Then f ϕ (x) is mapped into the common space through a fully connected layer and now the visual features are defined as g ϕ (x). In addition, two fully connected layers are used to map the semantic features v to the same common space, where the semantic features are expressed as h ϕ (v). In the relation module, the visual features g ϕ (x) and the semantic features h ϕ (v) are first concatenated, and then the similarity score is calculated by two fully connected layers in a data-driven way. At last, image classification can be completed according to the similarity score.
For semantic features v , which are mapped to the same common space through two fully connected layers with the parameter. At this time, the semantic features in the common space are expressed as

Common Space Embedding Module
Common space embedding module maps visual features and semantic features to the same common space.
First, the visual features f ϕ (x) of a given image x can be obtained by using the residual network ResNet101 with the parameter W f . The visual features f ϕ (x) can be expressed as: Then, the visual features f ϕ (x) is mapped to the common space through a fully connected layer with the parameter W g , and the visual features in common space can be expressed as g ϕ (x): For semantic features v, which are mapped to the same common space through two fully connected layers with the parameter. At this time, the semantic features in the common space are expressed as h ϕ (v):

Relation Module
Relation module r ω realizes zero-shot image classification by calculating the similarity score of visual features and semantic features. After visual features and semantic features are embedded into the common space and connected, the relation module calculates a scalar between 0 and 1 according to the parameters of the relation network to represent the learned relationship between visual features and semantic features in the relation module, which referred to the similarity score. This scalar refers to as the similarity score. The higher the similarity score is, the more matching visual features and semantic features are. First, the visual features and semantic features are concatenated followed by a RELU function activated fully connected layer and a sigmoid function activated fully connected layer in turn, and finally the similarity score is calculated. In the training stage, the visual features g ϕ x p of the image and the semantic features h ϕ (v c ) obtained in the common space embedding module are concatenated, and then the similarity score s p,m is calculated after two full connection layers. The similarity score s p,m can be expressed as: In the above Equation (4), C(, ) represents the operation of concatenation.
Here we expect a regression problem to calculate the similarity score with the learnable deep metric. The similarity score is a concrete value in the range of {0, 1}. However, in order to avoid restrictions, we approximately regard it as a binary classification problem. When visual features and semantic features match, the similarity score is 1, otherwise, the similarity score is 0.

Objective Function
In this paper, mean square error (MSE) is used as the loss function and it can be calculated with the similarity score s p,m and the real category label y (v c ) of the seen class. The loss function is can be expressed as follow: To make the relation module match the visual features and semantic features belonging to the same category well, the proposed ZIC-LDM needs to be trained by minimizing Formula (5):

Model Implementation
The training process of ZIC-LDM, i.e., zero-shot image classification model based on a learnable deep metric, is shown in Algorithm 1. Sampling m training samples x and corresponding label y from seen classes; 4 Mapping f ϕ (x) into common space: Mapping v into common space: Concatenate g ϕ (x) and h ϕ (v); 7 Calculate similarity score: Calculate MSE loss: L = MSE s p,m , y v ; 9 Update FC parameters for semantic features mapping, FC parameters for visual features mapping and relation module:

Testing Process
In this chapter, zero-shot image classification task and generalized zero-shot image classification task are tested respectively.

Zero-Shot Image Classification
In the zero-shot image classification task, for a given image x q ∈ x U of the unseen class, the visual features f ϕ x q are extracted and then they are mapped to the common space with the representation g ϕ x q by the trained fully connected layer FC3. Then, the fully connected layers FC1 and FC2 are used to map the semantic features v d of the unseen class into same common space to obtain the semantic features h ϕ (v d ) of the unseen classes (d is the number of unseen classes). h ϕ (v d ) and g ϕ x q are concatenated, followed by the calculation of their similarity, namely the similarity score: Finally, the class with the highest similarity score is taken as the prediction label. This can be expressed as:

Generalized Zero-Shot Image Classification
When the generalized zero-shot classification is carried out, the testing classes include both seen classes and unseen classes, that is X = X S ∪ X U . At this time, the visual features f ϕ (x) are expressed as g ϕ (x) in common space. The semantic features mapped to the common space are h ϕ (v), where v = v c ∪ v d . The matching degree of visual features and semantic features of the image, i.e., the similarity score s, can be calculated as follow: The class with the highest similarity score is taken as the label of the prediction. This is expressed as:

Experiments
The proposed method ZIC-LDM is tested and compared with several existing methods on four datasets. Experiments are conducted and the results indicate the effectiveness of ZIC-LDM.

Dataset and Settings
In our experiments, four datasets commonly used in zero-shot image classification are selected: Animals with Attributes 1 and 2 (AwA1 [15] and AwA2 [35]), CUB (CUB-200-2011) [36], and SUN (SUN Attribute) [37]. For the common space embedding module, the pooling layer of the top layer of RerNet101 without fine-tuning is used to extract the visual features f ϕ (x), whose dimension is 1024. MLP network is used to learn the semantic features h ϕ (v). For the relation module, g ϕ (x) and h ϕ (v) are concatenated and then the relationship between the visual features and semantic features of the image in the common space is calculated through two fully connected layers FC4 and FC5. In zero-shot image classification, the hubness problem often occurs in cross-modal mapping, so L 2 regularization is added to FC1 and FC2 at the fully connected layers to alleviate this problem. Besides, our framework is trained in the embedded network with a weight decay of 10 −5 , and the Adam algorithm is used to initialize the learning rate to 10 −5 .

Traditional Zero-Shot Image Classification
Top-1 accuracy is usually used as the criterion for the image classification. The prediction is accurate when the predicted class is correct. Averaging the accuracies of all images can have a good effect on the classes with dense images. What is more, for some classes with relatively rare images, the average values of each group of correct predictions are calculated, respectively, that is, the average top-1 accuracy of each class is measured. In the traditional zero-shot image classification, the mean precision of top-1 is adopted as the criterion. The experimental results on AwA1, AwA2, CUB, and SUN datasets are shown in Table 1, and the best results are in bold. As can be seen from Table 1: 1.
The results of ZIC-LDM on AwA1, AwA2, CUB, and SUN datasets are better than those of the baseline SJE with the increase of 4.0%, 5.8%, 2.9%, and 5.2%, respectively. In addition, compared with the latest models Gaussian and SELAR, ZIC-LDM also achieves excellent results, which shows that our proposed model is effective in zeroshot image classification. Therefore, the ZIC-LDM with the learnable deep metric can learn good visual-semantic relationship.

2.
Compared with the methods, such as DAP, CONSE, ESZSL, ALE, and SynC, which use the predefined fixed metrics, ZIC-LDM has achieved the best results on AwA1, AwA2, CUB, and SUN datasets. This indicates that the learnable deep metric makes ZIC-LDM learn the visual-semantic relationship well.

3.
Compared with the method SAE based on semantic space embedding and methods RN and CCSS based on visual space embedding, ZIC-LDM based on common space embedding has achieved the best results on AwA1, AwA2, CUB, and SUN datasets. It shows that common space embedding can alleviate the semantic gap problem.
In addition, the confusion matrices and visualization results of ZIC-LDM on AwA1 and AwA2 datasets are given in Figures 2 and 3. It can be seen from the diagonal values of the confusion matrices in Figure 2 that ZIC-LDM can accurately recognize most unseen classes. Figure 3 shows the visualization distribution of 10 unseen classes with t-SNE. The distribution of same class is more concentrated, and the distribution of different classes is more dispersed. Figures 2 and 3 show the feasibility and effectiveness of the learnable deep metric and the common space embedding.
All in all, the above results indicate that our proposed method ZIC-LDM can obtain good zero-shot image classification performance. This is mainly due to the common space embedding module and relation module used in ZIC-LDM. The learnable deep metric helps ZIC-LDM learn the relationship between visual features and semantic features and the common space embedding module alleviates the semantic gap problem.
In addition, the confusion matrices and visualization results of ZIC-LDM on AwA1 and AwA2 datasets are given in Figures 2 and 3. It can be seen from the diagonal values of the confusion matrices in Figure 2 that ZIC-LDM can accurately recognize most unseen classes. Figure 3 shows the visualization distribution of 10 unseen classes with t-SNE. The distribution of same class is more concentrated, and the distribution of different classes is more dispersed. Figures 2 and 3 show the feasibility and effectiveness of the learnable deep metric and the common space embedding. All in all, the above results indicate that our proposed method ZIC-LDM can obtain good zero-shot image classification performance. This is mainly due to the common space embedding module and relation module used in ZIC-LDM. The learnable deep metric helps ZIC-LDM learn the relationship between visual features and semantic features and the common space embedding module alleviates the semantic gap problem.

Generalized Zero-Shot Image Classification
In the generalized zero-shot image classification, the harmonic mean is selected as the evaluation criterion to make seen classes and unseen classes both have high accuracy. Firstly, the average top-1 accuracy per class of seen classes and unseen classes is calculated, and then the harmonic mean of seen classes and unseen classes is computed. The expression of the harmonic mean is as follow: where S represents the average top-1 precision of seen classes, and U represents the average top-1 precision of unseen classes. In the process of model training, seen class samples may make the model over-fit, resulting in an imbalance between the accuracy of seen class samples and the accuracy of unseen class samples. So, U and H are the main evaluation criterion of generalized zero-shot image classification.

Generalized Zero-Shot Image Classification
In the generalized zero-shot image classification, the harmonic mean is selected as the evaluation criterion to make seen classes and unseen classes both have high accuracy. Firstly, the average top-1 accuracy per class of seen classes and unseen classes is calculated, and then the harmonic mean of seen classes and unseen classes is computed. The expression of the harmonic mean is as follow: where S represents the average top-1 precision of seen classes, and U represents the average top-1 precision of unseen classes. In the process of model training, seen class samples may make the model over-fit, resulting in an imbalance between the accuracy of seen class samples and the accuracy of unseen class samples. So, U and H are the main evaluation criterion of generalized zero-shot image classification.
Our proposed method ZIC-LDM is compared with other 12 methods in the generalized zero-shot image classification task on AwA1, AwA2, CUB, and SUN datasets. The results are shown in Table 2 and the optimal results are in bold. It can be seen from Table 2  Compared with the latest methods MLSE, MIIR, and SELAR, ZIC-LMD has the best U and H on AwA2, CUB, and SUN datasets. It shows that the combination of learnable deep metric and common space embedding is advanced and effective in generalized zero-shot image classification task.
In general, ZIC-LDM achieves good result in generative zero-shot learning. It benefits from learning the relationship between visual and semantic features by using the learnable deep metric and using common space embedding to relieve the semantic gap.

Loss Convergence Analysis
The essence of the semantic gap problem is that the model cannot learn the good mapping relationship between visual features and semantic features because of their manifold differences. In the process of model training, the convergence of the loss function can determine whether the semantic gap problem is alleviated. The loss convergence of ZIC-LDM is analyzed on AwA1 and AwA2 datasets, respectively. In these experiments, ZIC-LDM is compared with RN, which is based on the learnable deep metric and the visual embedding space. The loss of the first 2000 iterations is selected as the comparison, and the loss output per 400 iterations is taken as the record. The loss convergence curves on AwA1 dataset and AwA2 dataset are shown in Figure 4. determine whether the semantic gap problem is alleviated. The loss convergence of ZIC-LDM is analyzed on AwA1 and AwA2 datasets, respectively. In these experiments, ZIC-LDM is compared with RN, which is based on the learnable deep metric and the visual embedding space. The loss of the first 2000 iterations is selected as the comparison, and the loss output per 400 iterations is taken as the record. The loss convergence curves on AwA1 dataset and AwA2 dataset are shown in Figure 4.  From Figure 4 we can see that ZIC-LDM has good convergence. Compared with RN, the common embedding space adopted in ZIC-LDM makes it alleviate the semantic gap problem between visual features and semantic features, and then makes its loss converge faster than that of RN. It shows that common space embedding alleviates the semantic gap problem.
In addition, the computing time of ZIC-LDM is relatively short. For example, our ZIC-LDM method takes about 125 s or 1500s with 25,000 epochs or 300,000 epochs to perform traditional and generalized zero-shot image classification, respectively, on 30,745 images of 50 categories on the AwA1 dataset, and the experimental results are good. We use NVIDIA GTX1080Ti, made by Micro Star International (Shenzhen, China), and PyTorch1.0 as the testing environment in our experiments.

Distance Metric Study
In this section, we conduct experiments on distance metrics study. Three predefined fixed distanced measures, Euclidean distance (ED), Cosine similarity (CS), and Mahalanobis metric learning (MML) are compared with the learnable deep metric used in our method ZIC-LDM. The experimental results are shown in Table 3 and the optimal results are in bold, where T is top-1 accuracy for traditional zero-shot image classification. From Table 3, we can see that the learnable deep metric obtains the best results compared with the predefined fixed distanced measures learnable deep metric ED, CS, and MML. This is attributed to its ability to learn the relationship between visual features and semantic features.

Conclusions
In this paper, the zero-shot image classification method based on learnable deep metric ZIC-LDM is proposed to overcome the limitation of similarity between visual features and semantic features of images and alleviate the semantic gap problem by using the relation module and the common space embedding module. The relation module uses the learnable deep metric to learn the good visual-semantic relationship, as well as the common space embedding module can learn the correlation between visual features and semantic features in the common space to alleviates the semantic gap problem. Compared with other methods, especially the baseline SJE, the proposed method has better performance in both the traditional ZSL task and the GZSL task on four datasets. This indicates the effectiveness and advancement of the proposed method ZIC-LDM. However, some categories with low correlation (i.e., birds and cars) have a limit when the learnable deep metric learns the relationship between visual features and semantic features in the knowledge transforming process, due to user annotated semantic features. Then our future research will exploit the learnable deep metric to learn visual-semantic relationships in graph neural networks, according to feature nodes of different categories. In addition, we will also research the domain shift problem in our future work.