HFM: A Hybrid Feature Model Based on Conditional Auto Encoders for Zero-Shot Learning

Zero-Shot Learning (ZSL) concerns training machine learning models that can classify or predict classes (labels) that are not present in the training set (unseen classes). A well-known problem in Deep Learning (DL) is the requirement for large amounts of training data. Zero-shot learning is one approach that can be applied to overcome this problem. We propose a Hybrid Feature Model (HFM) based on conditional autoencoders, in which a classical machine learning model is trained on pseudo training data generated by two conditional autoencoders (given the semantic space as a condition): (a) the first autoencoder is trained with the visual space concatenated with the semantic space and (b) the second autoencoder is trained with the visual space as an input. Then, the decoders of both autoencoders are fed with the test data of the unseen classes to generate pseudo training data. To classify the unseen classes, the pseudo training data are combined to train a support vector machine. Tests on four different benchmark datasets show that the proposed method achieves promising results compared to the current state of the art in both the standard Zero-Shot Learning (ZSL) and the Generalized Zero-Shot Learning (GZSL) settings.


Introduction
Deep-learning-based models have brought tremendous advancement in different fields, including but not limited to computer vision [1,2], natural language processing [3], and satellite image processing [4]. In these research fields, deep-learning-based models have achieved human-level capabilities. However, these developments depend on high-quality, large-scale data. With the exponential growth of new classes in the real world, collecting large amounts of data that cover significant variations is costly. Annotating sufficient training data for each class in order to exploit supervised learning is a key challenge [4,5]. Therefore, different learning paradigms for limited labeled data have been presented in the literature, namely semi-supervised learning [4], life-long learning [6], and active learning [7]. However, these paradigms are limited in their ability to explore variations within a small amount of labeled data. Generally, humans can recognize over 30,000 core item types [8] and many more sub-categories. Additionally, humans are also excellent at recognizing items without having seen any visual examples. In machine learning, this capability corresponds to the zero-shot learning problem.
Zero-shot learning (ZSL) models [9-11] have recently emerged to identify unseen categories with no training data but with semantic descriptions of classes. The ZSL models can take into account situations when data are scarce [12,13]. In general, the ZSL models address this situation by learning either a visual-to-semantic mapping [14,15] or a semantic-to-visual mapping [16,17]. The general assumption is based on the observations that the visual space encodes the semantic space and that the semantic space encodes the visual space [15,18-20]. However, zero-shot learning is still a challenging research field since we need to predict unseen test categories that are never used when training the models [21-23]. For example, most ZSL methods like the Deep Embedding Model (DEM) [24-26] discover direct embeddings from global features to the semantic space. However, the methods cannot capture the appearance relationships between different local regions in this way. The techniques could also ruin the diversity of the visual modality due to highly overlapped semantic descriptions of various categories.
To cope with these challenges, we propose a Hybrid Feature Model (HFM) based on conditional autoencoders, a zero-shot learning method that identifies both seen and unseen classes by transferring knowledge from seen categories to unseen categories. Building on the observations in [27], where a single conditional variational autoencoder is used, our method consists of two autoencoders that are depicted in Figure 1. The first autoencoder is provided with the concatenation of the visual and semantic spaces. The second autoencoder is provided with only the visual space. Our proposed method encodes the real data distribution efficiently. Therefore, our approach identifies the unbiased projection toward seen classes and produces close relationships between unseen samples and prototypes.

Figure 1. The proposed approach consists of two autoencoders. The first autoencoder is provided with the concatenated vectors of the visual and semantic spaces. The second autoencoder is provided with the visual feature vectors only. Both autoencoders have a dense layer, followed by a dropout and a second dense layer. This is followed by another layer, which generates the values z. Activation functions are ReLU, and the activation functions for the last layer for both the encoder and the decoder are linear.
Most techniques fail to consider the discriminative information between the visual and semantic spaces. Thus, the significant insight is that our hybrid autoencoder approach may precisely represent the real data distribution of the query set in a fine-grained and dynamic manner, especially when the available samples do not carry rich discriminative information. This can be exploited to enrich the diversity of the data distribution and further improve the model accuracy. Furthermore, we explore both the visual and semantic spaces to encode diversified and discriminative modes of variation for learning a boosted classifier. Therefore, our method alleviates the problems that arise when intra-class diversity and inter-class discriminability are lacking. Consequently, the proposed model presents promising results using a highly fine-grained dataset (see Section 5). In addition, the work shows that using multiple VAEs generates an improved discriminative image space where data are easier to separate for ZSL classification purposes.
The rest of the paper is organized as follows. In Section 2, we present the related works from the literature. In Section 3, we present our proposed method in detail. The experiments, the experimental results on four benchmark datasets, and the conclusion are presented in Sections 4, 5, and 6, respectively.

Related Work
We classify the literature into two categories: embedding space-based zero-shot learning and feature generation-based zero-shot learning. In the first category, Lampert et al. [12] presented attribute-based classification based on a high-level description that is phrased in terms of semantic attributes, such as the object's color or shape. Norouzi et al. [13] introduced an image embedding system that mapped images into the semantic embedding space via a convex combination of the class label embedding vectors. However, the methods do not provide a natural mechanism for multiple semantic modalities to be fused and optimized jointly in an end-to-end structure. In [18], the authors assumed that unseen categories come from unsupervised text corpora. Their method is based on the distributions of words in texts as a semantic space for understanding what objects look like. The method does not use the distribution information of samples. Therefore, the method cannot discover the cluster structure of samples. The authors of [15] presented a visual-semantic embedding model trained to recognize visual objects using both labeled image data and semantic features gleaned from unannotated text. They did not exploit the cluster relationship to rectify the biased sample-prototype relationship. Akata et al. [20] learned a function considering image and class embeddings. They used supervised attributes and unsupervised output embeddings either derived from hierarchies or learned from unlabeled text corpora. Xian et al. [21] introduced a latent embedding model for learning a compatibility function between image and class embeddings. Romera et al. [22] modeled the relationships between features, attributes, and classes as a two-linear-layer architecture, where the weights of the top layer are not learned but are given by the surrounding features. The researchers in [23] embedded each class in the space of attribute vectors. Changpinyo et al. [28] aligned the semantic space to the model space that concerns itself with recognizing visual features. Kodirov et al. [29] presented a ZSL model based on a Semantic AutoEncoder (SAE). They projected a visual feature vector into the semantic space. The encoder and decoder may be linear and symmetric, and thus may not recognize or differentiate multiple features. Zhang et al. [24] used the visual space as the embedding space by considering the subsequent nearest neighbor search. The method in [30] introduced an episode-based model for zero-shot learning. The authors trained their model within a set of episodes, each of which is modeled to simulate a zero-shot classification task. These methods have limited abilities to scale to large numbers of object categories. This limitation is partly due to the increasing complexity of collecting sufficient training data in the form of labeled images as the number of object categories grows.
In the second category, the methods learn to synthesize visual samples for unseen classes. These methods first learn a conditional generative model, for example a Variational Autoencoder (VAE) or a Generative Adversarial Network (GAN). GAN-based approaches, e.g., f-VAEGAN-D2 [25] and TF-VAEGAN [26], show a competitive performance. In [25], the authors proposed f-VAEGAN-D2, which combined VAEs and GANs to learn the marginal feature distribution of unlabeled images through an unconditional discriminator. However, the method cannot discover the class-based feature distribution from the available semantic information. The authors of [26] proposed the TF-VAEGAN model, which also combined VAEs and GANs. In contrast to f-VAEGAN-D2, they added a semantic embedding decoder to reconstruct the embedding space, which is used as a feedback module to improve the output of the generator of the GAN. However, GANs and their derivatives show training instability, while the VAE is more stable [31].
Mishra et al. [27] generated the samples from the given attributes, using a conditional variational autoencoder, and exploited the generated samples to classify the unseen classes.
Our proposed method falls into the feature generation-based zero-shot category, motivated by stability during training. The approach also encodes complex data distributions efficiently. It demonstrates that for specific test situations (see Section 5), a hybrid model consisting of two VAEs can outperform a GAN-VAE model with less training effort. Excluding the Kullback-Leibler (KL) divergence from the conditional VAE loss yields enhanced discriminative image features for classifying unseen classes in ZSL settings, which is promising. A limitation of the proposed approach is that the proposed model lacks a feedback module that can be coupled with the decoder to improve the reconstructed image space. To show the strength of our proposed method, we perform a comparison with a set of methods [12,13,15,18,20-30]. The reason for choosing these methods for comparison is three-fold. Firstly, they belong to both categories in the literature. Secondly, they represent different techniques. Lastly, these methods represent both older and newer techniques in the literature. We also compare our method with [19]. The considered approach is reinforcement learning for training image captioning methods. The comparison with this method highlights the generalization capability of our approach.

Problem Definition
The basic idea of any ZSL approach is to build a model which maps information from the seen to the unseen classes based on a semantic description of the unseen classes. In other words, zero-shot learning is needed when labeled training examples are not available for all classes under observation. Therefore, the available dataset is split into two groups, a training subset of seen classes $Y_{seen} = \{y^{1}_{seen}, y^{2}_{seen}, \ldots, y^{n}_{seen}\}$ and a subset of unseen classes $Y_{unseen} = \{y^{1}_{unseen}, y^{2}_{unseen}, \ldots, y^{m}_{unseen}\}$, where $n$ refers to the number of seen classes and $m$ refers to the number of unseen classes. In addition, the assumption $Y_{seen} \cap Y_{unseen} = \emptyset$ should hold. In such a situation, the task is to build a model $\mathbb{R}^{d} \rightarrow Y_{unseen}$ that uses only the training subset and is able to classify the unseen classes. Afterward, the trained classifier should be applied to test data of unseen classes under the zero-shot setting $Y_{seen} \cap Y_{unseen} = \emptyset$. Consequently, zero-shot learning provides a technique to overcome obstacles such as the lack of training examples, aiming at increasing a learning system's capability to deal with unexpected events in the same way that people do.
Most state-of-the-art techniques solve the zero-shot problem by embedding the feature space of the training data and the semantic representation of the class labels in some vector space in which similarity is preserved. Then, unseen classes can be classified by means of a nearest-neighbor search. In the generalized zero-shot case, we seek to design a more generic model $\mathbb{R}^{d} \rightarrow Y_{seen} \cup Y_{unseen}$ that is able to classify both the seen and the unseen classes appropriately.
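The nearest-neighbor view of ZSL and GZSL described above can be illustrated with a minimal sketch. It assumes that a sample has already been embedded into the same vector space as the class prototypes; the function names, the cosine similarity measure, and the toy attribute vectors are illustrative assumptions and are not taken from any of the cited methods.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_class(embedded_sample, class_prototypes):
    """Return the label whose prototype is most similar to the sample."""
    return max(
        class_prototypes,
        key=lambda label: cosine_similarity(embedded_sample, class_prototypes[label]),
    )


# Hypothetical 3-dimensional attribute vectors (toy values, not from any dataset).
seen_prototypes = {"horse": [1.0, 0.0, 0.0]}
unseen_prototypes = {"zebra": [1.0, 0.0, 1.0], "whale": [0.0, 1.0, 0.0]}

sample = [1.0, 0.0, 0.1]  # an embedded test sample (assumed given)

# ZSL: the search space contains only the unseen classes.
zsl_label = nearest_class(sample, unseen_prototypes)

# GZSL: the search space is the union of seen and unseen classes.
gzsl_label = nearest_class(sample, {**seen_prototypes, **unseen_prototypes})
```

In this toy case the same sample is assigned to an unseen class under the ZSL search space and to a seen class under the GZSL search space, which illustrates why the generalized setting is the harder one.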

Approach
The Variational Autoencoder (VAE) [32] consists of an encoder and a decoder, which are trained to maximize an objective known as the Evidence Lower Bound (ELBO). In both the encoder and the decoder, the variable $z$ represents the hidden, latent space and the variable $x$ represents the data. The encoder $q_{\Phi}(z|x)$ has the parameters $\Phi$ and maps from data space to latent space, while the decoder $p_{\theta}(x|z)$ has the parameters $\theta$ and maps from latent space to data space. The lower bound for $\log p(x)$ can be written as:

$$\mathcal{L}(\theta, \Phi; x) = \mathbb{E}_{q_{\Phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - KL\left(q_{\Phi}(z|x) \,\|\, p_{\theta}(z)\right) \tag{1}$$

In Equation (1), KL denotes the Kullback-Leibler divergence between the encoder's distribution $q_{\Phi}(z|x)$ and $p_{\theta}(z)$.
A Conditional Variational Autoencoder (CVAE) [33] consists of an encoder and a decoder that are conditioned on an additional variable: besides the data variable $x$, they receive the condition variable $c$. Thus, it is possible to generate samples with desired properties that are encoded by $c$. The loss function can be given as:

$$\mathcal{L}(\theta, \Phi; x, c) = \mathbb{E}_{q_{\Phi}(z|x,c)}\left[\log p_{\theta}(x|z,c)\right] - KL\left(q_{\Phi}(z|x,c) \,\|\, p_{\theta}(z|c)\right) \tag{2}$$

In this work, our loss function considers only the reconstruction term, which is computed as the Mean Squared Error (MSE).
We chose this loss function because the researchers in [34-36] showed that, in many situations, the KL divergence in the standard variational objective (see Equation (1)) does not allow the model to use the latent variables effectively. In this paper, we show that dropping the Kullback-Leibler (KL) term from the Variational Autoencoder [32] objective leads to promising performance.
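The difference between the full variational objective and the reconstruction-only loss can be made concrete with a small sketch. It assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a well-known closed form, and it assumes an unweighted sum of MSE and KL for the full objective; these modeling choices and all function names are illustrative assumptions, not the authors' implementation.

```python
import math


def mse(x, x_hat):
    """Mean squared error between a feature vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def negative_elbo(x, x_hat, mu, log_var):
    """Reconstruction term plus KL term (unweighted sum is an assumption)."""
    return mse(x, x_hat) + gaussian_kl(mu, log_var)


def reconstruction_only_loss(x, x_hat):
    """KL-free loss as described in the text: only the MSE term is kept."""
    return mse(x, x_hat)


# Toy values chosen for illustration only.
x = [1.0, 2.0, 3.0]
x_hat = [1.0, 2.0, 5.0]
mu = [1.0, 0.0]
log_var = [0.0, 0.0]

full_loss = negative_elbo(x, x_hat, mu, log_var)
kl_free_loss = reconstruction_only_loss(x, x_hat)
```

With the KL term removed, the encoder is no longer penalized for placing latent codes away from the prior, so the two losses differ exactly by the KL value.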
Algorithm 1 shows the training steps. Firstly, the algorithm requires the image features (visual space) $X_{seen}$, the labels of the image features $Y_{seen}$, and the vectors of the semantic space $S_{seen}$. Then, the first autoencoder $Autoencoder_1$ is trained using $X_{seen}$ combined with $S_{seen}$ and learns the latent space $z$ to generate $\hat{x}$ given $S_{seen}$. Then, the second autoencoder $Autoencoder_2$ is trained using $X_{seen}$ and learns the latent space $z$ to generate $\hat{X}_{seen}$ given $S_{seen}$.

Algorithm 1 Training
Require: X_seen, Y_seen, S_seen
Ensure: Autoencoder_1, Autoencoder_2
    Train the conditional model (Autoencoder_1, condition is S_seen): (X_seen, S_seen) → X_seen
    Train the conditional model (Autoencoder_2, condition is S_seen): X_seen → X_seen

Algorithm 2 shows the detailed steps to classify the unseen classes. The algorithm requires the first autoencoder $Autoencoder_1$, the second autoencoder $Autoencoder_2$, and the semantic vectors of the unseen labels $S_{unseen}$. The encoder of the first autoencoder $Autoencoder_1$ estimates $q(z^{(i)} | x^{(i)}, S_{y_i})$, where the input of the encoder is the image feature concatenated with the semantic vector. Then, the decoder of $Autoencoder_1$ tries to reconstruct $x$ using a $z$ sampled from a standard normal distribution and concatenated with $S_{unseen}$. The encoder of the second autoencoder $Autoencoder_2$ also estimates $q(z^{(i)} | x^{(i)}, S_{y_i})$, but the input of the encoder is only the image feature space. Then, the decoder of $Autoencoder_2$ tries to reconstruct $x$ using a $z$ sampled from a standard normal distribution and concatenated with $S_{unseen}$. The generated $\hat{x}$ from both autoencoders are combined to form the pseudo training data for a support vector machine. Then, the Support Vector Machine (SVM) is trained, and its parameters are fitted. We use it to predict the labels of the test samples of the unseen classes.
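The way the inputs of the two autoencoders differ can be summarized in a short sketch. Only the assembly of the encoder and decoder inputs is shown; the networks themselves are omitted. The helper names and the toy dimensions are hypothetical, and the order of concatenation in the decoder input (semantic vector first, then z) follows the order written in Algorithm 2.

```python
def encoder_input(x, s, use_semantics):
    """Build the encoder input for one sample.

    use_semantics=True  -> first autoencoder: visual features followed by the semantic vector
    use_semantics=False -> second autoencoder: visual features only
    """
    return list(x) + list(s) if use_semantics else list(x)


def decoder_input(s, z):
    """Both decoders are conditioned: semantic vector concatenated with the latent code z."""
    return list(s) + list(z)


# Toy dimensions for illustration only (real feature and attribute vectors are much larger).
x = [0.2, 0.4, 0.6, 0.8]   # visual feature vector of a seen-class sample
s = [1.0, 0.0]             # semantic vector of its class
z = [0.1, -0.3, 0.5]       # latent code

ae1_encoder_in = encoder_input(x, s, use_semantics=True)
ae2_encoder_in = encoder_input(x, s, use_semantics=False)
ae1_decoder_in = decoder_input(s, z)
ae2_decoder_in = decoder_input(s, z)
```

In both cases the reconstruction target is the visual feature vector x itself, so the two models differ only in what their encoders see.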

Algorithm 2 Unseen classes classification
Require: Autoencoder_1, Autoencoder_2, X_unseen, S_unseen, Y_unseen
Ensure: classLabel
    TrainingSet_Autoenc_1 = ∅
    for y_unseen ∈ Y_unseen do
        for i in NumOfSamples do
            # Sample from a Gaussian distribution
            z ~ N(0, 1)
            # Concatenate z and the unseen semantic class label
            tmpV_i = S_unseen ⊕ z
            # Generate a pseudo-sample from the first autoencoder
            PseudoX_i ← Decoder_Autoencoder_1(tmpV_i)
            # Add the sample and the unseen class label to TrainingSet_Autoenc_1
            TrainingSet_Autoenc_1 ← TrainingSet_Autoenc_1 ∪ (PseudoX_i, y_unseen)
        end for
    end for
    TrainingSet_Autoenc_2 = ∅
    for y_unseen ∈ Y_unseen do
        for i in NumOfSamples do
            # Sample from a Gaussian distribution
            z ~ N(0, 1)
            # Concatenate z and the unseen semantic class label
            tmpV_i = S_unseen ⊕ z
            # Generate a pseudo-sample from the second autoencoder
            PseudoX_i ← Decoder_Autoencoder_2(tmpV_i)
            # Add the sample and the unseen class label to TrainingSet_Autoenc_2
            TrainingSet_Autoenc_2 ← TrainingSet_Autoenc_2 ∪ (PseudoX_i, y_unseen)
        end for
    end for
    S_training = TrainingSet_Autoenc_1 ∪ TrainingSet_Autoenc_2
    Fit the SVM model using S_training
    # Use the trained SVM model
    classLabel = SVM(X_unseen)
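The generation loop of Algorithm 2 can be sketched in a few lines of Python. The two decoders below are hypothetical stand-in functions rather than trained networks, and the class names and semantic vectors are toy values; only the control flow (sampling z, concatenating it with the semantic vector, decoding, and taking the union of both pseudo training sets) mirrors the algorithm. The latent size of 50 and the 200 samples per class are the values reported in the experiments.

```python
import random


def generate_pseudo_set(decoder, unseen_semantics, num_samples, latent_dim, rng):
    """Generate (pseudo_feature, label) pairs for every unseen class."""
    training_set = []
    for label, s in unseen_semantics.items():
        for _ in range(num_samples):
            # Sample z from a standard normal distribution.
            z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            # Concatenate the semantic vector of the class with z and decode.
            pseudo_x = decoder(list(s) + z)
            training_set.append((pseudo_x, label))
    return training_set


# Hypothetical stand-ins for the trained decoders (assumption: any callable
# mapping the concatenated vector to a feature vector can be plugged in here).
def toy_decoder_1(v):
    return [sum(v), v[0]]


def toy_decoder_2(v):
    return [v[0] - v[-1], v[1]]


rng = random.Random(0)
unseen_semantics = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}  # toy values

set_1 = generate_pseudo_set(toy_decoder_1, unseen_semantics, 200, 50, rng)
set_2 = generate_pseudo_set(toy_decoder_2, unseen_semantics, 200, 50, rng)

# Union of both pseudo training sets, as in the last steps of Algorithm 2.
s_training = set_1 + set_2
features = [x for x, _ in s_training]
labels = [y for _, y in s_training]
```

The resulting `features` and `labels` would then be passed to an SVM implementation. The source does not name the SVM library; scikit-learn's `SVC` with `C=100` would be one possible choice, stated here as an assumption.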

Experiments
In the field of ZSL, there are well-known benchmark datasets. We selected four of them to test the performance of the proposed approach. We used the SUN Attribute (SUN) dataset [37], which consists of 14340 images, with 645 seen classes and 72 unseen classes, and the Caltech-UCSD-Birds (CUB) dataset [38], which consists of 11788 images, with 150 seen classes and 50 unseen classes. In addition, we used the Animals with Attributes1 (AwA1) and Animals with Attributes2 (AwA2) datasets. Regarding the visual space, we used the Residual Neural Network 101 (ResNet101) features [39]. Concerning the semantic space, we rely on the semantic space vectors given by the authors of those datasets. Both autoencoders have a dense layer, followed by a dropout and a second dense layer. This is followed by another layer, which generates the values z. Activation functions are ReLU, and the activation functions for the last layer for both the encoder and the decoder are linear. In addition, we use the Keras [40] framework in combination with the TensorFlow backend [41] for the implementation.
In our model, the hyperparameters are divided into two categories: the network hyperparameters and the Support Vector Machine (SVM) cost parameter. The network hyperparameters are set as follows: the batch size is 50, the size of the latent variable is 50, and the optimizer is Adam [42]. The number of generated samples for each class is 200. Cross-validation on the training classes is used to determine the latent variable size. The SVM cost parameter is set to 100. To calculate the overall accuracy, we used the per-class average:

$$acc_{\mathcal{Y}} = \frac{1}{\|\mathcal{Y}\|} \sum_{c=1}^{\|\mathcal{Y}\|} \frac{\#\,\text{correct predictions in class } c}{\#\,\text{samples in class } c}$$

Regarding GZSL, we explored the generalized zero-shot situation [43]. We kept aside 20% of the data from the training images and trained the model using the remaining 80% of the data. The SVM is trained using both the seen and the unseen classes to avoid biased performance toward seen classes. For Generalized Zero-Shot Learning (GZSL), we followed the recommendation in [44] to consider the harmonic mean of the accuracies on seen and unseen classes.

Table 1 shows the state-of-the-art comparison on four datasets using the per-class average and the suggested splits from [39]. Our HFM model shows classification scores of 69.5%, 65.0%, 65.5%, and 53.8% on CUB, AwA1, AwA2, and SUN, respectively. For the ZSL setting, Table 1 shows that f-VAEGAN-D2 [25] and TF-VAEGAN [26] performed the best on the AwA2 and SUN datasets. However, our model outperforms them on the CUB dataset. This result is promising because our model showed an improved performance on the highly fine-grained CUB dataset, which means that the generated pseudo-images gave a separable output space. We attribute this to excluding the Kullback-Leibler divergence and to the hybrid nature of our reconstructed image feature space. Unfortunately, the authors of f-VAEGAN-D2 and TF-VAEGAN did not provide any results for the AwA1 dataset.
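The two evaluation measures used in this section, the per-class average accuracy and the harmonic mean of the seen and unseen accuracies, can be sketched as follows. The function names and the toy labels are illustrative assumptions; the formulas are the standard definitions of both measures.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies (each class has equal weight)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen and unseen accuracy, as used for GZSL."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# Toy example with an imbalanced test set (illustrative labels only).
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "b", "b"]

acc = per_class_accuracy(y_true, y_pred)  # class a: 3/4, class b: 1/1
h = harmonic_mean(0.5, 0.25)
```

Because every class contributes equally, the per-class average differs from the plain sample-level accuracy whenever the classes are imbalanced. The harmonic mean is dominated by the lower of its two inputs, so a model that is strong only on seen classes receives a low score.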

Results and Discussion
As shown in Figure 5, we visually inspect the image feature vectors produced by our model for each class using the t-SNE [45] technique, and we compare them to the original test image feature vectors for the AwA1 dataset. We observe that the proposed approach can accurately simulate the underlying image features. In addition, we observe that the reconstructed image features did not exclude many modes compared to the real distribution.

Table 2 shows the results for Generalized Zero-Shot Learning (GZSL) compared to well-known state-of-the-art approaches. The table shows comparable performance on the CUB and AwA2 datasets. However, the proposed approach showed better performance on the AwA1 dataset. Table 2 shows that our HFM model has a harmonic mean score of 43.4%, 61.6%, 63.4%, and 29.7% on CUB, AwA1, AwA2, and SUN, respectively. The results for generalized zero-shot learning can be explained by the fact that the ELBO without the KL divergence (KL-free) is still theoretically a valid target for generative modeling using VAEs [35].

Table 3 shows the results for every autoencoder on the four datasets under the ZSL setting. The results in the table confirm that combining the image feature spaces that are generated by both autoencoders improved the overall performance significantly.

Table 4 shows the results for the generalized zero-shot setting (GZSL), which are calculated based on the per-class average using seen classes, unseen classes, and the harmonic mean. Furthermore, other recent works, e.g., AFR-Net [46] and GEM-ZSL [47], showed competitive results compared to our approach using different experimental settings. In AFR-Net [46], the authors proposed an adversarial network consisting of a residual generator, a prototype predictor, and a discriminator to synthesize compact semantic visual features for ZSL. The goal of the authors of GEM-ZSL [47] is the estimation of the real human gaze position to determine the visual attention areas for recognizing an unseen object using the semantic description of attributes. Thus, a feedback module combined with the decoder of each VAE may improve the overall performance on the GZSL problem.

Conclusions
Zero-shot learning concerns building machine learning models that can classify or predict classes (labels) that are not included in the training set. In this work, a generative zero-shot learning model is developed. The model can be extended to different use case scenarios. In addition, this work provided intensive tests and detailed coverage of state-of-the-art technology. According to our results, the model shows promising results in some cases compared to the state-of-the-art methods considering three benchmark datasets, even in the case of generalized zero-shot learning. Our proposed method showed that (a) excluding the Kullback-Leibler (KL) divergence from the conditional VAE loss synthesizes discriminative image features for classifying unseen classes in ZSL problem settings, and (b) using multiple VAEs generates an improved discriminative image space where data are easier to separate for classification purposes. A limitation of the proposed approach is that the proposed model lacks a feedback module that can improve the reconstructed pseudo-image space. In our future work, we will add a feedback module and combine the generative model with an additional embedding model. This means that the model maps both the real samples and the pseudo samples produced by the generative model into a new embedding space where classes are better separable.