Article

Exploiting Semantic-Visual Symmetry for Goal-Oriented Zero-Shot Recognition

by Haixia Zheng 1, Yu Zhou 2,3 and Mingjie Jiang 1,*
1 College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
2 College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan 030024, China
3 Shanxi Energy Internet Research Institute, Taiyuan 030000, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1291; https://doi.org/10.3390/sym17081291
Submission received: 4 July 2025 / Revised: 3 August 2025 / Accepted: 7 August 2025 / Published: 11 August 2025
(This article belongs to the Special Issue Asymmetry in Machine Learning)

Abstract

Traditional machine learning methods can only classify instances of classes seen during training, yet many practical applications require recognizing classes that are unknown at training time. To tackle this challenging task, zero-shot learning incorporates additional semantic information to establish a semantic-visual symmetry, thereby facilitating the transfer of knowledge from known to unknown classes. Although user-defined attributes are commonly utilized to provide prior semantic information for zero-shot recognition, their importance for discrimination is not always consistent. Motivated by the observation that images contain both latent discriminative features and latent attributes, this paper proposes a goal-oriented joint learning architecture that establishes symmetric relationships between images, attributes and categories for zero-shot learning. More specifically, we model the latent feature and attribute spaces using a dictionary learning architecture. A linear transformation is applied to learn the symmetric relationships between latent features and latent attributes while maintaining the semantic information. Moreover, seen-class classifiers are trained to enhance the discriminability of the latent features. Extensive experiments on three representative benchmark datasets show that the proposed algorithm outperforms existing methods, highlighting the effectiveness of modeling explicit symmetry in the semantic-visual space for robust zero-shot knowledge transfer.

1. Introduction

Visual recognition has experienced significant progress in recent years, largely due to the expansion of data scale and advancements in classification methods, especially the capability of deep learning techniques, which provide an end-to-end solution from feature extraction to classification. Despite these exciting advances, typical recognition approaches require a large number of labeled training instances for each class. Additionally, standard classification methods are limited to recognizing samples from the classes seen during training and are unable to identify instances from previously unknown classes.
However, in many real-world applications, collecting large-scale labeled instances for all classes is challenging. Detailed annotation of numerous samples is time-consuming and often requires specialized domain expertise. Additionally, many categories, such as endangered birds, rare wildlife, and unusual diseases, lack sufficient samples, making it expensive or even impossible to collect thousands of examples. Furthermore, the classes in the test dataset may not be present in the training instances. For instance, in activity recognition [1], there is a wide range of human activities, yet available datasets only cover a restricted number of activity classes; as a result, many activity classes have no annotated examples in these datasets. In image captioning [2], the training data are structured as a sequence prediction problem, with a caption paired with each image; however, current image–text corpora only cover a small subset of object classes, so many object classes never appear in the training captions. Similarly, recognizing products of a specific style and brand can be challenging, as the frequent emergence of new designs and brands makes it exceedingly difficult to find labeled samples for certain new products. From the application examples mentioned above, it is evident that many classes lack labeled instances. Therefore, it is crucial and desirable for a classifier to be able to recognize the class labels of samples belonging to these unseen classes. Zero-shot learning (ZSL) techniques [3] provide a viable solution to these issues. With the growing complexity of image annotation and the increasing scale of data, zero-shot learning has emerged as a favored research area in machine learning, with a wide range of applications, including ubiquitous computing, natural language processing, computer vision, and so on.
Given prior knowledge of seen classes, humans have a remarkable capacity to identify new classes based on the shared and unique characteristics of known and unknown classes. For example, knowing that a zebra “looks like a horse with black and white stripes” enables us to identify zebras even if we have never seen one before, as long as we are familiar with the appearance of a horse and of “black and white stripes” [4]. Inspired by the human capacity to identify unseen objects, zero-shot learning seeks to construct a model with strong generalization capability that can recognize objects from unknown classes. This is achieved by transferring knowledge from training samples to the test instance recognition task using auxiliary semantic information, as illustrated in Figure 1. Semantic information, which captures the visually distinguishing characteristics of objects, includes visual attributes [5], textual descriptions [6], word vectors of class names [7], hierarchical ontologies of classes [8] (like WordNet), and human gazes [9]. In essence, zero-shot learning (ZSL) utilizes semantic information to establish a semantic-visual symmetry, thereby facilitating the transfer of knowledge from seen to unseen classes.
A noteworthy approach to zero-shot learning is attribute-based learning, a pivotal technique in computer vision. This method typically involves a two-step process [10], which requires training a classifier for each attribute. When classifying a new image, the first step is to predict its attributes using these trained classifiers. Subsequently, the image’s class label is determined by finding the class with the most similar collection of attributes. However, a common challenge faced by two-stage methods is the domain shift [11] between the intermediate task of learning attribute classifiers and the ultimate goal of predicting class labels.
Given that semantic descriptions act as a connection between seen and unseen classes, and that visual features are crucial for recognizing unseen objects, the core of ZSL is to develop symmetric alignment mechanisms that can effectively align the visual features of images with the semantic representations of classes [12,13]. A popular strategy is to employ a bi-linear compatibility function aligning the low-level visual representations of known classes with their associated semantic descriptions, e.g., DeViSE [14], SJE [15], ESZSL [16], ALE [17] and SAE [18]. These methods, despite their simplicity, have consistently achieved superior performance on benchmark datasets [19]. An extension of the above-mentioned methods is to explore a more complicated non-linear function to establish the symmetric relationships between the semantic and visual domains. However, these enhanced approaches, such as CMT [7] and LATEM [8], are not as competitive as their bi-linear counterparts, potentially because more complicated models require larger training datasets to achieve better generalization ability.
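To make the bi-linear compatibility idea concrete, the following minimal NumPy sketch scores one image against every class with F(x, y) = θ(x)ᵀWφ(y) and predicts the class with the highest score; the array shapes, names, and random data are illustrative placeholders rather than the implementation of any particular method cited above.

```python
import numpy as np

def compatibility_scores(theta_x, W, Phi):
    """Bi-linear compatibility F(x, y) = theta(x)^T W phi(y) for every class.

    theta_x : (d,)   visual feature of one image
    W       : (d, k) learned compatibility matrix
    Phi     : (C, k) semantic class embeddings, one row per class
    Returns a (C,) score vector; the predicted class is the argmax.
    """
    return Phi @ (W.T @ theta_x)

# Toy usage with random data (shapes only; no training is performed here).
rng = np.random.default_rng(0)
theta_x = rng.normal(size=128)      # image feature
W = rng.normal(size=(128, 85))      # compatibility matrix
Phi = rng.normal(size=(10, 85))     # attribute vectors of 10 candidate classes
predicted_class = int(np.argmax(compatibility_scores(theta_x, W, Phi)))
```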
Although the majority of zero-shot learning techniques exploit discriminative loss functions to establish the cross-modal symmetric relationships between image data and class embeddings, there exist a few generative-based methods [20,21,22], which leverage examples from known classes along with semantic representations of both known and unknown classes to synthesize images or visual features for unknown classes. Once samples for unknown classes are generated, the zero-shot learning issue can then be transformed into a traditional supervised learning task.
Motivated by the observation that both feature and attribute spaces have corresponding latent representations, which are discriminative for object recognition, and that latent attributes preserve semantic information capable of bridging the gap between known and unknown classes, this paper proposes a goal-oriented joint learning approach. This method promotes a more consistent and symmetric alignment between images, attributes, and categories to tackle the challenges inherent in zero-shot learning.
The main contributions of this paper are summarized as follows:
  • This work leverages the dictionary learning framework to model the latent feature and attribute spaces. Specifically, images are reconstructed using components from the latent feature dictionary. The original attributes are expressed as distinct combinations of elements from the latent attribute dictionary, with each combination representing a latent attribute. Additionally, the latent attributes naturally capture the correlations among different attributes.
  • For the purpose of maintaining semantic information, a linear transformation is applied to establish the symmetric relationships between latent features and latent attributes. Therefore, the latent features can be regarded as various combinations of the latent attributes.
  • The classifiers for seen classes are trained to enhance the discriminability of latent features, with the output probabilities being interpreted as a measure of similarity to the seen classes. As a result, the image representations are transformed from the latent feature domain to the label space.
  • The experimental results are very encouraging. The proposed approach outperforms recent advanced ZSL approaches, demonstrating the power of jointly learning the symmetric alignment between images, categories and attributes.
The remainder of this paper is structured as follows. Section 2 provides an overview of the relevant literature. In Section 3, we describe the proposed approach in detail. Section 4 demonstrates the experimental results and comparisons. Finally, this work is summarized in Section 5.

2. Related Work

This section provides a brief overview of relevant zero-shot learning (ZSL) works.
ZSL is capable of identifying classes that have never been seen before by leveraging the cross-category property of attributes to establish the relationship between known and unknown classes. A straightforward method for zero-shot recognition involves training classifiers specifically for attributes and then identifying an image based on the predicted attributes and the descriptions of the unknown classes [5,23]. Initially, attribute learning was commonly treated as a binary classification task, with each attribute classifier being trained separately and independently. DAP [24] first trained probabilistic classifiers for each attribute, enabling it to assess the likelihood of each attribute being present in an image; subsequently, it calculated the posterior probabilities for each class and determined the class label by MAP estimation. The authors in [10] utilized a random forest model to estimate the posterior probabilities of each class after attribute classifier learning, which proved effective for potentially unreliable attributes. The two-step methodology has been further applied to scenarios where attributes are not available. CONSE [25] first predicted the posterior probabilities of known classes and then projected image features into the Word2vec space [26] through a convex combination of the top T most probable known classes. Acknowledging the inherent interconnections of attributes, the authors in [27] integrated the relationships between attributes into their learning model.
Given that semantic descriptions bridge the gap between known and unknown classes, and that visual features are crucial for recognizing unseen objects, it is natural to devise a symmetric alignment mechanism that converts the low-level visual features of known classes into their related semantic representations [12,13]. The new classes are recognized through comparing the similarity between the prototype and predicted representations of data samples in the embedding space. For instance, SOC [28] mapped image features into the semantic space and subsequently conducted a search for the closest class embedding vector. DeViSE [14] learned a linear transformation from the image domain to the semantic domain using an effective ranking loss. SJE [15] acquired bi-linear compatibility by optimizing the structural SVM loss. ESZSL [16] learned bi-linear compatibility using the square loss. ALE [17] utilized ranking loss to train a bi-linear compatibility function that connected the image and attribute space. After embedding visual features into the attribute domain, the authors in [29] learned a metric to make the semantic embeddings more consistent. SAE [18] presented a semantic autoencoder that would regularize the model through projecting image features into the reconstructed semantic domain.
The advancements to the above-mentioned methodologies are based on non-linear multimodal embeddings. CMT [7] employed a neural network architecture comprising two hidden layers to establish a non-linear mapping from an image feature space to the word2vec domain [26]. The authors in [6] utilized a deep convolutional neural network to acquire a visual-to-semantic symmetric relationship, different from previous works that built their embeddings based on fixed image features. An end-to-end deep embedding approach was presented by the authors in [30] for transforming semantic representations into the visual domain, since visual features are more discriminative than semantic representations. The authors in [31] utilized a support vector regressor to learn a projection from class-level semantic representations to visual features and then applied nearest neighbor classifiers to those projected representations. The support vector regressor was trained on visual exemplars from known classes, i.e., class centroids in the feature domain.
Both semantic and visual embedding models aim to learn how to transform information from one modality to another. However, different modalities of data have distinct characteristics; thus, it is challenging to learn a symmetric alignment function between them. As an alternative approach for zero-shot learning, researchers have developed methods that embed both visual and semantic representations into an intermediate space [32]. Such approaches exploit the common semantic information across different modalities of data, enabling the projections of visual and semantic representations belonging to the same class to be positioned closely together in the intermediate space. SSE [33] suggested that this common space was composed of known classes in varying proportions and that images from the same class should exhibit similar patterns of mixtures within this space. Furthermore, similarity-based methods [4] developed classifiers for unknown classes by relating them to known classes based on class-wise similarities. Motivated by the clustering characteristic of samples belonging to the same category, the authors in [34] adopted the structured prediction approach to identify unseen-class samples. Hybrid models utilized a combination of seen-class representations to model images and semantic embeddings. For instance, SYNC [35] constructed classifiers for unseen classes by combining base classifiers learned using a discriminative learning approach. In order to link attributes with specific portions of images, the authors of [36] jointly embedded various textual representations and visual components.
Some studies [37,38] employed transfer learning techniques to transfer knowledge acquired from observed classes to unobserved ones. The authors of [38] addressed the domain shift problem between observed and unobserved classes through domain adaptation strategies. Additionally, authors in [11] introduced a multi-view embedding method that constructed graph models incorporating both observed and unobserved class samples, with the aim of minimizing the domain gap between these two sets of classes. Authors in [39] introduced a semi-supervised framework that directly learned classifiers for unseen classes. Authors in [29] proposed a metric learning method to address the ZSL issue.
The aforementioned methods excel at categorizing observed classes but struggle with recognizing unseen ones. This limitation stems primarily from the scarcity of visual training data for unseen classes, leading to a bias in the way the semantic and visual domains are mapped. Consequently, numerous test instances belonging to unobserved classes are mistakenly classified as one of the observed classes. To mitigate this issue, recent studies have utilized GANs to generate synthetic visual representations for both observed and unobserved classes, thereby facilitating the training of a classifier capable of distinguishing between both types of classes [21,22]. GLaP [40], which presumed a Gaussian distribution for each class, attempted to generate virtual samples for unobserved classes based on the learned distribution. However, the unconstrained generation of synthetic visual representations often yields samples that diverge significantly from the actual distribution of the unseen classes.
Different from the previous zero-shot learning methods, the proposed goal-oriented joint learning algorithm exploits dictionary learning to construct latent spaces for features and attributes. The learned latent features not only preserve semantic information to establish the relationships among the classes but also exhibit enhanced discrimination capabilities, enabling more reliable classification during the test phase.

3. Methodology

In this section, we introduce the proposed goal-oriented joint learning approach for zero-shot recognition in detail, as illustrated in Figure 2.

3.1. Problem Definition

The zero-shot learning task is defined as follows. Let $S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ denote the training dataset, i.e., the seen-class data, where $x_i^s \in \mathcal{X}^S$ is the $i$-th image in the training dataset and $y_i^s \in \mathcal{Y}^S$ is its associated class label, which is available during training. Let $U = \{(x_j^u, y_j^u)\}_{j=1}^{n_u}$ denote the test dataset, i.e., the unseen-class data, where $x_j^u \in \mathcal{X}^U$ is the $j$-th unseen image and $y_j^u \in \mathcal{Y}^U$ is the label of $x_j^u$. The label sets of the seen and unseen classes are disjoint, i.e., $\mathcal{Y}^S \cap \mathcal{Y}^U = \emptyset$.
Every class label has a pre-defined semantic representation, referred to as the semantic class prototype. Suppose there are $C_S$ and $C_U$ categories in total for the seen and unseen data, respectively. The semantic descriptions for the seen and unseen classes can be written as $A^S = \{a_i^s\}_{i=1}^{C_S}$ and $A^U = \{a_j^u\}_{j=1}^{C_U}$, where $a_i^s$ and $a_j^u$ are the semantic description vectors for the $i$-th seen class and the $j$-th unseen class, respectively. Semantic descriptions generally include word embeddings or binary/numerical class-level vectors annotated with various user-defined visual attributes.
Formally, given the labeled seen-class dataset $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, the objective of ZSL is to predict the associated category $y_j^u$ of each unseen-class sample $x_j^u$ using the auxiliary information $A^S$ and $A^U$ for semantic knowledge transfer.
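For concreteness, the quantities defined above can be laid out as plain arrays. The sketch below uses toy sizes loosely modeled on AwA; all names and dimensions are illustrative and do not reflect the actual data pipeline used later in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4096, 85          # feature and attribute dimensionalities (AwA-like)
C_S, C_U = 40, 10        # numbers of seen and unseen classes
n_s, n_u = 2000, 500     # numbers of seen/unseen images (toy sizes)

X_s = rng.normal(size=(d, n_s))        # seen-class features, one column per image
y_s = rng.integers(0, C_S, size=n_s)   # seen-class labels in {0, ..., C_S - 1}
A_S = rng.uniform(size=(k, C_S))       # seen-class semantic prototypes a_i^s
A_U = rng.uniform(size=(k, C_U))       # unseen-class semantic prototypes a_j^u
X_u = rng.normal(size=(d, n_u))        # unseen-class features (labels withheld)

# One-hot label matrix Y in R^{C_S x n_s}, as used later in Equation (1).
Y = np.zeros((C_S, n_s))
Y[y_s, np.arange(n_s)] = 1.0
```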

3.2. Formulation

The primary challenge in the zero-shot learning task is to establish connections between known and unknown classes. Although attributes play a significant part in conventional zero-shot recognition methods, two factors need to be taken into consideration. Firstly, user-defined attributes may not contribute equally to distinguishing between classes, as inter-dependencies often exist among attributes; therefore, it may not be optimal to learn each attribute individually. Secondly, class labels are frequently ignored during the learning phase, and attribute classifiers are typically trained independently of the ultimate zero-shot learning task. Consequently, optimizing attribute prediction without considering the downstream recognition task may not produce the most effective attribute predictor.
Motivated by the observation that there exist latent spaces for both features and attributes, where latent feature representations are discriminative for object recognition and latent attributes preserve semantic information that bridges the gap between known and unknown classes, we design a joint learning approach to capture the symmetric relationships between images, attributes, and categories. The formulation of this approach is as follows:
$$\underset{B,E,D,Z,M,W}{\arg\min}\ \|X - BE\|_F^2 + \|A - DZ\|_F^2 + \frac{\lambda}{2}\|E - MZ\|_F^2 + \frac{\gamma}{2}\|Y - WE\|_F^2 \quad \text{s.t.}\ \|b_i\|_2^2 \le 1,\ \|d_i\|_2^2 \le 1,\ \|m_i\|_2^2 \le 1,\ \|w_i\|_2^2 \le 1,\ \forall i \quad (1)$$
where $X$ denotes the feature set; $B$ is the dictionary for the latent feature space; $D$ is the dictionary for the latent attribute space; $A$ denotes the semantic attribute set; $E$ denotes the latent feature representations; $Z$ denotes the latent attribute representations; and $W$ can be regarded as the classifiers for the known classes within the latent feature domain. The class label matrix $Y = [y_1^s, y_2^s, \ldots, y_{n_s}^s] \in \mathbb{R}^{C_S \times n_s}$, where $y_i^s = [0, \ldots, 0, 1, 0, \ldots, 0]^{\top}$ is a one-hot vector denoting the class label of $x_i^s$. The label space $\mathbb{R}^{C_S}$ is composed of the column vectors $y_i^s$ of $Y$, where each element of $y_i^s$ signifies the similarity to a specific known class. $b_i$, $d_i$, $m_i$ and $w_i$ are the $i$-th columns of $B$, $D$, $M$ and $W$, respectively. $\lambda$ and $\gamma$ are hyperparameters controlling the strengths of the two regularization terms.
There are four terms in the formulation of Equation (1):
  • The first term $\|X - BE\|_F^2$ represents the reconstruction error, ensuring that the learned dictionary $B$ is able to encode the feature set $X$ as much as possible, so that the original features $X$ can be accurately reconstructed using the latent features $E$.
  • The minimization of $\|A - DZ\|_F^2$ enables the dictionary $D$ to encode the original attribute matrix $A$ well, so that the semantic attributes $A$ can be reproduced through the latent attribute elements $Z$.
  • The symmetric alignment loss $\|E - MZ\|_F^2$ aims to maintain semantic information. The linear transformation matrix $M$ is applied to establish the symmetric relationships between the latent features $E$ and the latent attributes $Z$.
  • The fourth term $\|Y - WE\|_F^2$ enhances the discriminability of the latent features $E$. Specifically, a linear mapping $W$ is trained to transform the latent feature domain into the label space, where instances belonging to the same class are clustered closely together, while instances from different classes are separated apart, enabling this mapping to effectively discriminate between different classes.
In summary, we utilize a dictionary learning architecture to simultaneously establish the symmetric relationships between features, attributes and object classes. As demonstrated in Equation (1), the latent feature domain is associated with two kinds of semantic domains. Firstly, the linear transformation matrix M enforces semantic-visual symmetry between the latent feature domain and the semantic attribute domain, ensuring that latent features preserve semantic information. Secondly, the classifier W associates the latent feature domain with the label space, enabling the learned latent features to be both semantically rich and discriminative for classification tasks. Essentially, our approach can be interpreted as a cohesive semantic embedding for images, attributes, and categories.
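As a small aid for following the optimization in the next subsection, the NumPy sketch below evaluates the objective of Equation (1) for a given set of variables, e.g., to monitor convergence; it assumes the λ/2 and γ/2 weighting written above and omits the unit-norm column constraints, which are enforced in the dictionary updates.

```python
import numpy as np

def objective(X, A, Y, B, E, D, Z, M, W, lam, gamma):
    """Value of the joint objective in Equation (1), constraints excluded."""
    recon_feat = np.linalg.norm(X - B @ E, "fro") ** 2             # ||X - BE||_F^2
    recon_attr = np.linalg.norm(A - D @ Z, "fro") ** 2             # ||A - DZ||_F^2
    align = 0.5 * lam * np.linalg.norm(E - M @ Z, "fro") ** 2      # (lambda/2)||E - MZ||_F^2
    discrim = 0.5 * gamma * np.linalg.norm(Y - W @ E, "fro") ** 2  # (gamma/2)||Y - WE||_F^2
    return recon_feat + recon_attr + align + discrim
```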

3.3. Optimization

The objective of Equation (1) is non-convex in B, E, D, Z, M and W jointly; however, it is convex with respect to each of these variables individually. For this reason, we apply a general alternating optimization algorithm to minimize Equation (1), which converges to a local optimum within a limited number of iterations. Specifically, we cycle between the six subproblems described below:
(1)
Compute the codes E.
When B, D, Z, M and W are fixed, the optimization problem of computing E becomes
$$\min_E \|\hat{X} - \hat{B}E\|_F^2 \quad (2)$$
where $\hat{X} = \big[X;\ \sqrt{\lambda/2}\,MZ;\ \sqrt{\gamma/2}\,Y\big]$ and $\hat{B} = \big[B;\ \sqrt{\lambda/2}\,I;\ \sqrt{\gamma/2}\,W\big]$ are formed by stacking the corresponding matrices vertically, and $I$ is the identity matrix.
Setting the derivative of Equation (2) with respect to $E$ to zero yields the analytical solution
$$E = \big(\hat{B}^{\top}\hat{B}\big)^{-1}\hat{B}^{\top}\hat{X} \quad (3)$$
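A minimal NumPy sketch of this closed-form update is given below. The vertical stacking with the √(λ/2) and √(γ/2) factors follows the weighting assumed in Equation (1) and should be read as an assumption; the Z-step of Equations (4) and (5) is implemented analogously.

```python
import numpy as np

def update_E(X, B, M, Z, Y, W, lam, gamma):
    """Closed-form E-step of Equations (2)-(3): E = (B_hat^T B_hat)^(-1) B_hat^T X_hat."""
    k = B.shape[1]
    B_hat = np.vstack([B, np.sqrt(lam / 2) * np.eye(k), np.sqrt(gamma / 2) * W])
    X_hat = np.vstack([X, np.sqrt(lam / 2) * (M @ Z), np.sqrt(gamma / 2) * Y])
    # Least squares on the stacked system is equivalent to the normal equations above.
    return np.linalg.lstsq(B_hat, X_hat, rcond=None)[0]
```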
(2)
Compute the codes Z.
When B, E, D, M and W are given, the problem of computing Z becomes
$$\min_Z \|\hat{A} - \hat{D}Z\|_F^2 \quad (4)$$
where $\hat{A} = \big[A;\ \sqrt{\lambda/2}\,E\big]$ and $\hat{D} = \big[D;\ \sqrt{\lambda/2}\,M\big]$.
Setting the derivative of Equation (4) with respect to $Z$ to zero gives the analytical solution
$$Z = \big(\hat{D}^{\top}\hat{D}\big)^{-1}\hat{D}^{\top}\hat{A} \quad (5)$$
(3)
Update the dictionary B.
When E, D, Z, M and W are given, B can be updated by the following equation:
$$\min_B \|X - BE\|_F^2 \quad \text{s.t.}\ \|b_i\|_2^2 \le 1,\ \forall i \quad (6)$$
Equation (6) can be optimized with the Lagrange dual. The obtained closed-form solution for B is
$$B = XE^{\top}\big(EE^{\top} + \Sigma_B\big)^{-1} \quad (7)$$
where $\Sigma_B$ is a diagonal matrix constructed from all the dual variables.
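The sketch below is a simplified stand-in for this Lagrange-dual solver: it applies the closed form of Equation (7) with a small fixed ridge in place of the optimized dual variables Σ_B and then rescales any column whose norm exceeds one. This is a common practical approximation rather than the exact dual solution, and the same routine also serves the updates of D, M and W in Equations (8)–(13) by substituting the corresponding target and code matrices.

```python
import numpy as np

def update_dictionary(T, E, ridge=1e-4):
    """Approximate solution of min_B ||T - B E||_F^2  s.t.  ||b_i||_2^2 <= 1.

    Uses B = T E^T (E E^T + ridge * I)^(-1), i.e., a fixed ridge instead of the
    optimized Lagrange dual variables, then rescales over-length columns.
    """
    k = E.shape[0]
    B = T @ E.T @ np.linalg.inv(E @ E.T + ridge * np.eye(k))
    norms = np.linalg.norm(B, axis=0)
    return B / np.maximum(norms, 1.0)   # enforce ||b_i||_2 <= 1 column-wise
```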
(4)
Update the dictionary D.
When B, E, Z, M and W are fixed, D can be updated with the following objective function:
$$\min_D \|A - DZ\|_F^2 \quad \text{s.t.}\ \|d_i\|_2^2 \le 1,\ \forall i \quad (8)$$
Equation (8) can be solved similarly to Equation (6). The analytical solution of D is
$$D = AZ^{\top}\big(ZZ^{\top} + \Sigma_D\big)^{-1} \quad (9)$$
where $\Sigma_D$ is a diagonal matrix constructed from all the dual variables.
(5)
Fix B, E, D, Z, W and update M.
We fix other variables and solve M by
$$\min_M \|E - MZ\|_F^2 \quad \text{s.t.}\ \|m_i\|_2^2 \le 1,\ \forall i \quad (10)$$
Equation (10) can be optimized in the same way as Equation (6). The analytical solution for M is
$$M = EZ^{\top}\big(ZZ^{\top} + \Sigma_M\big)^{-1} \quad (11)$$
where $\Sigma_M$ is a diagonal matrix constructed from all the dual variables.
(6)
Fix B, E, D, Z, M and update W.
This subproblem can be formulated as
$$\min_W \|Y - WE\|_F^2 \quad \text{s.t.}\ \|w_i\|_2^2 \le 1,\ \forall i \quad (12)$$
Likewise, Equation (12) can also be solved similarly to Equation (6). The closed-form solution of W is
$$W = YE^{\top}\big(EE^{\top} + \Sigma_W\big)^{-1} \quad (13)$$
where $\Sigma_W$ is a diagonal matrix constructed from all the dual variables.
Algorithm 1 summarizes the entire optimization procedure. In our experiments, the optimization converged within several tens of iterations.
Algorithm 1 Goal-oriented joint learning
1: Input: X, A, Y, λ, γ.
2: Output: B, D, M, W.
3: Initialize B, D randomly.
4: Initialize E based on B.
5: Initialize Z based on D.
6: Initialize M based on E, Z.
7: Initialize W based on E, Y.
8: while not converged do
9:     update E by Equation (3).
10:    update Z by Equation (5).
11:    update B by Equation (7).
12:    update D by Equation (9).
13:    update M by Equation (11).
14:    update W by Equation (13).
15: end while
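The self-contained NumPy sketch below mirrors Algorithm 1 under the assumptions stated earlier: the √(λ/2) and √(γ/2) stacking for the code updates and a ridge-plus-rescaling approximation in place of the Lagrange-dual dictionary solver. It also assumes A holds one attribute vector per training sample (so that E and Z have matching columns); all names and latent dimensionalities are illustrative.

```python
import numpy as np

def goal_oriented_joint_learning(X, A, Y, lam, gamma, k_feat, k_attr,
                                 n_iter=50, ridge=1e-4, seed=0):
    """Alternating updates of E, Z, B, D, M, W as in Algorithm 1 (approximate sketch).

    X : (d, n) seen-class features, A : (k, n) per-sample attribute vectors,
    Y : (C_S, n) one-hot labels. k_feat / k_attr are the latent dictionary sizes.
    """
    rng = np.random.default_rng(seed)

    def codes(T, D_):   # argmin_E ||T - D_ E||_F^2 via least squares
        return np.linalg.lstsq(D_, T, rcond=None)[0]

    def dico(T, E_):    # argmin_B ||T - B E_||_F^2, ||b_i|| <= 1 (ridge + rescaling)
        B_ = T @ E_.T @ np.linalg.inv(E_ @ E_.T + ridge * np.eye(E_.shape[0]))
        return B_ / np.maximum(np.linalg.norm(B_, axis=0), 1.0)

    B = rng.normal(size=(X.shape[0], k_feat))   # steps 3-7: initialization
    D = rng.normal(size=(A.shape[0], k_attr))
    E, Z = codes(X, B), codes(A, D)
    M, W = dico(E, Z), dico(Y, E)

    for _ in range(n_iter):                     # steps 8-15: alternating updates
        B_hat = np.vstack([B, np.sqrt(lam / 2) * np.eye(k_feat), np.sqrt(gamma / 2) * W])
        X_hat = np.vstack([X, np.sqrt(lam / 2) * (M @ Z), np.sqrt(gamma / 2) * Y])
        E = codes(X_hat, B_hat)                 # Equation (3)
        D_hat = np.vstack([D, np.sqrt(lam / 2) * M])
        A_hat = np.vstack([A, np.sqrt(lam / 2) * E])
        Z = codes(A_hat, D_hat)                 # Equation (5)
        B = dico(X, E)                          # Equation (7)
        D = dico(A, Z)                          # Equation (9)
        M = dico(E, Z)                          # Equation (11)
        W = dico(Y, E)                          # Equation (13)
    return B, D, M, W
```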

3.4. Zero-Shot Recognition

Zero-shot recognition (ZSR) can be carried out over different spaces, since the latent feature domain is linked to both the attribute domain and label space.
(1)
Recognition in the latent feature domain.
In order to carry out ZSR in the latent feature domain, we must first obtain the latent feature representations of both the test sample and the prototypes of unknown classes. After that, we may calculate the distance between the test sample and unknown classes. Finally, ZSR is conducted using the nearest neighbor approach based on the cosine distance.
The latent feature representation $e^u$ of a test sample with feature vector $x^u$ can be determined by
$$e^u = \arg\min_{e^u} \|x^u - Be^u\|_F^2 + \beta\|e^u\|_2^2 \quad (14)$$
where $B$ represents the latent feature dictionary that has been learned from the training data, and $\beta$ denotes the weight assigned to the regularization term.
For the unseen-class prototype $a^u$, its corresponding latent feature representation is $Mz^u$, where $z^u$ can be obtained by Equation (15) and $M$ is learned in the training stage.
$$z^u = \arg\min_{z^u} \|a^u - Dz^u\|_F^2 + \eta\|z^u\|_2^2 \quad (15)$$
where $D$ is also obtained from the training data, and $\eta$ denotes the weight assigned to the regularization term.
(2)
Recognition in the label space.
The latent feature representation $e^u$ can be transformed to the label space via the matrix $W$; thus, each image can be denoted by a label vector $y^u$, with each element of $y^u$ signifying the similarity to a specific known class. Additionally, the label vector for each unseen-class prototype can be obtained by transforming the class attribute vector into a histogram depicting the proportions of seen classes [33]. Finally, the nearest neighbor strategy can be applied to assign a test sample to an unknown class.
(3)
Recognition in the attribute domain.
For a test sample with feature vector $x^u$, its corresponding attribute representation is $Dz^u$, where $z^u$ can be computed by Equation (16) and $D$ is learned from the training data.
$$z^u = \arg\min_{z^u} \|e^u - Mz^u\|_F^2 + \alpha\|z^u\|_2^2 \quad (16)$$
where $e^u$ is computed by Equation (14), $M$ is learned during the training stage, and $\alpha$ denotes the weight assigned to the regularization term.
Once the attribute representation of a test image is obtained, we can exploit it to assign the test image to an unknown class.
(4)
Recognition by fusing multiple spaces.
The information about unseen classes may be complementary across the various spaces, so we can integrate the representations from different spaces to accomplish the zero-shot recognition (ZSR) task. In this paper, the final representation of an image is generated by fusing its latent feature representation and label vector. The same procedure is applied to the unseen-class prototypes. Subsequently, the ZSR task can be conducted using the same nearest neighbor method as described above.
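As an illustration of option (1) above, the following NumPy sketch performs recognition in the latent feature domain using the ridge-regularized closed forms implied by Equations (14) and (15); the regularization weights β and η are placeholder values, and the fusion variants follow the same nearest-neighbor pattern.

```python
import numpy as np

def zsr_latent_feature_space(x_u, A_U, B, D, M, beta=1e-3, eta=1e-3):
    """Nearest-neighbor zero-shot recognition in the latent feature domain.

    x_u : (d,) test feature, A_U : (k, C_U) unseen-class attribute prototypes.
    B, D, M are learned at training time. Returns the predicted unseen-class index.
    """
    # Equation (14): ridge-regularized latent feature code of the test sample.
    e_u = np.linalg.solve(B.T @ B + beta * np.eye(B.shape[1]), B.T @ x_u)
    # Equation (15): latent attribute codes of all unseen-class prototypes at once.
    Z_u = np.linalg.solve(D.T @ D + eta * np.eye(D.shape[1]), D.T @ A_U)
    P = M @ Z_u                                  # prototypes in the latent feature domain
    # Cosine-similarity nearest neighbor.
    sims = (P.T @ e_u) / (np.linalg.norm(P, axis=0) * np.linalg.norm(e_u) + 1e-12)
    return int(np.argmax(sims))
```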

4. Experiments

In this section, the proposed method is evaluated on three representative zero-shot learning (ZSL) benchmarks, and its effectiveness is analyzed.

4.1. Datasets

We conduct experiments on three popular benchmark datasets, including aPascal & aYahoo (aP&Y) [23], Animals with Attributes (AwA) [24], and SUN Attribute (SUN) [41]. The statistics for these three datasets are provided in Table 1.
(a) aP&Y is a small-scale coarse-grained collection of images from two different sources. The primary component, a-Pascal, comprises 12,695 images across 20 unique object categories, drawn from the PASCAL VOC 2008 challenge [42]. The secondary part, a-Yahoo, is a separate set of 2,644 images from 12 categories that do not overlap with the a-Pascal dataset, gathered using Yahoo’s search engine. Every image in these two datasets is semantically labeled with 64 attributes. In zero-shot learning, the categories from the a-Pascal part serve as seen classes for training, while those in the a-Yahoo dataset are used as unseen classes for testing.
(b) AwA is composed of 30,475 images belonging to 50 common animal categories, gathered with image search engines. It is a coarse-grained dataset with a medium number of images and a small number of classes. Each class is annotated with 85 attributes, such as black, blue, striped, etc. The default setting for ZSL involves using 40 categories as known classes for training and the remaining 10 categories as unknown classes for testing.
(c) SUN was constructed specifically for high-level scene understanding. It is a fine-grained dataset with a medium number of images and classes. The SUN dataset comprises 14,340 images over 717 different kinds of scenes. Each image in the SUN dataset is annotated with 102 real-valued attributes, generated by a voting process. We choose the same 10 classes as unknown classes in accordance with [43].

4.2. Parameter Settings

We employ MatConvNet [44] with the pre-trained “imagenet-vgg-verydeep-19” model to extract a 4096-dimensional CNN feature vector for each image across all datasets [45]. This feature vector corresponds to the activations of the top-layer hidden units of the network. Similar CNN features have been utilized in prior work [15] for zero-shot learning tasks. The pre-defined attributes for the aP&Y, AwA and SUN datasets are employed as the semantic descriptors. Class embeddings are as crucial as image features in zero-shot learning. We use the continuous attributes, ranging from 0 to 1, provided for each class in the datasets as class embeddings, because continuous attributes have been shown to outperform binary attributes in previous studies.
For the ZSL task, the primary concern is accurately predicting the class label for each test instance, so we employ the multi-class accuracy as the evaluation metric. During training, we simulate zero-shot scenarios and conduct a five-fold cross validation process to optimize the model’s hyperparameters. Specifically, we randomly select 20% of known classes for validation, train the model on the remaining 80% of known classes in the training set, and record the performance under various hyperparameter combinations. The hyperparameters that yield the best average performance on the held-out validation data are then selected. Once hyperparameters are determined, we train the final model on the entire training dataset.
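The class-wise split used for hyperparameter selection can be sketched as follows; the snippet only generates the folds (roughly 20% of the seen classes held out per fold when five folds are used), leaving the hyperparameter grid and the actual model fitting out of scope, and all names are illustrative.

```python
import numpy as np

def class_wise_folds(seen_classes, n_folds=5, seed=0):
    """Yield (train_classes, val_classes) splits with disjoint class sets,
    simulating the zero-shot setting during hyperparameter selection."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.asarray(seen_classes))
    folds = np.array_split(classes, n_folds)
    for i in range(n_folds):
        val_classes = set(folds[i].tolist())                   # ~20% of seen classes
        train_classes = set(classes.tolist()) - val_classes    # remaining ~80%
        yield train_classes, val_classes
```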

4.3. Results and Discussion

4.3.1. Component Analysis

There are several key components in the proposed model. In order to evaluate their contributions, we compare their multi-class accuracy on the AwA, SUN and aP&Y datasets:
  • Recognition performance in the label space only, denoted as $R_{label}$.
  • Recognition performance in the attribute domain only, denoted as $R_{attribute}$.
  • Recognition performance in the latent feature domain only, denoted as $R_{feature}$.
  • Recognition performance obtained by fusing the latent feature representation and the label vector, denoted as $R_{feature+label}$.
Figure 3 demonstrates that the latent feature representations exhibit superior effectiveness in the zero-shot learning (ZSL) task, as evidenced by a comparison of $R_{label}$, $R_{attribute}$ and $R_{feature}$. Furthermore, the results of $R_{feature+label}$ indicate that combining the latent feature representations with the label vector can enhance the overall recognition performance.

4.3.2. Comparison with State-of-the-Art Approaches

In order to evaluate the effectiveness of our proposed approach, we conduct a comparative analysis with recent advanced zero-shot learning (ZSL) methods, including f-CLSWGAN [46], TCN [47], f-VAEGAN-D2 [21], DAZLE [48], RGEN [49], APN [50], TF-VAEGAN [22], CE-GZSL [51], MSDN [52], TransZero [53] and HAS [54]. Following the comparison methods, we employ the average per-class Top-1 accuracy as the evaluation metric in this experiment. Table 2 displays the performance of the various methods under the conventional zero-shot learning setting on the benchmark AwA and SUN datasets. Notably, the results for the comparison methods are sourced from the published literature. Table 2 demonstrates that our approach achieves competitive performance compared to the state-of-the-art methods, despite the fact that most of these comparison methods exploit intricate non-linear models rooted in deep learning techniques.
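For reference, the average per-class Top-1 accuracy used in this comparison can be computed as in the short sketch below; it simply averages the within-class accuracies over all test classes.

```python
import numpy as np

def per_class_top1_accuracy(y_true, y_pred):
    """Mean over classes of the within-class Top-1 accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))

# Example: two classes with accuracies 1/2 and 2/3 give (1/2 + 2/3) / 2.
assert abs(per_class_top1_accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]) - 7 / 12) < 1e-12
```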

4.3.3. Discrimination Evaluations

The objective function in Equation (1) facilitates the imposition of a discriminative constraint by training classifiers for seen classes within the latent feature space. For each image, the resulting probabilities from these classifiers construct a representation vector in the label space, where each element signifies the similarity to one specific known class. Figure 4 displays the three most similar seen classes to the selected unseen-class samples from the AwA dataset, based on the probabilities computed through the softmax function. Notably, most of these similarities align with human intuition. For example, chimpanzees exhibit the greatest resemblance to gorillas, while giant pandas and grizzly bears are the most alike.
According to Section 3.4, the similarity representations of the prototypes of unseen classes are required to perform the ZSL task, which can be obtained through the approach in [33]. The normalized similarity matrix between the unseen and seen classes on the Animals with Attributes (AwA) dataset is illustrated in Figure 5, where each column depicts the similarity scores between a particular unseen class and all the seen classes. It is revealed that the majority of these similarities align with human understanding and intuition. Given that similarities capture valuable semantic information about the unseen classes, we can leverage this knowledge to accomplish ZSL tasks effectively.
As mentioned above, seen-class classifiers are utilized to associate the latent feature domain with label space, enhancing the discriminability of latent features. In Figure 6, we visualize the unseen-class instances from the aP&Y dataset to evaluate the discriminability of latent features, where each color corresponds to an object category and each point represents a specific image.
Specifically, this visualization is achieved by projecting the learned latent feature representations of each unknown-class image onto a 2D plane using t-SNE [55]. It can be seen from Figure 6 that our approach produces distinguishable feature embeddings. The images belonging to the same class are tightly clustered together, while those from different classes are well separated, suggesting that the latent features are effective for recognition tasks. Furthermore, it is interesting to observe that the distance between classes in the visualization reflects their semantic similarity. For instance, the cluster of goats is located close to that of donkeys, indicating that the latent features not only distinguish between classes but also preserve meaningful semantic relationships.
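A visualization of this kind can be reproduced with scikit-learn's t-SNE along the lines of the sketch below; E_u is assumed to hold the learned latent feature codes of the unseen-class images (one column per image), and the plotting details are illustrative rather than the exact settings used for Figure 6.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(E_u, labels):
    """Project latent feature codes (one column per image) to 2D with t-SNE and plot."""
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(E_u.T)
    for c in np.unique(labels):
        idx = labels == c
        plt.scatter(emb[idx, 0], emb[idx, 1], s=6, label=str(c))
    plt.legend(markerscale=2, fontsize=7)
    plt.show()
```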

4.3.4. Running Cost

In Table 3, we evaluate the running cost of the proposed method, which is implemented in MATLAB R2020b on a server equipped with 3.5 GHz CPU cores and 128 GB of memory. Being a linear model, it demonstrates remarkable efficiency during the testing phase.

5. Conclusions

This paper introduces an efficient zero-shot recognition method that integrates the learning of symmetric relationships among images, attributes, and categories within a unified architecture. The latent feature domain learned by this method is inherently associated with both the attribute and label spaces, ensuring it preserves semantic information while also being discriminative enough to distinguish between classes. Additionally, the method implicitly accounts for attribute correlations within the latent attributes. Our approach, as demonstrated through extensive experiments on three benchmark datasets, outperforms the recent advanced zero-shot recognition methods.
Our proposed approach utilizes linear models as the foundation. It is generally recognized that deep learning-based non-linear models exhibit superior performance and capabilities. Consequently, we can enhance performance by incorporating non-linear transformations via deep learning methodologies. This remains a promising direction for our future research.

Author Contributions

H.Z. was responsible for the investigation, conceptualization, methodology, software development, and drafting the original manuscript. Y.Z. contributed to the validation of the research and the editing and reviewing of the manuscript. M.J. also participated in the validation process and was involved in the manuscript’s editing and reviewing. Additionally, H.Z. and M.J. were responsible for acquiring the funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Programs of Shanxi Province (No. 202203021222120), the China Scholarship Council (No. 202408330448), and the Quzhou Municipal Bureau of Science and Technology (No. 2024K168).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, V.W.; Hu, D.H.; Yang, Q. Cross-domain activity recognition. In Proceedings of the 11th International Conference on Ubiquitous Computing, Orlando, FL, USA, 30 September–3 October 2009; pp. 61–70. [Google Scholar]
  2. Venugopalan, S.; Hendricks, L.A.; Rohrbach, M.; Mooney, R.; Darrell, T.; Saenko, K. Captioning images with diverse objects. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1170–1178. [Google Scholar]
  3. Jiang, H.; Wang, R.; Shan, S.; Yang, Y.; Chen, X. Learning discriminative latent attributes for zero-shot classification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4233–4242. [Google Scholar]
  4. Fu, Z.; Xiang, T.A.; Kodirov, E.; Gong, S. Zero-shot object recognition by semantic manifold distance. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2635–2644. [Google Scholar]
  5. Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 951–958. [Google Scholar]
  6. Ba, J.L.; Swersky, K.; Fidler, S.; Salakhutdinov, R. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4247–4255. [Google Scholar]
  7. Socher, R.; Ganjoo, M.; Manning, C.D.; Ng, A. Zero-shot learning through cross-modal transfer. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), Lake Tahoe, NE, USA, 5–10 December 2013; pp. 935–943. [Google Scholar]
  8. Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; Schiele, B. Latent embeddings for zero-shot classification. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 69–77. [Google Scholar]
  9. Karessli, N.; Akata, Z.; Schiele, B.; Bulling, A. Gaze embeddings for zero-shot image classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6412–6421. [Google Scholar]
  10. Al-Halah, Z.; Tapaswi, M.; Stiefelhagen, R. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 17–30 June 2016; pp. 5975–5984. [Google Scholar]
  11. Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S. Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2332–2345. [Google Scholar] [CrossRef] [PubMed]
  12. Chen, L.; Zhang, H.; Xiao, J.; Liu, W.; Chang, S.-F. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1043–1052. [Google Scholar]
  13. Biswas, S.; Annadani, Y. Preserving semantic relations for zero-shot learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7603–7612. [Google Scholar]
  14. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.A.; Mikolov, T. Devise: A deep visual-semantic embedding model. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), Lake Tahoe, NE, USA, 5–10 December 2013; pp. 2121–2129. [Google Scholar]
  15. Akata, Z.; Reed, S.; Walter, D.; Lee, H.; Schiele, B. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2927–2936. [Google Scholar]
  16. Romera-Paredes, B.; Torr, P.H.S. An embarrassingly simple approach to zero-shot learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 2152–2161. [Google Scholar]
  17. Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1425–1438. [Google Scholar] [CrossRef] [PubMed]
  18. Kodirov, E.; Xiang, T.; Gong, S. Semantic autoencoder for zero-shot learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4447–4456. [Google Scholar]
  19. Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2251–2265. [Google Scholar] [CrossRef] [PubMed]
  20. Long, Y.; Liu, L.; Shen, F.; Shao, L.; Li, X. Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2498–2512. [Google Scholar] [CrossRef] [PubMed]
  21. Xian, Y.; Sharma, S.; Schiele, B.; Akata, Z. F-vaegan-d2: A feature generating framework for any-shot learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10267–10276. [Google Scholar]
  22. Narayan, S.; Gupta, A.; Khan, F.S.; Snoek, C.G.; Shao, L. Latent embedding feedback and discriminative features for zero-shot classification. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 479–495. [Google Scholar]
  23. Farhadi, A.; Endres, I.; Hoiem, D.; Forsyth, D. Describing objects by their attributes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1778–1785. [Google Scholar]
  24. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 453–465. [Google Scholar] [CrossRef] [PubMed]
  25. Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A.; Corrado, G.; Dean, J. Zero-shot learning by convex combination of semantic embeddings. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  26. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), Lake Tahoe, NE, USA, 5–10 December 2013; pp. 3111–3119. [Google Scholar]
  27. Jayaraman, D.; Sha, F.; Grauman, K. Decorrelating semantic visual attributes by resisting the urge to share. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1629–1636. [Google Scholar]
  28. Palatucci, M.; Pomerleau, D.; Hinton, G.E.; Mitchell, T.M. Zero-shot learning with semantic output codes. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 1410–1418. [Google Scholar]
  29. Bucher, M.; Herbin, S.; Jurie, F. Improving semantic embedding consistency by metric learning for zero-shot classification. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 730–746. [Google Scholar]
  30. Zhang, L.; Xiang, T.; Gong, S. Learning a deep embedding model for zero-shot learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3010–3019. [Google Scholar]
  31. Changpinyo, S.; Chao, W.L.; Sha, F. Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3496–3505. [Google Scholar]
  32. Zhang, L.; Wang, P.; Liu, L.; Shen, C.; Wei, W.; Zhang, Y.; Hengel, A. Towards effective deep embedding for zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2843–2852. [Google Scholar] [CrossRef]
  33. Zhang, Z.; Saligrama, V. Zero-shot learning via semantic similarity embedding. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4166–4174. [Google Scholar]
  34. Zhang, Z.; Saligrama, V. Zero-shot recognition via structured prediction. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 533–548. [Google Scholar]
  35. Changpinyo, S.; Chao, W.L.; Gong, B.; Sha, F. Synthesized classifiers for zero-shot learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5327–5336. [Google Scholar]
  36. Akata, Z.; Malinowski, M.; Fritz, M.; Schiele, B. Multi-cue zero-shot learning with strong supervision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 59–68. [Google Scholar]
  37. Gavves, E.; Mensink, T.; Tommasi, T.; Snoek, C.G.M.; Tuytelaars, T. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2731–2739. [Google Scholar]
  38. Kodirov, E.; Xiang, T.; Fu, Z.; Gong, S. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2452–2460. [Google Scholar]
  39. Li, X.; Guo, Y.; Schuurmans, D. Semi-supervised zero-shot classification with label representation learning. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4211–4219. [Google Scholar]
  40. Li, Y.; Wang, D. Zero-shot learning with generative latent prototype model. arXiv 2017. [Google Scholar] [CrossRef]
  41. Patterson, G.; Xu, C.; Su, H.; Hays, J. The sun attribute database: Beyond categories for deeper scene understanding. Int. J. Comput. Vis. 2014, 108, 59–81. [Google Scholar] [CrossRef]
  42. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge 2008 Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html (accessed on 7 May 2024).
  43. Jayaraman, D.; Grauman, K. Zero-shot recognition with unreliable attributes. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’14), Montreal, QC, Canada, 8–13 December 2014; pp. 3464–3472. [Google Scholar]
  44. Vedaldi, A.; Lenc, K. Matconvnet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, Brisbane, Australia, 26–30 October 2015; pp. 689–692. [Google Scholar]
  45. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2015 International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  46. Xian, Y.; Lorenz, T.; Schiele, B.; Akata, Z. Feature generating networks for zero-shot learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5542–5551. [Google Scholar]
  47. Jiang, H.; Wang, R.; Shan, S.; Chen, X. Transferable contrastive network for generalized zero-shot learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9764–9773. [Google Scholar]
  48. Huynh, D.; Elhamifar, E. Fine-grained generalized zero-shot learning via dense attribute-based attention. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4482–4492. [Google Scholar]
  49. Xie, G.S.; Liu, L.; Zhu, F.; Zhao, F.; Zhang, Z.; Yao, Y.; Qin, J.; Shao, L. Region graph embedding network for zero-shot learning. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 562–580. [Google Scholar]
  50. Xu, W.; Xian, Y.; Wang, J.; Schiele, B.; Akata, Z. Attribute prototype network for zero-shot learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems(NIPS ’20), Vancouver, BC, Canada, 6–12 December 2020; pp. 21969–21980. [Google Scholar]
  51. Han, Z.; Fu, Z.; Chen, S.; Yang, J. Contrastive embedding for generalized zero-shot learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2371–2381. [Google Scholar]
  52. Chen, S.; Hong, Z.; Xie, G.S.; Yang, W.; Peng, Q.; Wang, K.; Zhao, J.; You, X. Msdn: Mutually semantic distillation network for zero-shot learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7602–7611. [Google Scholar]
  53. Chen, S.; Hong, Z.; Liu, Y.; Xie, G.S.; Sun, B.; Li, H.; Peng, Q.; Lu, K.; You, X. Transzero: Attribute-guided transformer for zero-shot learning. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), Virtual, 22 February–1 March 2022. [Google Scholar]
  54. Chen, Z.; Zhang, P.; Li, J.; Wang, S.; Huang, Z. Zero-shot learning by harnessing adversarial samples. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4138–4146. [Google Scholar]
  55. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The diagram of the zero-shot learning task. During the training stage, the attributes and images of seen classes are accessible. During the test stage, ZSL recognizes samples from unseen classes by utilizing attributes, which are provided as side information for each class. This allows the model to classify these unseen samples without relying on labels.
Figure 2. The proposed goal-oriented joint learning framework for ZSL.
Figure 3. Performance comparison in different recognition spaces.
Figure 4. Several unseen-class samples and their top three similar seen classes on AwA.
Figure 5. The normalized similarity matrix between unseen classes and seen classes on AwA.
Figure 6. The visualization of the unseen-class instances by their latent feature representations on aP&Y.
Table 1. The statistics for the three benchmark datasets.

| Datasets | Granularity | Attributes | Classes (Total) | Classes (Training) | Classes (Test) | Images (Total) | Images (Training) | Images (Test) |
|---|---|---|---|---|---|---|---|---|
| aP&Y | coarse | 64 | 32 | 20 | 12 | 15,339 | 12,695 | 2644 |
| AwA | coarse | 85 | 50 | 40 | 10 | 30,475 | 24,295 | 6180 |
| SUN | fine | 102 | 717 | 707 | 10 | 14,340 | 12,900 | 1440 |
Table 2. Comparisons with the published results on AwA and SUN datasets.

| Methods | AwA | SUN |
|---|---|---|
| f-CLSWGAN [46] | 0.682 | 0.608 |
| TCN [47] | 0.712 | 0.615 |
| f-VAEGAN-D2 [21] | 0.711 | 0.647 |
| DAZLE [48] | 0.679 | 0.594 |
| RGEN [49] | 0.736 | 0.638 |
| APN [50] | 0.684 | 0.616 |
| TF-VAEGAN [22] | 0.722 | 0.66 |
| CE-GZSL [51] | 0.704 | 0.633 |
| MSDN [52] | 0.701 | 0.658 |
| TransZero [53] | 0.701 | 0.656 |
| HAS [54] | 0.714 | 0.632 |
| Ours | 0.756 | 0.77 |
Table 3. Running cost (unit: second).

| Datasets | Training | Test |
|---|---|---|
| aP&Y | 41.2457 | 0.021620 |
| AwA | 17.1541 | 0.038948 |
| SUN | 100.6624 | 0.010036 |
