Leveraging Self-Distillation and Disentanglement Network to Enhance Visual–Semantic Feature Consistency in Generalized Zero-Shot Learning

Abstract: Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes by training only on seen-class samples and auxiliary semantic descriptions. Recent state-of-the-art methods either infer unseen classes from semantic information or synthesize unseen-class samples with generative models conditioned on semantic information; both rely on the correct alignment of visual–semantic features. However, they often overlook the inconsistency between original visual features and semantic attributes. Additionally, because of cross-modal dataset bias, the visual features extracted and synthesized by the model may not match some semantic features, which can hinder the model from properly aligning visual–semantic features. To address this issue, this paper proposes a GZSL framework that enhances the consistency of visual–semantic features using a self-distillation and disentanglement network (SDDN). The aim is to obtain semantically consistent refined visual features and non-redundant semantic features. First, SDDN uses self-distillation to refine the visual features extracted and synthesized by the model. Next, the visual–semantic features are disentangled and aligned by a disentanglement network to enhance their consistency. Finally, the consistent visual–semantic features are fused to jointly train a GZSL classifier. Extensive experiments demonstrate that the proposed method achieves competitive results on four challenging benchmark datasets (AWA2, CUB, FLO, and SUN).


Introduction
Deep learning models typically require extensive labeled data for training, incurring significant human and resource costs. Zero-Shot Learning (ZSL) mitigates this constraint by learning a mapping from an auxiliary (e.g., semantic) space to the visual space, enabling the classification and recognition of unseen classes [1]. However, the traditional ZSL setting is somewhat idealized: it assumes that the test set comprises only samples from unseen classes, which does not reflect real-world scenarios. Generalized Zero-Shot Learning (GZSL) introduces a more rigorous task in which the test set can contain samples from both seen and unseen classes, better matching practical needs.
Presently, research on GZSL primarily centers on two distinct strategies. First, some researchers focus on methods grounded in Generative Adversarial Networks (GANs) [2][3][4][5][6][7], which use generative models to learn the mapping from semantic attributes to visual features and then synthesize visual samples of unseen classes from semantic information. Second, other researchers concentrate on embedding-based methods [8][9][10][11][12][13][14], which embed visual samples into a shared feature space that reflects the semantic similarity between classes. In this way, models can perform classification using the structural information of the embedding space with few or no samples. Both strategies align visual-semantic features, through either generation or embedding, to tackle the challenges inherent in ZSL.
However, they introduce a new challenge: they often overlook the potential inconsistency in the visual-semantic features to be aligned. As illustrated in Figure 1, certain visual features, such as "fish fin", are distinctive for discriminating image samples within the visual modality. However, these features are not covered by the manually annotated semantic attributes and are therefore termed semantically inconsistent visual features. Moreover, within the semantic modality, there may also be redundant semantic attributes that are inconsistent with visual features. For instance, classes like "dolphin", "seal", and "killer whales" in Figure 1 share multiple redundant semantic attributes unrelated to their visual features, such as "stripe:no", "tree:no", "vegetation:no", "long legs:no", and "long neck:no". Most GZSL methods overlook these inconsistent visual-semantic features and forcibly align them, potentially introducing biases in visual-semantic feature alignment and undermining the recognition of unseen classes. Moreover, current GZSL approaches frequently use models pre-trained on ImageNet to extract visual features and train generative models to synthesize visual features of unseen classes. However, the presence of cross-modal dataset bias [3] implies that the extracted and synthesized visual features may lack refinement and may deviate from the visual features needed for ZSL tasks, thereby worsening the problem of visual-semantic feature inconsistency. We argue that two measures can alleviate these issues: extracting and synthesizing refined visual features to enhance their semantic consistency, and separating semantically consistent visual features and visually consistent, non-redundant semantic features from the raw visual-semantic features. Hence, this paper proposes a GZSL framework that enhances the consistency of visual-semantic features using a self-distillation and disentanglement network.
Specifically, we first devise a self-distillation module that uses self-distillation to augment both the feature extraction model and the generative model in generative GZSL. This enables them to acquire both refined mid-layer features and soft-label knowledge from the auxiliary self-teacher network, encouraging the model to extract and synthesize refined visual features. Additionally, we devise a disentanglement network applied to the visual and semantic modalities. For instance, in the visual modality, the visual disentanglement encoder projects visual features into a semantically consistent representation $z_r$ and a semantically irrelevant representation $z_d$. To ensure the consistency of $z_r$ with semantic features, visual-semantic features are cross-reconstructed, and a semantic relation matching method calculates the compatibility score between $z_r$ and semantic information to guide the learning of $z_r$. Furthermore, a latent representation independence method enforces the independence between $z_r$ and $z_d$. Ultimately, the disentanglement network yields consistent visual-semantic features, which are fused to jointly train a GZSL classifier.
In summary, the contributions of this paper are as follows:
• We identified that most models do not handle inconsistent visual-semantic features and align them directly, which may lead to alignment bias. We propose an approach to enhance the consistency of visual-semantic features by refining visual features and disentangling the original visual-semantic features.
• We designed a self-distillation embedding module, which generates soft labels through an auxiliary self-teacher network and employs soft label distillation and feature map distillation to refine the original visual features of seen classes and the visual features of unseen classes synthesized by the generator, thereby enhancing the semantic consistency of visual features.
• We proposed a disentanglement network, which encodes visual-semantic features into latent representations and separates visual-semantic consistent features from the original features through semantic relation matching and latent representation independence methods, significantly enhancing the consistency of visual-semantic features.
• Extensive experiments on four GZSL benchmark datasets demonstrate that our model can separate refined, consistent visual-semantic features from the original visual-semantic features, thereby alleviating the alignment bias caused by visual-semantic inconsistency and improving the performance of GZSL models.

Generative-Based Generalized Zero-Shot Learning
In recent years, numerous studies have employed generative models to improve GZSL performance. GANs or VAEs are commonly used in generative GZSL to synthesize visual features for unseen classes. These synthesized features are then combined with the original visual features of seen classes to train classifiers. For example, Narayan et al. [15] employed VAEs and GANs to improve the quality of synthesized visual features for unseen classes. They introduced a feedback module to regulate the generator's output, effectively reducing ambiguity between classes. Zhang et al. [16] combined generative and embedding-based models by projecting real and synthesized samples onto an embedding space for classification, establishing a hybrid ZSL framework that effectively addresses data imbalance. Li et al. [17] proposed an approach that integrates a Transformer model with a VAE and a GAN, capitalizing on the rich data representation of the VAE and the diversity of the data generated by the GAN to mitigate dataset diversity bias, while using the Transformer to enhance semantic consistency. DGCNet [18] introduced a Dual Uncertainty Guided Cycle-Consistent Network, which examines the relationship between visual and semantic features through a cycle-consistent embedding framework and dual uncertainty-aware modules, effectively addressing alignment shift and enhancing model discriminability and adaptability. However, these methods often ignore the semantically inconsistent visual features and redundant semantic attributes present in the original visual-semantic features, which may prevent the correct alignment of visual-semantic features. In contrast, our method first decouples the visual-semantic consistent features before alignment, which improves the model's accuracy.

Knowledge Distillation
Knowledge distillation [19] serves as a model compression technique, aiming to reduce the size and computational complexity of a model by transferring knowledge from a complex neural network (the teacher network) to a smaller neural network (the student network). Initially, knowledge distillation encouraged the student network to imitate the output log-likelihood of the teacher network [20]. Subsequent research introduced intermediate layer distillation methods, enabling the student network to acquire knowledge from the convolutional layers of the teacher network with feature map-level locality [21][22][23][24], or from the penultimate layer of the teacher network [25][26][27][28][29]. However, these methods require pre-training a complex model as the teacher network, a process that consumes substantial time and resources. Some recent studies have proposed self-knowledge distillation [30,31], which enhances the training of the student network by leveraging its own knowledge without requiring an additional teacher network. For instance, Zhang et al. [32] segmented the network into several parts and compressed deep-layer knowledge into shallow layers. DLB [33] uses the instant soft targets generated in the previous training iteration for distillation, improving performance without altering the model structure. FRSKD [34] introduces an auxiliary self-teacher network to refine the knowledge transferred to the student's classifier network, and can perform self-knowledge distillation using both soft labels and feature map distillation. This paper adopts the concept proposed by FRSKD to construct a self-distillation embedding module that refines the original seen visual features and the unseen visual features synthesized by the generator.

Problem Definition
In GZSL, the dataset comprises visual features $X$, semantic attributes $C$, and labels $Y$, which can be divided into seen classes $S$ and unseen classes $U$. Specifically, the visual feature set is defined as $X = \{X_S, X_U\}$, and the corresponding label set is $Y = \{Y_S, Y_U\}$, where $Y_S$ and $Y_U$ are disjoint sets. Semantic attributes are defined as $C = \{C_S, C_U\}$. The $i$th seen and unseen visual features are denoted $x_s^i \in X_S$ and $x_u^i \in X_U$, the corresponding labels are denoted $y_s^i$ and $y_u^i$, and the $i$th semantic features are denoted $c_s^i \in C_S$ and $c_u^i \in C_U$. Thus, the training dataset is defined as $D_{tr} = \{(x_s^i, y_s^i, c_s^i)\}_{i=1}^{N_s}$, and the testing dataset is defined as $D_{te} = \{(x_u^i, y_u^i, c_u^i)\}_{i=1}^{N_u}$, where $N_s$ and $N_u$ are the numbers of samples in each set. The objective of GZSL is to learn a classifier $F_{GZSL}: X \to Y_S \cup Y_U$.

Overall Framework
The SDDN architecture comprises two key modules, as depicted in Figure 2. The first module, the self-distillation embedding module, uses feature fusion and an auxiliary self-teacher network to transfer refined visual features to the student network. It then employs both soft label distillation and feature map distillation to make the generative model produce refined features, thereby enhancing the consistency of visual-semantic features. The second module, the disentanglement network, employs the semantic relation matching (SRM) method and the latent representation independence (IND) method to guide the visual-semantic disentangled autoencoder in decoupling semantically consistent visual features and non-redundant semantic features, further strengthening the consistency of visual-semantic features.

Self-Distillation Embedding Module
To refine the seen visual features extracted by the pre-trained ResNet101 [35] and the unseen visual features synthesized by the generator, and thereby improve their semantic consistency, we designed a self-distillation embedding (SDE) module, as shown in Figure 3. The SDE module comprises a self-distillation feature refinement (SDFR) module and a self-distillation conditional generation (SDCG) module. The SDFR module integrates top-down and bottom-up feature fusion methods [36] to direct the auxiliary self-teacher network in generating refined intermediate feature maps and soft labels. Feature distillation and soft label distillation then refine the visual features extracted by the student network. The SDCG module consists of a generator and a discriminator. It trains the generator adversarially to synthesize visual features and employs feature map distillation and soft label distillation to refine the generated visual features.

Auxiliary Self-Teacher Network
To refine the visual features extracted by the pre-trained ResNet101 and pass them to the generator of the SDCG module, we devised an auxiliary self-teacher network $\mathcal{T}$ based on the architecture proposed by Ji et al. [34]. This auxiliary self-teacher network, depicted in green in Figure 3, also uses the pre-trained ResNet101. We use top-down and bottom-up feature fusion methods [36] to guide the auxiliary self-teacher network in generating refined intermediate feature maps $x_t^{\mathcal{T}_i}$. The layer preceding the classification layer $f_t$ outputs the final extracted visual features $x_t$, while the classification layer $f_t$ itself outputs the soft labels $P_t$. Deep neural networks learn representations at various levels; hence, outputs from both intermediate and output layers can contribute to training the student network. In our method, ResNet101 (depicted in blue in Figure 3) serves as the student network $\mathcal{A}$, and the visual features it extracts are enriched with the soft labels $P_t$ output by the self-teacher network and the refined feature maps $x_t^{\mathcal{T}_i}$ from its intermediate layers. The soft labels are generated in the auxiliary self-teacher network as $P_t = \mathrm{softmax}(f_t(x_t)/T)$. Here, $T$ is the temperature parameter [20], typically set to 1; higher values of $T$ result in softer class probability distributions. $f_t$ denotes the classifier of the auxiliary self-teacher network. The student network learns from $P_t$ by minimizing the KL divergence between its own predictions and $P_t$. The intermediate-layer feature output by the $i$th student network block is denoted $x_a^{\mathcal{A}_i}$. The student network learns the refined intermediate-layer features $x_t^{\mathcal{T}_i}$ produced by the auxiliary self-teacher network through feature distillation, implemented by the loss function $L_F$, which compares $\phi(x_a^{\mathcal{A}_i})$ with $\phi(x_t^{\mathcal{T}_i})$ over the network blocks, where $\phi$ is the channel pooling function. Additionally, class predictions are made on the enhanced visual features, and the cross-entropy loss $L_{CE}$ between the predictions and the true labels is minimized to ensure the accuracy of the enhanced visual features. Finally, the loss function for refining visual features with the auxiliary self-teacher network, denoted $L_{STN}$, combines $L_{CE}$, the KL divergence loss, and $L_F$.
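The effect of the temperature parameter and of KL-based soft label distillation can be illustrated with a small, self-contained sketch. The logits below are hypothetical, and the direction KL(teacher || student) is an assumption chosen for illustration; it is not claimed to be the exact loss used in this paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw class scores into probabilities, softened by `temperature`."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical scores for one sample and three classes.
teacher_logits = [4.0, 1.0, 0.5]  # self-teacher classifier output
student_logits = [2.5, 1.5, 0.2]  # student classifier output

p_t_sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
p_t_soft = softmax_with_temperature(teacher_logits, temperature=4.0)
p_a = softmax_with_temperature(student_logits, temperature=1.0)

# Assumed distillation term: how far the student is from the teacher.
distill_loss = kl_divergence(p_t_sharp, p_a)
```

With a larger temperature the largest teacher probability shrinks and the remaining classes gain mass, which is what "softer" means here; the KL term is zero only when the student reproduces the teacher's distribution.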

Self-Distillation Conditional Generation Module
To synthesize refined visual features for unseen classes, and thereby address the lack of unseen visual features in ZSL, we devised a self-distillation conditional generation module. This module uses a conditional generator $G$ and a discriminator $D$ to form a generative adversarial network. During training, Gaussian noise $N(0, 1)$ and the semantic descriptors of seen classes $c_s$ are first used as conditional inputs to $G$ to synthesize visual features $x'_s = G(c_s, N)$. The synthesized $x'_s$ is then fed into the trained student network classifier $f_a$ to obtain the soft label $f_a(x'_s)$. To distinguish them clearly from the synthesized (fake) seen-class visual features $x'_s$, we denote the real seen visual features $x_a$ obtained by the student network in Figure 3 as $x_s$. We minimize a distillation loss between $f_a(x'_s)$ and the soft labels $P_t$ to keep $x'_s$ consistent with the real visual features $x_s$. Simultaneously, to ensure the accuracy of the synthesized visual features, we compute the cross-entropy loss between $f_a(x'_s)$ and the real labels $Y_S$. Additionally, the discriminator $D$ is trained to distinguish real seen-class samples $(x_s, c_s)$ from synthesized seen-class samples $(x'_s, c_s)$ through the loss $L_{wgan}$, so that the visual features synthesized by the generator stay close to the real visual features. In $L_{wgan}$, $\hat{x}_s$ is the sample on which the penalty term is evaluated, and $\lambda$ is the penalty coefficient. The loss function of the SDCG module, $L_{SDCG}$, combines the distillation loss, the cross-entropy loss, and $L_{wgan}$. Finally, the overall loss of the SDE module is $L_{SDE} = L_{STN} + L_{SDCG}$.
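The name $L_{wgan}$ and the penalty coefficient $\lambda$ suggest a conditional Wasserstein GAN objective with a gradient penalty. A commonly used form is given below purely as a sketch under that assumption; the interpolation weight $\alpha$ and the sign convention are not taken from this paper and may differ from the authors' formulation.

```latex
% Assumed conditional WGAN-GP objective (illustrative, not the paper's verified equation)
L_{wgan} = \mathbb{E}\left[ D(x_s, c_s) \right]
         - \mathbb{E}\left[ D(x'_s, c_s) \right]
         - \lambda \, \mathbb{E}\left[ \left( \left\lVert \nabla_{\hat{x}_s} D(\hat{x}_s, c_s) \right\rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x}_s = \alpha x_s + (1 - \alpha) x'_s, \quad \alpha \sim U(0, 1)
```

In this form the discriminator maximizes the objective while the generator minimizes it, and the penalty term keeps the gradient norm of $D$ close to 1 on points interpolated between real and synthesized features.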

Disentanglement Network
To further strengthen the consistency of visual-semantic features, we propose a disentanglement network. This network uses a semantic relation matching method and a latent representation independence method to guide the visual-semantic disentangled autoencoder in separating semantically consistent visual features and non-redundant semantic features from the original data. Furthermore, it aligns these features using a cross-reconstruction method to further strengthen their consistency.

Visual-Semantic Disentangled Autoencoder
The visual-semantic disentangled autoencoder (VSDA) comprises two parallel variational autoencoders that process the visual and semantic modalities separately. Each variational autoencoder includes a disentangled encoder and a decoder. The disentangled encoder maps the feature space to the latent space, while the decoder maps the latent space back to the feature space. The primary role of the VSDA is to acquire effective latent representations, $z_1$ and $z_2$, for the visual-semantic features and to split $z_1$ along the column dimension into $z_r$ and $z_d$, and $z_2$ into $c_r$ and $c_d$. Specifically, the visual disentanglement encoder $E_V$ and the semantic disentanglement encoder $E_S$ encode the visual features $x = \{x_s, x'_u\}$ and the semantic features $c$ into $z_1$ and $z_2$, respectively. Taking $z_1$ as an example, its row dimension is the number of samples $n$ in a batch, and its column dimension is $m$. Let $l$ be a column index in the interval $(0, m)$. The columns of $z_1$ from 0 to $l$ are defined as $z_r$, and the remaining columns from $l$ to $m$ are defined as $z_d$; $z_2$ is split in the same way. This procedure can be expressed as $z_r = z_1[:, 0:l]$, $z_d = z_1[:, l:m]$, $c_r = z_2[:, 0:l]$, $c_d = z_2[:, l:m]$. Note that, in the visual modality, $z_r$ and $z_d$ at this stage are merely a partition of $z_1$ along its dimensions. The semantic relation matching method in Section 3.4.2 and the latent representation independence method in Section 3.4.3 are subsequently needed to make the latent representation $z_r$ semantically consistent and the latent representation $z_d$ semantically irrelevant.
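The column-wise partition of a latent batch can be sketched in a few lines. The helper name `split_latent` and the toy values are hypothetical; a batch is represented here as a plain list of rows rather than a tensor.

```python
def split_latent(z, l):
    """Split each row of a latent batch at column l.

    z is a list of n rows with m columns each; the first l columns form the
    first part (z_r or c_r) and the remaining m - l columns the second part
    (z_d or c_d).
    """
    m = len(z[0])
    if not 0 < l < m:
        raise ValueError("l must lie strictly between 0 and m")
    first = [row[:l] for row in z]
    second = [row[l:] for row in z]
    return first, second


# Toy batch with n = 2 samples and m = 4 latent dimensions.
z_1 = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
z_r, z_d = split_latent(z_1, l=3)
```

The split itself carries no semantics; it only fixes which columns the later losses will push toward semantically consistent or semantically irrelevant content.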
By minimizing the KL divergence between the latent variable distribution and a predefined prior distribution, the VSDA can effectively learn representations in the latent space. Consequently, to guarantee the validity of the latent representations $z_1$ and $z_2$ obtained by the visual-semantic disentangled autoencoder, we optimize this KL divergence loss for both modalities. To reduce information loss, we use the disentangled decoders to reconstruct features from the latent representations $(z_r, z_d)$ and $(c_r, c_d)$, and minimize the loss between the reconstructed and the real visual-semantic features. Specifically, the visual disentangled decoder reconstructs $(z_r, z_d)$ into visual features $\tilde{x}$, and the semantic disentangled decoder reconstructs $(c_r, c_d)$ into semantic features $\tilde{c}$. We then compute the loss between the reconstructed visual features $\tilde{x}$ and the real visual features $x$, and the loss between the reconstructed semantic features $\tilde{c}$ and the real semantic features $c$. Summing these losses yields the total reconstruction loss $L_{REC}$. Simultaneously, to enable the model to learn the association between visual and semantic features and reduce the deviation between modalities, we design a cross-modal reconstruction. Specifically, the semantic disentangled decoder reconstructs the visual latent representation $(z_r, z_d)$ into semantic features $\tilde{c}_x$, and the visual disentangled decoder reconstructs the semantic latent representation $(c_r, c_d)$ into visual features $\tilde{x}_c$. We then minimize the cross-reconstruction loss $L_{CROSS\_REC}$ between $\tilde{c}_x$ and $\tilde{x}_c$ and the original semantic and visual features $c$ and $x$. Here, the mean square error (MSE) is used to calculate the reconstruction loss between the original features and the reconstructed features. Finally, the overall loss of the VSDA combines the KL divergence loss, $L_{REC}$, and $L_{CROSS\_REC}$.

Semantic Relation Matching
To guide the visual disentangled encoder in separating semantically consistent latent representations into $z_r$, we designed the semantic relation matching (SRM) method. This method introduces a relation network (RN) [37] to assess the matching relationship between visual and semantic features, as illustrated in Figure 4. The RN is a neural network that learns a distance between two samples and thereby measures their degree of matching. We can therefore use the RN to constrain the latent representation $z_r$ to match $c$. This means that the RN encourages the visual disentangled encoder to encode the semantically consistent visual features of the original features into $z_r$, while semantically inconsistent visual features are encoded into $z_d$. In the SRM method, we first concatenate $z_r$ with each unique semantic feature $c$ and compute their compatibility score $CS$. Specifically, when the labels of $z_r$ and $c$ are the same, the match is successful and $CS$ is set to 1; when the labels differ, the match fails and $CS$ is set to 0. This process can be formulated as $CS(z_r^{(t)}, c^{(b)}) = 1$ if $y^{(t)} = y^{(b)}$, and $0$ otherwise (13). Here, $t$ and $b$ index the $t$th semantically consistent representation and the $b$th unique semantic feature in the training batch, and $y^{(t)}$ and $y^{(b)}$ are the class labels of $z_r^{(t)}$ and $c^{(b)}$, respectively. Using $CS$ as defined in Equation (13) as the target, an RN with a sigmoid activation function learns a compatibility score ranging from 0 to 1 for each pair $(z_r, c)$. The resulting SRM loss, which penalizes the deviation of the learned scores from $CS$, is then used to optimize $z_r$.
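The ground-truth compatibility scores can be built directly from the class labels, as sketched below. The function names and toy values are hypothetical, and scoring the predictions with a mean squared error is an assumption (a common choice for relation networks) rather than the loss confirmed by this paper.

```python
def compatibility_targets(latent_labels, semantic_labels):
    """Return CS[t][b] = 1.0 if the t-th latent and b-th semantic feature
    share a class label, else 0.0."""
    return [
        [1.0 if y_t == y_b else 0.0 for y_b in semantic_labels]
        for y_t in latent_labels
    ]


def mse_loss(predicted, target):
    """Mean squared error over all (t, b) pairs (assumed matching loss)."""
    squared = [
        (p - g) ** 2
        for p_row, g_row in zip(predicted, target)
        for p, g in zip(p_row, g_row)
    ]
    return sum(squared) / len(squared)


# Batch of three latent representations and three unique class embeddings.
targets = compatibility_targets(latent_labels=[0, 2, 2],
                                semantic_labels=[0, 1, 2])

# Hypothetical sigmoid outputs of a relation network for a 2 x 2 case.
predicted_scores = [[0.9, 0.1],
                    [0.2, 0.8]]
loss = mse_loss(predicted_scores, compatibility_targets([0, 1], [0, 1]))
```

Each row of `targets` contains exactly one 1 when every sample's class appears once among the unique semantic features, so the relation network is trained to single out the matching class description for each $z_r$.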

Independence between Latent Representations
In Section 3.4.2, we used the SRM method to guide the visual disentangled encoder to preliminarily turn $z_r$ into a semantically consistent latent representation of visual features and $z_d$ into a semantically irrelevant one. To further enhance the decoupling between the semantically consistent latent representation $z_r$ and the semantically irrelevant one $z_d$, while also encouraging the semantic disentangled encoder to encode visually consistent and visually irrelevant semantic features into the latent representations $c_r$ and $c_d$, respectively, we devised the latent representation independence (IND) method. From a probabilistic perspective, $z_r$ and $z_d$ can be considered to come from different conditional distributions in the visual modality, while $c_r$ and $c_d$ can be considered to come from different conditional distributions in the semantic modality: $z_r \sim \psi_1(z_r|x)$, $z_d \sim \psi_2(z_d|x)$, $c_r \sim \psi_3(c_r|c)$, $c_d \sim \psi_4(c_d|c)$, where $\psi_1$ and $\psi_2$ are the distributions of $z_r$ and $z_d$, and $\psi_3$ and $\psi_4$ are the distributions of $c_r$ and $c_d$, respectively. Therefore, the independence between $z_r$ and $z_d$, denoted $IND_v$, the independence between $c_r$ and $c_d$, denoted $IND_s$, and their overall independence, denoted $IND$, can be expressed as $IND_v = KL(\psi \,\|\, \psi_1 \cdot \psi_2)$, $IND_s = KL(\psi' \,\|\, \psi_3 \cdot \psi_4)$, and $IND = IND_v + IND_s$ (16), where $\psi := \psi(z_r, z_d|x)$ is the joint conditional probability of $z_r$ and $z_d$, and $\psi' := \psi(c_r, c_d|c)$ is defined similarly for the semantic modality. Taking the visual modality as an example, let $y = 1$ indicate that $z_r$ and $z_d$ are dependent, with density $\tau(z_1|y = 1)$, and $y = 0$ indicate that they are independent, with density $\tau(z_1|y = 0)$. Then $IND_v$ can be represented as $IND_v = E_{\psi}[\log(\tau(z_1|y = 1)/\tau(z_1|y = 0))] = E_{\psi}[\log(\tau(y = 1|z_1)/\tau(y = 0|z_1))]$. We introduce a discriminator $DIS_v$ to approximate $\tau(y = 1|z_1)$; thus, $IND_v$ can be approximated as $IND_v \approx E_{\psi}[\log(DIS_v(z_1)/(1 - DIS_v(z_1)))]$. During the training of the discriminator, we randomly shuffle $z_r$ and $z_d$ within each training batch and then concatenate them to obtain $\hat{z}_1$. The discriminators on the visual and semantic modalities are then trained with a loss that teaches them to tell the original latent representations apart from the shuffled ones. In summary, the total loss of our SDDN framework combines the losses of the modules described above, with the SRM loss, the independence term $IND$, and the discriminator loss weighted by $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively.
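The batch shuffling used to build the discriminator's "independent" samples can be sketched as follows. The helper name `shuffled_concat` and the toy values are hypothetical, and shuffling the two parts with separate permutations is one reasonable reading of the description, not a confirmed implementation detail.

```python
import random


def shuffled_concat(z_r, z_d, rng):
    """Pair every z_r row with a z_d row drawn from a (likely) different
    sample, so that the two parts of each concatenated row are unrelated."""
    order_r = list(range(len(z_r)))
    order_d = list(range(len(z_d)))
    rng.shuffle(order_r)  # permute the semantically consistent parts
    rng.shuffle(order_d)  # permute the semantically irrelevant parts
    return [z_r[i] + z_d[j] for i, j in zip(order_r, order_d)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
z_r = [[1, 1], [2, 2], [3, 3]]   # first l = 2 columns of three samples
z_d = [[10], [20], [30]]         # remaining column of the same samples
z_hat = shuffled_concat(z_r, z_d, rng)
```

Rows of the original batch act as samples from the joint distribution (label 1 for the discriminator), while rows of `z_hat` approximate samples from the product of the marginals (label 0).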

Classification
When training the classifier, the seen visual features x s , refined by the auxiliary self-teacher network, and the synthesized unseen visual features x

Experiments

Datasets
We conducted comprehensive tests on four publicly available benchmark datasets: Caltech-UCSD Birds-200-2011 (CUB) [38], Animals with Attributes2 (AWA2) [39], SUN Attribute Dataset (SUN) [40], and Oxford Flowers (FLO) [41]. All datasets and their statistics are summarized in Table 1. CUB is a fine-grained dataset comprising 11,788 images of 200 bird species, with 150 seen classes and 50 unseen classes. Each image in CUB is annotated with 312-dimensional attributes. FLO is another fine-grained dataset of flower images, containing 8189 images across 102 classes, with 82 seen classes and 20 unseen classes. The annotation attributes in FLO have 1024 dimensions. SUN is a fine-grained dataset of scene images, with 14,340 images covering 717 classes (645 seen classes and 72 unseen classes). Each scene in SUN is associated with 102-dimensional attributes describing characteristics such as lighting conditions, weather conditions, and terrain. AWA2 is a coarse-grained dataset with 37,322 animal images across 50 classes (40 seen classes and 10 unseen classes), covering a wide range of animals, including mammals, birds, and reptiles. Each image in AWA2 is labeled with 85-dimensional attributes.

Evaluation Protocol
During testing, accuracy is assessed on the test sets of both seen classes (S) and unseen classes (U). Here, U is the average per-class accuracy on test images of unseen classes, indicating the model's ability to classify samples from previously unseen classes. S is the average per-class accuracy on test images of seen classes, reflecting the model's ability to classify samples from seen classes. H, defined as H = (2 × S × U)/(S + U), is the harmonic mean of S and U and serves as the evaluation metric for GZSL classification performance.
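The two quantities in this protocol, average per-class accuracy and the harmonic mean H, can be computed as in the following minimal sketch. The function names and toy labels are hypothetical and only illustrate the definitions above.

```python
def per_class_accuracy(y_true, y_pred):
    """Average of the accuracies computed separately for each true class."""
    correct, total = {}, {}
    for truth, guess in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == guess else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(seen_acc, unseen_acc):
    """H = (2 * S * U) / (S + U); defined as 0 when both accuracies are 0."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Toy example: class 0 has three test images, class 1 has one.
acc = per_class_accuracy(y_true=[0, 0, 0, 1], y_pred=[0, 0, 1, 1])

# A model that is strong on seen classes but weak on unseen ones.
h = harmonic_mean(80.0, 40.0)
```

Per-class averaging keeps large classes from dominating the score, and the harmonic mean stays low unless both S and U are high, which is why H is the headline GZSL metric.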

Implementation Details
SDDN mainly consists of an SDE module and a disentanglement network. The SDE module consists of a student network, a self-teacher network, a generator, and a discriminator. The student network is a ResNet101 model pre-trained on ImageNet and is used to extract visual features with a dimension of 2048. The self-teacher network is composed of the student network itself and the feature fusion method. The generator is implemented as a multi-layer perceptron with a hidden layer dimension of 2048, and the discriminator is implemented using a fully connected layer and an activation function. The disentanglement network consists of an encoder, a decoder, a discriminator, and a semantic relation matching model. Both the encoder and the decoder are multi-layer perceptrons with a single hidden layer of 2048 hidden units. The semantic relation matching model consists of two fully connected layers, activated with the Smooth Maximum Unit (SMU) [42] activation function and the sigmoid function, respectively. The discriminator is implemented using a fully connected layer and the SMU activation function.
The hardware environment used by SDDN is an Intel i7-10700K CPU and an RTX A5000 32GB GPU; the software environment is Ubuntu 20.04 LTS, CUDA 11.4.0, and cuDNN 8.2.4. SDDN is implemented in PyTorch 1.10.1. The Adam optimizer is used to optimize the parameters of each module, with the learning rate set to lr = 0.0001, β 1 = 0.9, and β 2 = 0.999. The batch size is 64. The loss weight λ 1 of the semantic relation matching method, the loss weight λ 2 of the latent representation independence method, and the weight λ 3 of the visual-semantic discriminator are set within the range 0.1 to 25.

Comparing with the State of the Art
To validate the effectiveness of our proposed SDDN model, we computed the seen-class accuracy S, the unseen-class accuracy U, and their harmonic mean H on the four datasets described above. We compared them with 15 state-of-the-art models; the comparison results are shown in Table 2. These 15 models are categorized into generative and non-generative methods. Generative methods typically use techniques such as GANs or VAEs to synthesize unseen-class data that augment the training dataset. These synthetic data are used to train ZSL models and improve their generalization to unseen classes. Non-generative methods, on the other hand, do not rely on synthetic data but generalize to unseen classes through techniques such as feature embedding and alignment of existing data. Our method belongs to the generative methods. As the table shows, SDDN achieved the highest U, S, and H on the FLO dataset, surpassing all compared models. Specifically, we outperformed the second-best model by 2.3% in the H metric. There was a notable improvement in the U metric, where we led the second-best by 4%. In the S metric, we were ahead of the second-best by 0.3%. On the CUB dataset, we achieved the highest accuracy on the H metric, leading the second-best by 1.1%. Additionally, we obtained the second-best accuracy on both the U and S metrics, leading the third-best by 6% and 1.9%, respectively. On the SUN dataset, we attained the highest accuracy on both the S and H metrics, leading the second-best by 2% in the H metric and 1.4% in the S metric.
Overall, our performance was the best in the H metric on the three fine-grained datasets, FLO, CUB, and SUN; the best on U on FLO; and the best on S on both FLO and SUN. This suggests that the richer the information in the dataset, the more effectively our proposed method can capture it through self-distillation and disentanglement, separating visual-semantic consistent features and aligning them effectively.
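As a brief illustration of how H combines S and U, the following minimal sketch computes the harmonic mean used throughout the GZSL literature, H = 2·S·U / (S + U). The function name and the input values are illustrative assumptions and are not taken from Table 2.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H of seen-class accuracy S and unseen-class accuracy U."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2.0 * seen_acc * unseen_acc / total


# Illustrative values only (not results reported in the paper).
balanced = harmonic_mean(70.0, 70.0)  # S and U equal
skewed = harmonic_mean(90.0, 50.0)    # same arithmetic mean, but biased to seen classes
print(balanced, skewed)
```

Because the harmonic mean penalizes imbalance, a model that is strongly biased toward seen classes obtains a lower H than a balanced model with the same arithmetic mean, which is why H is the main metric in GZSL comparisons.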

Ablation Study
In our ablation study, we aim to isolate the key components of SDDN and assess their impact on GZSL. We remove the semantic relation matching loss (LSRM) to evaluate the contribution of the Visual-Semantic Matching module to extracting semantically consistent visual features. Omitting the independence score (IND) allows us to evaluate its contribution to further separating visual-semantic consistent features. Additionally, we exclude the loss of the self-distillation embedding module (LSDE) and instead use the pre-trained ResNet101 and a regular generator without the self-distillation technique. This evaluation helps assess the refinement effect of SDE on extracting and synthesizing visual features, while validating the effectiveness of the disentanglement network. Our ablation experiments were conducted on the FLO and CUB datasets, with the experimental results presented in Table 3 and Figure 6.
The results underscore the critical importance of the semantic relation matching loss (LSRM), the independence score (IND), and the self-distillation embedding loss (LSDE) for the performance of SDDN. Firstly, LSRM is particularly crucial for visual-semantic feature alignment, as its removal leads to a significant decrease in the accuracy of seen classes (S), unseen classes (U), and the harmonic mean (H). Secondly, IND is essential for further separating visual-semantic consistent features from the original features, as its removal results in lower U, S, and H values. Additionally, LSDE helps refine the original visual features of seen classes and the synthesized features of unseen classes, as the model without LSDE scores lower in U, S, and H than the complete SDDN. Furthermore, a comparison with Table 2 shows that the model without SDE still achieves the highest H score on FLO and CUB, indicating the effectiveness of the disentanglement network. Finally, the complete SDDN model demonstrates superior performance across all metrics, supporting its effectiveness in GZSL.

Hyper-Parameter Analysis
In this study, the optimization objective of SDDN is determined by three critical hyperparameters: the coefficient of the semantic relation matching loss (λ_1), the coefficient of the independence of latent representations (λ_2), and the coefficient of the discriminator loss (λ_3). To elucidate the influence of each hyperparameter on model performance, a sensitivity analysis was conducted by varying the hyperparameter values in the experiments. Specifically, λ_1 was varied within the range of 0.3-20.0, while λ_2 and λ_3 were varied within the range of 0.1-3.0. Figure 7 illustrates the significant impact of the hyperparameter values λ_1, λ_2, and λ_3 on the experimental outcomes. Notably, the model achieves its highest accuracy on the FLO dataset when λ_1 = 18, λ_2 = 0.5, and λ_3 = 2, whereas it achieves its highest accuracy on the CUB dataset when λ_1 = 1, λ_2 = 0.6, and λ_3 = 0.3. These observations indicate that the model is highly sensitive to these hyperparameters and that the best setting is dataset-dependent. Based on these findings, we recommend that future experiments examine the specific impact of minor fluctuations in these three hyperparameter values on accuracy. Such a systematic analysis will contribute to a deeper understanding of model behavior and offer valuable insights for optimizing model performance across diverse datasets.
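A sensitivity analysis of this kind can be sketched as a one-at-a-time sweep: each hyperparameter is varied over a grid while the other two are held at a base setting. The sketch below is only illustrative. The grid points, the helper names, and in particular `toy_evaluate` are assumptions: `toy_evaluate` is a hypothetical stand-in for training SDDN and returning H, and it does not reproduce any reported result. Only the base setting (the best CUB values) and the sweep ranges come from the text.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def sensitivity_sweep(
    evaluate: Callable[[Params], float],
    base: Params,
    grids: Dict[str, List[float]],
) -> Dict[str, List[Tuple[float, float]]]:
    """Vary one hyperparameter at a time, holding the others at `base`."""
    results: Dict[str, List[Tuple[float, float]]] = {}
    for name, values in grids.items():
        curve = []
        for value in values:
            params = dict(base)   # copy so the base setting is never mutated
            params[name] = value
            curve.append((value, evaluate(params)))
        results[name] = curve
    return results


def toy_evaluate(params: Params) -> float:
    """Hypothetical stand-in for 'train SDDN, return H'. NOT a real result."""
    target = {"lambda1": 1.0, "lambda2": 0.6, "lambda3": 0.3}
    return 100.0 - sum((params[k] - target[k]) ** 2 for k in target)


# Base setting: the best CUB values reported in the text.
base = {"lambda1": 1.0, "lambda2": 0.6, "lambda3": 0.3}
# Grid points are assumptions chosen inside the reported sweep ranges.
grids = {
    "lambda1": [0.3, 1.0, 5.0, 10.0, 20.0],  # reported range 0.3-20.0
    "lambda2": [0.1, 0.6, 1.5, 3.0],         # reported range 0.1-3.0
    "lambda3": [0.1, 0.3, 1.0, 3.0],         # reported range 0.1-3.0
}

results = sensitivity_sweep(toy_evaluate, base, grids)
# For each hyperparameter, pick the grid value with the highest score.
best = {name: max(curve, key=lambda p: p[1])[0] for name, curve in results.items()}
print(best)
```

A one-at-a-time sweep is cheap but ignores interactions between λ_1, λ_2, and λ_3; a full grid or random search over the three weights would capture them at a higher training cost.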

Zero-Shot Retrieval Performance
To assess the practical application performance of our SDDN framework, we conducted zero-shot retrieval experiments comparing SDDN with two other state-of-the-art generative GZSL frameworks: DGGNet-db and DVAGAN. The experiment follows the zero-shot retrieval protocol in SDGZSL [2]. In the zero-shot retrieval experiments, we first provide the semantic features of unseen classes and then employ the generation modules of SDDN, DGGNet-db, and DVAGAN to synthesize a certain number of visual features for these unseen classes. The average of the synthesized visual features of each category is used as the retrieval feature. Subsequently, the cosine similarity between the retrieval features and the true features is calculated, and the true features are ranked in descending order of this similarity. The performance of zero-shot image retrieval is evaluated using mean Average Precision (mAP). Experimental analyses are performed on three datasets: CUB, AWA2, and SUN.
The results, illustrated in Figure 8, compare SDDN, DGGNet-db, and DVAGAN in terms of zero-shot retrieval performance. The horizontal coordinates 100, 50, and 25 represent the proportions of unseen category images in the test dataset (100%, 50%, and 25%, respectively), while the vertical coordinate represents the average retrieval accuracy. The results indicate significantly higher zero-shot retrieval performance for the SDDN framework on the CUB and SUN datasets compared to DGGNet-db and DVAGAN. On the AWA2 dataset, we achieve the best performance when the proportion of unseen category images is 50%, and remain close to the best performance when the proportion is 100%. These zero-shot retrieval tests on the three datasets further validate the effectiveness of the model.
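The retrieval procedure described above (class-mean query, cosine ranking, mAP) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the evaluation code used in the experiments: the function names and the toy two-dimensional features are hypothetical, the whole gallery is ranked, and the 100%/50%/25% settings of Figure 8 are not modeled.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average synthesized features of one class into a retrieval feature."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def average_precision(ranked_labels: Sequence[str], query_label: str) -> float:
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def zero_shot_retrieval_map(
    synthesized: Dict[str, List[List[float]]],
    real_features: List[List[float]],
    real_labels: List[str],
) -> float:
    """mAP over unseen classes, ranking real features by cosine similarity."""
    aps = []
    for cls, feats in synthesized.items():
        query = mean_vector(feats)
        order = sorted(
            range(len(real_features)),
            key=lambda i: cosine(query, real_features[i]),
            reverse=True,  # descending similarity
        )
        aps.append(average_precision([real_labels[i] for i in order], cls))
    return sum(aps) / len(aps)


# Toy 2-D features (assumed for illustration only).
synthesized = {"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0], [0.1, 0.9]]}
real_features = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.0], [0.0, 1.0]]
real_labels = ["a", "b", "a", "b"]
map_score = zero_shot_retrieval_map(synthesized, real_features, real_labels)
print(map_score)
```

In this toy case every class query ranks its own images first, so the mAP is perfect; with real features, misranked images lower the precision at each relevant rank and hence the mAP.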

Conclusions
In this paper, we propose a generalized zero-shot learning framework that utilizes a self-distillation and disentanglement network to enhance visual-semantic feature consistency. First, to improve the semantic consistency of visual features, we develop a self-distillation embedding framework that integrates self-distillation techniques with a conditional generator to prompt the synthesis of refined visual features. Subsequently, to further promote visual-semantic feature consistency, we design a disentanglement network. We use semantic relation matching networks and latent representation independence methods to facilitate the separation of visual-semantic consistent features from inconsistent features. Additionally, we devise a cross-reconstruction method to align visual and semantic features within a visual-semantic common space, thereby enhancing the semantic consistency of visual-semantic features. Extensive experiments are conducted on four widely used GZSL benchmark datasets. We compare SDDN with current state-of-the-art methods, thereby demonstrating the superiority of the proposed SDDN framework. In future work, we intend to optimize the model further and apply it in the field of medical diagnostics to assist in identifying new disease patterns.
Author Contributions: Responsible for proposing research ideas, modeling frameworks, content planning, guidance, and full-text revisions: X.L.; responsible for literature research, research methodology, experimental design, thesis writing, and full-text revision: C.W. (Chen Wang); responsible for lab instruction, guidelines, and full-text revisions: G.Y. and C.W. (Chunhua Wang); responsible for providing guidance, revising, and reviewing the full text: Y.L., J.L. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Figure 1. Illustration of visual features inconsistent with annotated attributes (highlighted in yellow boxes) and redundant annotated attributes inconsistent with visual features (highlighted in red text).

Figure 2. The framework of our SDDN model.

Figure 3. Architecture of the self-distillation embedding module. Here, A_i denotes the block of the student network, T_i represents the block of the auxiliary self-teacher network, G stands for the generator, and D signifies the discriminator. f_t and f_a denote the classifier layers of the auxiliary self-teacher network and the student network, respectively, while P_t and P_a represent the soft labels output by the auxiliary self-teacher network and the student network, respectively.
denotes the size of the training batch, and n denotes the number of unique semantic features corresponding to the training batch. In each training batch, the mean square error is computed between the relation score output for each pair z_r^(t) and c^(b) and the ground truth CS. This loss ensures that z_r is a semantically consistent latent representation.

Figure 4. Architecture of semantic relational matching model.

Table 1. Statistics of the AWA2, CUB, FLO, and SUN datasets, including visual feature dimension D_x, semantic feature dimension D_s, number of seen classes N_s, number of unseen classes N_u, and number of all instances N_i.

Table 2. Performance comparison in accuracy (%) on four datasets. U and S denote the accuracies on unseen and seen classes in GZSL, respectively, and H denotes their harmonic mean. The methods above and below the horizontal line correspond to non-generative and generative approaches, respectively. Results in bold font indicate the highest performance.

Table 3. Ablation study of different component combinations on the FLO and CUB datasets. Results are reported in %, with the best results highlighted in bold.