Contextual Feature Expansion with Superordinate Concept for Compositional Zero-Shot Learning

Department of Artificial Intelligence, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9837; https://doi.org/10.3390/app15179837
Submission received: 4 August 2025 / Revised: 1 September 2025 / Accepted: 3 September 2025 / Published: 8 September 2025

Abstract

Compositional Zero-Shot Learning (CZSL) seeks to enable machines to recognize objects and attributes (i.e., primitives), learn their associations, and generalize to novel compositions, exhibiting a human-like ability to infer and generalize. The existing approaches, multi-label and multi-class classification, face inherent trade-offs: the former suffers from biases against unrelated compositions, while the latter struggles with exponentially growing search spaces as the number of objects and attributes increases. To overcome these limitations and address the exponential complexity in CZSL, we introduce Concept-oriented Feature ADjustment (CoFAD), a novel method that extracts superordinate conceptual features based on primitive relationships and expands label feature boundaries. By incorporating spectral clustering and the membership function from fuzzy logic, CoFAD achieves state-of-the-art performance while using 2×–4× less GPU memory and reducing training time by up to 50× on a large-scale dataset.

1. Introduction

In nature, all objects or entities are associated with their respective superordinate concepts, and the attributes applied to these objects are strongly correlated with their corresponding superordinate categories [1,2]. For instance, water and juice, both belonging to the superordinate concept of liquid, can exhibit the attribute “spilled”. In contrast, the attribute “bright” might apply to both “lamp” (an object) and “butterfly” (an animal), yet these belong to vastly different superordinate categories. Learning the relationships between concepts forms the foundation of human cognition and represents an essential challenge for machine learning. Similarly, Compositional Zero-Shot Learning (CZSL) [3,4,5,6,7] aims to replicate this nuanced understanding in machines by enabling them to associate objects, attributes, and their superordinate categories. Recent advancements have transitioned CZSL from the traditional closed-world setting [3,4], which focuses on predicting existing combinations, to an open-world setting [8,9,10] that includes impossible combinations (e.g., rusty dog) in the search space. This transition accounts for the unpredictability and variability of real-world data, making CZSL more robust and applicable to practical scenarios.
Most studies on the CZSL task employ multi-label [5,8,11], multi-class [6,12,13], or multi-path [14,15] classification approaches. Multi-label classification predicts attributes and objects for a given input image. This disentanglement of compositions reduces computational cost to the number of primitives and eliminates the need to explicitly predict unseen compositions during inference. However, when loosely related superordinate concepts share similar attributes, this approach tends to be biased by the training data, leading to poor recognition of features in unseen compositions. The multi-class classification approach improves performance by incorporating all possible compositions into the model's search space, allowing it to learn relationships between compositions. However, as the number of attributes ($|A|$) and objects ($|O|$) increases, the search space grows exponentially to $|A| \times |O|$, leading to a substantial increase in computational cost. Table 1 clearly quantifies the increase in the search space. This issue also arises in the multi-path classification approach, in which the model predicts primitives and compositions simultaneously. The trade-off between performance and efficiency is a significant challenge in imitating human-like intelligence in real-world scenarios.
To overcome these limitations and enable more efficient recognition of unseen compositions, we propose a method called Concept-oriented Feature ADjustment (CoFAD), which expands a primitive's feature space to include unseen compositions using its associated superordinate concepts. CoFAD extracts the superordinate concepts of primitives based on the connections between compositions. These superordinate concepts represent shared features of associated primitives; for instance, "cat," "tiger," and "old" share the concept of creature or animal, while "car" and "truck" are associated with the superordinate concept of vehicle. As shown in Figure 1, CoFAD expands the feature space of primitives by assigning membership degrees to each superordinate concept, thereby allowing primitives to implicitly include other primitive features and correcting biases present in seen compositions. A significant advantage of this approach is the reduction of the model's search space to $|A| + |O|$, which dramatically decreases computational cost while simultaneously improving performance.
More concretely, the novelty of CoFAD lies in its integrated design: (i) it organizes primitives into semantically coherent clusters via fuzzy spectral clustering applied to compositional co-occurrence graphs (instead of raw embeddings), (ii) it assigns soft memberships of primitives across multiple superordinate concepts, and (iii) it leverages these memberships to expand and adjust label features for improved generalization to unseen attribute–object pairs. To the best of our knowledge, no prior CZSL framework has incorporated superordinate-level contextual modeling in this manner.
We conduct evaluations on three popular benchmark datasets: MIT-States [16], UT-Zappos [17], and C-GQA [7]. Our proposed CoFAD model demonstrates superior performance, surpassing state-of-the-art (SOTA) results in open-world scenarios. By leveraging diverse learning strategies, CoFAD effectively reduces the search space, achieving competitive performance even in closed-world settings. Notably, CoFAD offers exceptional computational efficiency compared to existing models, requiring 2×–4× less GPU memory and achieving training time reductions ranging from 3× to 50×. The primary contributions of this work are as follows:
  • Novel Model Architecture: We introduce the CoFAD model, which extends contextual feature boundaries through concept-oriented learning.
  • State-of-the-Art Performance: Experimental results on benchmark datasets establish CoFAD as SOTA in open-world scenarios, while maintaining strong performance in closed-world settings.
  • Enhanced Computational Efficiency: CoFAD achieves remarkable efficiency, utilizing significantly less GPU memory and reducing training times by up to 50× compared to SOTA models.

2. Related Work

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute–object combinations by generalizing beyond seen compositions. Existing approaches can be broadly categorized into three main paradigms, based on how they represent and predict compositions:
(1) Multi-Label Classification Methods. These methods treat attribute and object recognition as separate tasks and train individual classifiers to independently predict each primitive from an image. This disentangled prediction scheme reduces the compositional search space to $|A| + |O|$, enabling computational efficiency. Representative works include attention-based disentanglers [18], knowledge-guided classifiers [12], Siamese contrastive embeddings [13], and modular representations [6,19]. However, these models often fail to capture compositional dependencies, especially when attributes exhibit different semantics under varying object contexts.
(2) Multi-Class (Compositional Space) Methods. These approaches model each attribute–object pair as a distinct class and learn joint representations for both seen and unseen combinations. Transformation-based models such as MLPs [3,9] and GCNs [7] are commonly used to encode relationships across the entire $A \times O$ space. While expressive, these methods incur high computational costs and face challenges in scalability and label sparsity, especially for large-scale CZSL benchmarks like C-GQA.
(3) Multi-Path and Prompt-Based Vision-Language Models. Recent works adopt multi-path architectures or utilize pre-trained vision-language models (VLMs) such as CLIP [20] to jointly model compositions and primitives. For example, CSP [21] replaces class-specific prompts with trainable attribute/object tokens. Troika [14] introduces a cross-modal prompting strategy to generate context-aware textual features, while CDS-CZSL [15] emphasizes context diversity and specificity through multi-path design. These models improve alignment between visual features and composition labels, but still operate within a fixed composition set and do not explicitly organize primitives into semantically coherent clusters.
Despite the progress achieved by these approaches, challenges persist, including limited generalization to unseen data and significant computational overhead, highlighting the need for innovative solutions. In this study, we introduce a novel, efficient multi-label classification framework based on VLMs.

3. Preliminaries

3.1. Problem Formulation

Compositional Zero-Shot Learning (CZSL) is concerned with modeling images as compositions of primitives, specifically attributes (e.g., old) and objects (e.g., car). The composition space in CZSL is defined as the Cartesian product of all possible attributes and objects, $Y = A \times O$, capturing all attribute–object combinations. This space is divided into two disjoint sets: seen compositions ($Y^S$) and unseen compositions ($Y^U$), such that $Y^S \cap Y^U = \emptyset$ and $Y^S \cup Y^U = Y$. During training, the model learns from a dataset $S = \{(x, y) \mid x \in X^S, y \in Y^S\}$, where each image $x$ is labeled with a composition $y = (a, o)$. At test time, the model is required to predict labels for both seen and unseen compositions within the test label space $Y^{test} = Y^S \cup Y^U$. Depending on the evaluation setting, the task can be restricted to a predefined subset of unseen compositions in closed-world evaluation (CW-CZSL) or extended to all possible compositions in open-world evaluation (OW-CZSL). The goal of CZSL is to learn a model $f: X \rightarrow Y^{test}$ capable of recognizing images from novel attribute–object compositions.
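The following minimal Python sketch illustrates this label-space setup; the attribute and object names and the particular split are toy assumptions, not the benchmark vocabularies.

```python
# Minimal sketch of the CZSL label-space setup described above.
# The attribute/object names and the split are illustrative placeholders.
from itertools import product

attributes = ["old", "wet", "rusty"]
objects = ["car", "dog", "boat"]

# Full composition space Y = A x O
Y = set(product(attributes, objects))

# Example split into seen (training) and unseen compositions
Y_seen = {("old", "car"), ("wet", "dog"), ("rusty", "boat")}
Y_unseen_closed = {("old", "boat"), ("wet", "car")}   # predefined subset (CW-CZSL)
Y_unseen_open = Y - Y_seen                            # every remaining pair (OW-CZSL)

assert Y_seen.isdisjoint(Y_unseen_open)
print(len(Y), len(Y_seen), len(Y_unseen_closed), len(Y_unseen_open))
```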

3.2. Construct Prompt and Backbone

To leverage pre-trained knowledge from CLIP, following the approach in prior work [14], we construct prompts for each primitive using pre-trained embeddings from CLIP [20]. A new primitive vocabulary $V = [V_A, V_O] \in \mathbb{R}^{(|A|+|O|) \times dim}$ is created for all attributes and objects, where $dim$ represents the embedding dimension of each token. To generate prompts for each primitive, we append a tokenized prefix, "a photo of", to the primitives. The prompt for each primitive is defined as $P_i = [p_1, \dots, p_m, v_i]$, where $\{p_1, \dots, p_m\}$ are the prefix tokens, and all tokens are fully trainable. The text feature of a primitive is obtained by feeding the prompt $P_i$ into the CLIP-based text encoder $E_t$, formulated as:
$$x_i^t = E_t(P_i).$$
Following prior works [14,15], we utilize an Adapter [22] for the visual encoder $E_v$, enabling adaptation without updating its parameters. The composition visual feature extracted from the encoder $E_v$ is disentangled into two primitives via disentangler layers $D_a$ and $D_o$, formulated as:
$$x_a^v = D_a(E_v(x)), \quad x_o^v = D_o(E_v(x)).$$
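As a rough illustration of the pipeline in this section, the sketch below builds trainable prompts $P_i$ and disentangles a visual feature into attribute and object branches. The encoder stubs, image size, and dimensions are placeholder assumptions; the actual model uses frozen CLIP ViT-L/14 encoders with an Adapter.

```python
# A minimal PyTorch sketch of the prompt construction and visual disentanglement.
import torch
import torch.nn as nn

dim = 512                                       # assumed token/embedding dimension
num_attrs, num_objs, prefix_len = 115, 245, 3   # prefix ~ "a photo of"

# Trainable primitive vocabulary V = [V_A, V_O] and shared prefix tokens p_1..p_m
primitive_vocab = nn.Parameter(torch.randn(num_attrs + num_objs, dim))
prefix_tokens = nn.Parameter(torch.randn(prefix_len, dim))

def build_prompt(i: int) -> torch.Tensor:
    """P_i = [p_1, ..., p_m, v_i]: prefix tokens followed by the i-th primitive token."""
    return torch.cat([prefix_tokens, primitive_vocab[i].unsqueeze(0)], dim=0)

# Stand-ins for the frozen CLIP text/visual encoders (E_t, E_v); not real CLIP.
text_encoder = nn.Sequential(nn.Flatten(), nn.Linear((prefix_len + 1) * dim, dim))
visual_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))

# Disentangler heads D_a and D_o that split the composition feature into primitives
D_a, D_o = nn.Linear(dim, dim), nn.Linear(dim, dim)

image = torch.randn(1, 3, 32, 32)                      # tiny stand-in image
x_t = text_encoder(build_prompt(0).unsqueeze(0))       # text feature of primitive 0
x_a_v, x_o_v = D_a(visual_encoder(image)), D_o(visual_encoder(image))
print(x_t.shape, x_a_v.shape, x_o_v.shape)
```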

4. Methodology

Approach. As discussed in the Introduction, recent CZSL methodologies often suffer from two critical limitations: models either fail to learn the relationships between primitives due to biases in the training set, or they incur substantial computational costs to handle the large search space of $|A| \times |O|$. We hypothesize that this trade-off between performance and computational cost arises from the design of current models, which focus on learning direct one-to-one associations between attribute–object pairs. To address this issue, we propose a novel method that enables the model to learn the relationships between a subordinate concept (i.e., a primitive) and its associated superordinate concepts. Specifically, an attribute such as "old" manifests differently depending on the superordinate concept. For example, within animals, "old" may be characterized by increased body mass or fur growth, whereas in vehicles, it is reflected in features such as rust or outdated designs. However, subordinate concepts under the same superordinate category, such as truck and car within vehicles, exhibit similar characteristics for the "old" attribute. Based on this principle, we introduce CoFAD (Concept-oriented Feature ADjustment), a novel method that combines spectral clustering with the membership function in fuzzy logic. This approach enables the model to effectively learn the associations between primitives and the feature representations of their superordinate concepts. The overall framework of our proposed method is illustrated in Figure 2.

4.1. Fuzzy Spectral Clustering

Learning the relationships between primitives and their associated superordinate concepts requires identifying feature vectors that represent these superordinate concepts. In CZSL, a common approach involves minimizing a cosine-similarity-based loss between the visually encoded features and the word embeddings of attribute–object compositions. However, the pre-trained word embeddings used in this process (e.g., Word2Vec, FastText, or the embedding layer of the backbone network) often contain noise unrelated to the features present in the dataset, which can hinder accurate learning. To eliminate this noise and ensure stable modeling of superordinate concepts in sparse attribute–object graphs, we leverage theoretical insights from spectral graph theory. Spectral clustering, by construction, relies on the eigendecomposition of the normalized Laplacian matrix, which is guaranteed to yield stable clusters under mild assumptions of graph connectivity and sparsity. In our case, the attribute–object co-occurrence graph derived from the training compositions is typically sparse but connected within subdomains (e.g., vehicles, animals), ensuring non-trivial eigenvalue spectra. Formally, given a set of attributes $A = \{a_0, a_1, \dots, a_n\}$ and a set of objects $O = \{o_0, o_1, \dots, o_m\}$, we define a combined set $U = A \cup O$, where $U = \{u_0, \dots, u_n, u_{n+1}, \dots, u_{n+m}\}$. The adjacency matrix $M$ for the compositions in the training set is defined as follows:
$$M_{i,j} = \begin{cases} 1 & \text{if } (u_i, u_j) \in Y^S, \\ 0 & \text{otherwise.} \end{cases}$$
To represent the graph structure, the Laplacian matrix $L$ is computed using the degree matrix $D$, whose diagonal entries are defined as $D[i,i] = \sum_j M[i,j]$. The normalized Laplacian matrix is then calculated as
$$L = I - D^{-\frac{1}{2}} M D^{-\frac{1}{2}},$$
where $I$ denotes the identity matrix. Next, eigenvalue decomposition is performed on $L$:
$$L v_q = \lambda_q v_q,$$
where $\lambda_q$ represents the $q$-th eigenvalue and $v_q$ is the corresponding eigenvector. The eigenvectors corresponding to the smallest $\eta$ eigenvalues are selected to form the matrix $V$, where $\eta$ equals the number of clusters determined for each dataset. The K-means clustering algorithm is then applied to the eigenvector matrix $V$.
From the perspective of spectral graph theory, the stability and semantic interpretability of fuzzy spectral clustering in CoFAD can be formalized through the eigen-gap property of the normalized Laplacian matrix. The eigenvectors corresponding to the smallest $\eta$ eigenvalues of $L$ are used to embed primitives into a low-dimensional subspace for clustering. A fundamental result of spectral clustering [23] states that if the eigen-gap
$$\Delta_\eta = \lambda_{\eta+1} - \lambda_\eta$$
between the $\eta$-th and $(\eta+1)$-th eigenvalues is sufficiently large, then the partition induced by the first $\eta$ eigenvectors is stable against perturbations in the graph structure (e.g., noise in adjacency weights or incomplete compositional coverage). In other words, the larger the eigen-gap, the more robust the resulting clusters are to small variations in data. In our context, this condition has two important implications. First, the attribute–object graphs used in CZSL are typically sparse but exhibit natural community-like structures around superordinate categories (e.g., "vehicle," "animal"). A non-trivial eigen-gap arises when these communities are well separated, ensuring that fuzzy spectral clustering produces stable and semantically meaningful clusters despite sparsity. Second, because the eigen-gap controls how distinct the clusters are in the embedding space, a significant $\Delta_\eta$ guarantees that the derived clusters correspond to semantically coherent superordinate concepts. This ensures that primitives grouped together (e.g., cat, dog, tiger) share contextual regularities rather than being artifacts of noisy graph edges.
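The sketch below walks through the clustering pipeline described above on a toy primitive graph: adjacency construction, normalized Laplacian, eigendecomposition with an eigen-gap check, and K-means on the leading eigenvectors. The toy compositions and the value of $\eta$ are illustrative assumptions.

```python
# A compact NumPy sketch of the spectral embedding step in Section 4.1.
import numpy as np
from sklearn.cluster import KMeans

primitives = ["old", "wet", "car", "truck", "dog", "cat"]   # U = A ∪ O
index = {u: i for i, u in enumerate(primitives)}
seen_pairs = [("old", "car"), ("old", "truck"), ("wet", "dog"), ("wet", "cat")]

n = len(primitives)
M = np.zeros((n, n))
for a, o in seen_pairs:                      # symmetric adjacency over U
    M[index[a], index[o]] = M[index[o], index[a]] = 1.0

deg = M.sum(axis=1)
d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))  # guard isolated nodes
L = np.eye(n) - d_inv_sqrt @ M @ d_inv_sqrt                  # normalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)         # ascending eigenvalues
eta = 2                                      # number of superordinate concepts (assumed)
eigen_gap = eigvals[eta] - eigvals[eta - 1]  # Δ_eta = λ_{eta+1} - λ_eta (0-indexed)
V = eigvecs[:, :eta]                         # spectral embedding of the primitives

km = KMeans(n_clusters=eta, n_init=10, random_state=0).fit(V)
centroids = km.cluster_centers_              # μ_k, used later for fuzzy memberships
print("eigen-gap:", round(eigen_gap, 3), "clusters:", dict(zip(primitives, km.labels_)))
```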
The clusters and centroids obtained solely from the connectivity information among primitives represent the superordinate concepts and their respective representative values. Since a primitive can belong to multiple superordinate concepts rather than being restricted to a single one, the degree of membership to each superordinate concept is calculated based on Euclidean distance using the membership function in fuzzy logic. The Euclidean distance $d_{i,k}$ between label point $i$ and the $k$-th cluster center is calculated as follows:
$$d_{i,k} = \lVert V[i] - \mu_k \rVert,$$
where $\mu_k$ is the center of the $k$-th cluster. Finally, the fuzzy membership $u_{i,k}$ of each label eigenvector is determined based on the distances $d_{i,k}$:
$$u_{i,k} = \frac{1}{d_{i,k}^{\frac{2}{f-1}}} \Big/ \sum_{j=1}^{\eta} \frac{1}{d_{i,j}^{\frac{2}{f-1}}},$$
where $f$ is the fuzziness parameter. This fuzzy membership quantifies the degree of association of each label with a given cluster. In other words, fuzzy membership represents the degree to which each primitive belongs to the $\eta$ superordinate concepts. A highly skewed membership indicates a specialized primitive with fewer associated compositions, while a more uniform membership suggests a common concept capable of exhibiting diverse characteristics.
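A small NumPy sketch of this membership computation is given below, assuming the spectral embedding $V$ and the K-means centroids $\mu_k$ from the previous step; the toy coordinates and the choice $f = 2$ are assumptions for illustration.

```python
# Fuzzy membership computation corresponding to the equation above.
import numpy as np

def fuzzy_memberships(V: np.ndarray, mu: np.ndarray, f: float = 2.0) -> np.ndarray:
    """u[i, k] ∝ 1 / d(i, k)^(2/(f-1)), normalized over the eta clusters."""
    d = np.linalg.norm(V[:, None, :] - mu[None, :, :], axis=-1)   # (num_primitives, eta)
    d = np.maximum(d, 1e-12)                                      # avoid division by zero
    inv = 1.0 / d ** (2.0 / (f - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Toy example: 3 primitives embedded in 2-D, 2 cluster centers
V = np.array([[0.0, 0.1], [0.9, 1.0], [0.5, 0.5]])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
u = fuzzy_memberships(V, mu, f=2.0)
print(u.round(3), u.sum(axis=1))   # rows sum to 1; the middle point has mixed membership
```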
The fuzziness parameter f in fuzzy spectral clustering regulates the degree to which each primitive can belong to multiple superordinate concepts. A lower f value enforces hard assignments, where each primitive is associated with only one cluster, limiting the model’s ability to capture contextual overlaps across concepts. Conversely, a higher f value distributes membership more evenly, which prevents overfitting to specific clusters but risks diluting discriminative signals. By choosing an intermediate value, the model achieves a balance: it can capture semantic overlaps (e.g., “old car” and “old tiger” both reflecting age, albeit differently) while retaining sufficient discriminability. This theoretical trade-off underpins the observed improvement in generalization, as the model learns richer, context-aware feature boundaries.
The number of clusters η directly influences the granularity of superordinate concept formation. Too few clusters cause unrelated primitives to be grouped together, introducing noise; too many clusters result in fragmentation, reducing the benefits of concept sharing. Spectral clustering, guided by feasibility-driven adjacency, provides a mathematically grounded way to approximate the intrinsic structure of the attribute–object graph. In essence, η controls the level of abstraction: moderate values preserve both intra-concept similarity and inter-concept diversity, which empirically leads to improved generalization.

4.2. Feasibility Score

Spectral clustering effectively excludes noise present in word features, allowing the model to focus solely on the given label settings. However, it still relies exclusively on training set compositions, resulting in feasible unseen compositions being assigned a value of 0 in the adjacency matrix $M$. We believe that incorporating compositional feasibility for unseen connections into the computation of the adjacency matrix $M$ can improve the accuracy of spectral clustering. To achieve this, we adopt the compositional feasibility estimation method from previous work [8], which is based on the conjecture that similar objects share similar attributes, whereas dissimilar objects do not. For an unseen composition $(a_{uc}, o_{uc})$, the attribute $a_{uc}$ and object $o_{uc}$ are paired with a set of objects $O_{sc} = \{o_0, o_1, \dots, o_i\}$ and attributes $A_{sc} = \{a_0, a_1, \dots, a_j\}$, respectively, from the training set $Y^S$. The feasibility scores of the unseen composition with respect to the object $o_{uc}$ and attribute $a_{uc}$ are defined as follows:
$$\rho_{obj}(a_{uc}, o_{uc}) = \max_{o \in O_{sc}} \cos(x_{o_{uc}}^t, x_o^t), \quad \rho_{attr}(a_{uc}, o_{uc}) = \max_{a \in A_{sc}} \cos(x_{a_{uc}}^t, x_a^t),$$
where $x_o^t$ represents the text feature of $o$ encoded by the pre-trained text encoder $E_t$ and $\cos(\cdot)$ denotes the cosine similarity function. The mixed feasibility score for the adjacency matrix is then defined as
$$\rho_{uc} = \max\left(\frac{\rho_{obj} + \rho_{attr}}{2},\ 0\right).$$
The adjacency matrix $M$ is redefined using $\rho_{uc}$ as follows:
$$M_{i,j} = \begin{cases} 1 & \text{if } (u_i, u_j) \in Y^S, \\ \rho_{uc} & \text{if } (u_i, u_j) \in Y^U, \\ 0 & \text{otherwise.} \end{cases}$$
Using this updated adjacency matrix $M$, the CoFAD model performs fuzzy spectral clustering as described in Section 4.1. This integration enables CoFAD to incorporate feasibility-driven connections for unseen compositions, enhancing the clustering process and improving generalization to novel attribute–object pairs. For more details on the feasibility score, refer to the prior study [8].
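The following hedged sketch mirrors this feasibility computation on toy data; the random vectors stand in for the CLIP text features $x^t$, and only the pairing logic follows the description above.

```python
# Feasibility-weighted adjacency entries for unseen compositions (Section 4.2 sketch).
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
attrs, objs = ["old", "wet"], ["car", "truck", "dog"]
feat = {p: rng.normal(size=64) for p in attrs + objs}   # placeholder text features
seen = {("old", "car"), ("old", "truck"), ("wet", "dog")}

def feasibility(a_uc: str, o_uc: str) -> float:
    O_sc = [o for (a, o) in seen if a == a_uc]          # objects seen with a_uc
    A_sc = [a for (a, o) in seen if o == o_uc]          # attributes seen with o_uc
    rho_obj = max((cos(feat[o_uc], feat[o]) for o in O_sc), default=0.0)
    rho_attr = max((cos(feat[a_uc], feat[a]) for a in A_sc), default=0.0)
    return max((rho_obj + rho_attr) / 2.0, 0.0)

# Unseen edges in the adjacency matrix receive this soft weight instead of 0
print({(a, o): round(feasibility(a, o), 3)
       for a in attrs for o in objs if (a, o) not in seen})
```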

4.3. Label Adjustment

To represent the superordinate concepts of textual features, the centroid $c_k^t$ for a cluster $C_k$ is calculated as
$$c_k^t = \frac{1}{|C_k|} \sum_{i \in C_k} x_i^t.$$
The superordinate concept feature for each label is then aggregated as a weighted sum of cluster centroids, using the fuzzy membership values computed earlier. (To avoid confusion: the centroids $\mu_k$ derived in Section 4.1 are calculated from eigenvectors, whereas $c_k^t$ is derived from text features.) The aggregated superordinate concept feature is defined as
$$x_i^{super} = \sum_{k=1}^{\eta} u_{i,k} \cdot c_k^t,$$
where $u_{i,k}$ represents the fuzzy membership of label $i$ in cluster $k$.
Finally, a label adjustment layer projects the concatenated concept features $[x_i^t, x_i^{super}]$ into an adjusted label feature $x_i^{adj}$. This adjustment enables the model to learn the relationships between primitives and multiple superordinate concepts effectively.
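A brief PyTorch sketch of this label adjustment is shown below; the dimensions, hard cluster assignments, and memberships are illustrative placeholders rather than values produced by the full pipeline.

```python
# Label adjustment (Section 4.3 sketch): concept centroids of text features,
# membership-weighted aggregation, and projection of the concatenated features.
import torch
import torch.nn as nn

dim, num_primitives, eta = 512, 6, 2
x_t = torch.randn(num_primitives, dim)                  # text features of the primitives
hard_clusters = torch.tensor([0, 0, 0, 1, 1, 1])        # from fuzzy spectral clustering
u = torch.softmax(torch.randn(num_primitives, eta), 1)  # fuzzy memberships u[i, k]

# c_k^t: mean of the text features assigned to cluster k
c_t = torch.stack([x_t[hard_clusters == k].mean(dim=0) for k in range(eta)])

# x_i^super = sum_k u[i, k] * c_k^t, then project [x_i^t, x_i^super] -> x_i^adj
x_super = u @ c_t
label_adjust = nn.Linear(2 * dim, dim)
x_adj = label_adjust(torch.cat([x_t, x_super], dim=-1))
print(x_adj.shape)                                      # (num_primitives, dim)
```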

4.4. Training and Inference

Training Objectives. The logits for predicting the attribute and object labels of an image $x$ are given by
$$logit_a = \cos(x_a^v, x_{a_i}^{adj}), \quad logit_o = \cos(x_o^v, x_{o_j}^{adj}).$$
To encourage the model to consider the full search space $A \times O$, pairwise summations are computed as $logit_c = logit_a + logit_o$. The classification losses are defined as
$$\mathcal{L}_a = CE(logit_a, y_a), \quad \mathcal{L}_o = CE(logit_o, y_o), \quad \mathcal{L}_c = CE(logit_c, y),$$
where $CE(\cdot)$ denotes the cross-entropy objective function. The total loss $\mathcal{L}$ is a weighted sum of these losses:
$$\mathcal{L} = \lambda_a \mathcal{L}_a + \lambda_o \mathcal{L}_o + \lambda_c \mathcal{L}_c,$$
where $\lambda_a, \lambda_o, \lambda_c \in \mathbb{R}$ are weights balancing the contributions of the different losses.
Inference. During testing, the logits of attributes and objects are combined using a pairwise product to adjust the composition logits. The most likely composition is then predicted as
$$\hat{p}(y_{i,j} \mid x) = \omega(logit_c) + \omega(logit_a) \cdot \omega(logit_o), \quad \hat{y} = \arg\max_{(i,j)} \hat{p}(y_{i,j} \mid x),$$
where $\omega$ represents the softmax function.
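The self-contained sketch below assembles these logits, losses, and the inference rule on random features; the shapes, loss weights, and composition indexing are assumptions chosen only to make the example runnable.

```python
# Training losses and inference rule (Section 4.4 sketch) on placeholder features.
import torch
import torch.nn.functional as F

num_attrs, num_objs, dim, batch = 4, 5, 64, 2
x_a_v = torch.randn(batch, dim)                 # disentangled attribute visual feature
x_o_v = torch.randn(batch, dim)                 # disentangled object visual feature
attr_adj = torch.randn(num_attrs, dim)          # adjusted attribute label features
obj_adj = torch.randn(num_objs, dim)            # adjusted object label features
y_a, y_o = torch.tensor([0, 2]), torch.tensor([1, 3])

logit_a = F.normalize(x_a_v, dim=-1) @ F.normalize(attr_adj, dim=-1).T   # cosine logits
logit_o = F.normalize(x_o_v, dim=-1) @ F.normalize(obj_adj, dim=-1).T
logit_c = logit_a[:, :, None] + logit_o[:, None, :]                      # pairwise sums over A x O

lam_a = lam_o = lam_c = 1.0
loss = (lam_a * F.cross_entropy(logit_a, y_a)
        + lam_o * F.cross_entropy(logit_o, y_o)
        + lam_c * F.cross_entropy(logit_c.flatten(1), y_a * num_objs + y_o))

# Inference: softmax over compositions plus the product of primitive softmaxes
p_hat = (torch.softmax(logit_c.flatten(1), dim=-1)
         + (torch.softmax(logit_a, -1)[:, :, None]
            * torch.softmax(logit_o, -1)[:, None, :]).flatten(1))
pred = p_hat.argmax(dim=-1)
pred_attr, pred_obj = pred // num_objs, pred % num_objs
print(loss.item(), pred_attr.tolist(), pred_obj.tolist())
```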

5. Experiments

5.1. Experimental Setup

Datasets. CoFAD was evaluated on three widely used CZSL benchmark datasets: MIT-States [16], UT-Zappos [17], and C-GQA [7].
  • MIT-States, collected via older search engines, includes diverse compositions without distinguishing between living and non-living entities, such as "Burnt Wood" or "Tiny Cat." It contains 115 attributes and 245 objects, with 26,114 out of 28,175 compositions being non-existent labels (≈93%). In addition, because it is compiled from web-scraped and automatically annotated images, MIT-States contains numerous mislabeled instances, occlusions, and low-quality samples, conditions that closely resemble the imperfections typically observed in large-scale real-world corpora.
  • UT-Zappos focuses on fine-grained images of shoes, such as “Suede Slippers” or “Cotton Sandals.” It includes 16 attributes and 12 objects, with 76 out of 192 compositions being non-existent labels (≈40%). Due to the extremely subtle inter-class differences (e.g., “Leather Boots” vs. “Suede Boots”), even human annotators often struggle to differentiate categories, making this dataset an effective stand-in for noisy or ambiguous inputs.
  • C-GQA, built on the Stanford GQA dataset [24], shares similar primitives with MIT-States but includes a significantly larger number of labels. It comprises 413 attributes and 674 objects, resulting in nearly 280,000 possible compositions. However, only 7555 compositions are valid, with approximately 97% being non-existent pairs. This extreme imbalance mirrors the combinatorial explosion and sparsity of feasible compositions that occur in real-world knowledge graphs and retrieval-based applications.
For a fair comparison, we used the datasets and train/validation/test splits provided by previous works [5,21].

5.2. Metrics

We adopt the established evaluation protocols [5,7,25] and report all results using four key metrics. Specifically, we measure the best seen score (S), where a large bias term limits the model to predicting only among seen labels, and the best unseen score (U), which reflects the model's zero-shot performance by predicting only unseen labels. To assess the balance between seen and unseen performance, we report the best harmonic mean (HM), which captures the trade-off between the two. Additionally, we provide the area under the seen–unseen curve (AUC) by varying the calibration bias. Both HM and AUC are core metrics for quantitatively evaluating models in CZSL tasks, offering comprehensive insights into their generalization capabilities. These four metrics (S, U, HM, and AUC) constitute the standard evaluation framework in CZSL for both closed-world and open-world settings.
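As a rough, simplified illustration of how these metrics are obtained from a calibration-bias sweep, the sketch below uses synthetic scores; the official evaluation code includes further details (e.g., per-pair feasibility masks and dataset-specific reporting conventions) that are omitted here.

```python
# Simplified, synthetic illustration of the S / U / HM / AUC calibration sweep.
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_pairs = 200, 50
unseen_pair = np.zeros(num_pairs, dtype=bool)
unseen_pair[25:] = True                                  # pairs 25..49 are "unseen"
scores = rng.normal(size=(num_samples, num_pairs))       # model scores per composition
labels = rng.integers(0, num_pairs, size=num_samples)
sample_is_unseen = unseen_pair[labels]

seen_acc, unseen_acc = [], []
for bias in np.linspace(-3, 3, 61):                      # calibration bias on unseen pairs
    pred = (scores + bias * unseen_pair).argmax(axis=1)
    correct = pred == labels
    seen_acc.append(correct[~sample_is_unseen].mean())
    unseen_acc.append(correct[sample_is_unseen].mean())

seen_acc, unseen_acc = np.array(seen_acc), np.array(unseen_acc)
hm = 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc + 1e-12)
order = np.argsort(seen_acc)
s, u = seen_acc[order], unseen_acc[order]
auc = float(np.sum(np.diff(s) * (u[1:] + u[:-1]) / 2))   # area under the seen-unseen curve
print(f"S={seen_acc.max():.3f} U={unseen_acc.max():.3f} HM={hm.max():.3f} AUC={auc:.4f}")
```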

5.3. Implementation Details

In our experiments, all baseline models, as well as our proposed model CoFAD, were implemented using PyTorch 2.4.1+cu121 [26] and utilized the pre-trained CLIP ViT-L/14 as the backbone. The models were trained and evaluated on a single NVIDIA A5000 GPU and an Intel Xeon Silver 4314 processor. For spectral clustering in CoFAD, the number of clusters was set to 4 for the UT-Zappos and MIT-States datasets, and 8 for the C-GQA dataset. This choice was made based on a greedy search over candidate values $\eta \in \{2, 4, 6, 8, 10\}$, and the final setting was selected according to the highest validation performance in terms of harmonic mean (HM). The loss weights $\lambda_a$, $\lambda_o$, and $\lambda_c$ were fixed at 1.0 across all settings. While these weights can be tuned to emphasize attributes, objects, or compositions differently, we intentionally use equal weighting to prevent bias and to maintain a balanced optimization objective. This decision also facilitates fair comparisons across datasets and aligns with recent multi-path compositional learning approaches.

5.4. Comparison with State-of-the-Arts

Table 2 presents a comparison of our method with previous studies in the open-world setting. CoFAD demonstrates either superior performance or comparable results relative to previous state-of-the-art (SOTA) methods across all datasets. On both UT-Zappos and C-GQA, CoFAD demonstrates its strength in generalization, achieving the highest performance in HM (49.9 on UT-Zappos and 12.5 on C-GQA) and AUC (33.9 on UT-Zappos and 3.4 on C-GQA). On MIT-States, CoFAD achieves an HM of 20.2 and an AUC of 7.3, slightly lower than CDS-CZSL [15] but obtained with a significantly reduced search space ($|A| + |O|$ compared to $|A| \times |O|$). This competitive performance and substantial improvement in cost efficiency underscore the effectiveness and strengths of CoFAD in addressing CZSL tasks. A detailed comparison of cost efficiency is provided in the following section.

5.5. Cost Efficiency

We conducted a comparative analysis of training cost efficiency against state-of-the-art (SOTA) models, as shown in Figure 3. The comparison was conducted on the compositionally diverse C-GQA dataset in an open-world setting, with all models trained using a batch size of 2 and ViT-B/32 as the backbone architecture. The results demonstrate a dramatic improvement in both training time and GPU memory usage for our approach. Most high-performance CLIP-based models generate compositional features covering the $|A| \times |O|$ search space, leading to substantial GPU memory consumption and extended training duration. In contrast, the proposed CoFAD adopts a multi-label approach, reducing the search space to $|A| + |O|$. This results in GPU memory usage being reduced by at least 2× and up to 4× compared to the baselines. Moreover, training time is reduced by at least 3× and up to 50×. Notably, despite CSP [21] being the smallest model among the baselines, CoFAD achieves over 2× improvements in both speed and memory efficiency while simultaneously delivering nearly double the performance.

5.6. Discussion

CoFAD utilizes a multi-label classification approach, enabling it to learn both primitives and full compositions even in the closed-world setting. To address the challenge of handling a large search space, we conducted experiments using two methods: (1) a masking method, where logits for labels outside the closed-world pairs are multiplied by "$1 \times 10^{8}$" to mask them during loss computation, and (2) a discard method, where logits for these labels are excluded entirely and their loss is not computed. Table 3 presents the experimental results in the closed-world setting. The masking method demonstrates improved performance and produces results comparable to those of SOTA models, whereas the discard method leads to a decrease in performance. This result suggests that CoFAD, which learns the relationships between primitives and their superordinate concepts, is negatively affected by the uncertainty introduced when unseen compositions are discarded in the loss function. By contrast, calculating the loss for all labels while using masking to identify irrelevant compositions allows the model to effectively learn and distinguish meaningful patterns, proving to be more effective for zero-shot learning tasks.
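For concreteness, the sketch below contrasts the two variants on toy composition logits. It realizes masking by pushing out-of-closed-world logits to a large negative value before the loss, which is a common way to implement the idea; the exact constant and scaling in the masking method described above may differ, and all shapes here are illustrative.

```python
# Masking vs. discard on composition logits over A x O (closed-world sketch).
import torch
import torch.nn.functional as F

batch, num_pairs = 2, 12
logit_c = torch.randn(batch, num_pairs)
closed_world = torch.zeros(num_pairs, dtype=torch.bool)
closed_world[:8] = True                          # first 8 pairs form the closed world
y = torch.tensor([1, 5])                         # targets indexed in the full pair space

# (1) Masking: suppress out-of-closed-world logits so the loss still covers all pairs
#     but those entries contribute (virtually) nothing after the softmax.
masked = logit_c.masked_fill(~closed_world, -1e8)
loss_masking = F.cross_entropy(masked, y)

# (2) Discard: drop the out-of-closed-world columns entirely and re-index the targets.
kept = closed_world.nonzero(as_tuple=True)[0]
remap = {int(p): i for i, p in enumerate(kept)}
loss_discard = F.cross_entropy(logit_c[:, kept], torch.tensor([remap[int(t)] for t in y]))

print(loss_masking.item(), loss_discard.item())
```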
In addition to the masking and discard strategies, we further conducted an ablation experiment by removing the fuzzy spectral clustering component from CoFAD, denoted as CoFAD (w/o Fuzzy Spectral Clustering). This variant eliminates the process of learning the superordinate concept structure and instead relies solely on the raw label features without contextual expansion. As shown in Table 3, the absence of fuzzy spectral clustering leads to a consistent drop in performance across both datasets: on UT-Zappos, HM drops from 54.2 (full CoFAD) to 48.9, and on C-GQA, HM drops from 28.0 to 25.3. Similarly, AUC decreases by 5.2 points on UT-Zappos and 2.3 points on C-GQA. These results confirm that the contextual feature expansion via fuzzy clustering plays a critical role in improving compositional generalization. Without modeling the shared structure among primitives, the model becomes less capable of distinguishing subtle attribute–object relationships, especially in fine-grained or compositionally dense datasets like C-GQA. This ablation validates our core hypothesis: that superordinate-level organization is essential for mitigating overfitting to seen compositions and enhancing generalization to unseen combinations, even in the closed-world setting.
While CoFAD demonstrates strong performance across both open-world and closed-world CZSL settings, its effectiveness may be limited in certain conditions. In particular, when the amount of training data is extremely limited and contains only a narrow range of primitive combinations, it becomes challenging to construct a meaningful composition graph. As a result, the spectral clustering process may produce fragmented or noisy clusters, which can hinder the model’s ability to derive reliable superordinate concepts. This, in turn, weakens the benefits of contextual feature expansion.

5.7. Qualitative Results

The visualization of contextual labels is challenging due to the complex interrelations among them, which makes analysis nontrivial [33]. Inspired by prior works [4,6,18,25,33], we perform a qualitative evaluation under scenarios where primitives are associated with different superordinate concepts, leading to significant visual differences.
As illustrated in Figure 4, CoFAD consistently demonstrates superior contextual reasoning in distinguishing compositions. For the cases where two objects belong to different superordinate concepts but share the same attribute, CoFAD enables better semantic alignment by accurately identifying related compositions, such as “Caramelized Beef” or “Caramelized Sugar”. Without CoFAD, the model fails to maintain contextual consistency, misclassifying labels as semantically unrelated combinations like “Caramelized Sauce” or “Molten Sugar”. Similarly, for the cases where two attributes belong to different superordinate concepts but share the same object, CoFAD improves semantic coherence by correctly interpreting visual attributes, such as “Diced Cheese” and “Burnt Fence,” whereas the absence of CoFAD leads to unrelated predictions like “Splintered Wood” or “Moldy Cheese.”
Earlier, we noted that the multi-label approach tends to be biased in the training set when loosely related superordinate concepts share similar attributes, leading to poor recognition of features in unseen compositions. This bias is also evident in the qualitative results, where for food categories involving sugar and cheese, the model without CoFAD often leans toward the “Molten” attribute. In contrast, the results obtained by applying CoFAD demonstrate that this issue has been effectively mitigated.

6. Conclusions

This paper introduces a novel approach termed Concept-oriented Feature ADjustment (CoFAD), designed specifically for Compositional Zero-Shot Learning (CZSL). CoFAD addresses the critical trade-off between performance and computational efficiency inherent in multi-label and multi-class classification approaches. By incorporating a design that enables the model to effectively capture and learn the relationships between individual labels and their corresponding superordinate concepts, CoFAD demonstrates remarkable generalization capabilities to unseen attribute–object compositions. Comprehensive experiments conducted on multiple benchmark datasets highlight the superior performance and efficiency of CoFAD. These findings emphasize the pivotal role of concept-oriented feature learning in real-world CZSL scenarios and establish a foundation for advancing efficient multi-label classification strategies in future research.

Author Contributions

Conceptualization, S.K. and Y.S.C.; methodology, S.K.; software, S.K.; validation, S.K.; formal analysis, S.K.; investigation, S.K.; resources, S.K.; data curation, S.K.; writing—original draft preparation, S.K.; writing—review and editing, S.K. and Y.S.C.; visualization, S.K.; supervision, Y.S.C.; project administration, Y.S.C.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grants (No. RS-2025-25422680, No. RS-2020-II201373) and the National Research Foundation of Korea (NRF) grant (No. RS-2025-00520618), funded by the Korean Government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MIT-States dataset can be downloaded from https://web.mit.edu/phillipi/Public/states_and_transformations/ (accessed on 3 March 2024). The UT-Zappos dataset can be downloaded from https://vision.cs.utexas.edu/projects/finegrained/utzap50k/ (accessed on 3 March 2024). The C-GQA dataset can be downloaded from https://s3.mlcloud.uni-tuebingen.de/czsl/cgqa-updated.zip (accessed on 3 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rosch, E.; Mervis, C.B.; Gray, W.D.; Johnson, D.M.; Boyes-Braem, P. Basic objects in natural categories. Cogn. Psychol. 1976, 8, 382–439. [Google Scholar] [CrossRef]
  2. Rosch, E. Principles of categorization. In Cognition and Categorization/Erlbaum; Routledge: Abingdon, UK, 1978. [Google Scholar]
  3. Misra, I.; Gupta, A.; Hebert, M. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1792–1801. [Google Scholar]
  4. Nagarajan, T.; Grauman, K. Attributes as operators: Factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 169–185. [Google Scholar]
  5. Purushwalkam, S.; Nickel, M.; Gupta, A.; Ranzato, M. Task-driven modular networks for zero-shot compositional learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3593–3602. [Google Scholar]
  6. Li, Y.L.; Xu, Y.; Mao, X.; Lu, C. Symmetry and group in attribute-object compositions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11316–11325. [Google Scholar]
  7. Naeem, M.F.; Xian, Y.; Tombari, F.; Akata, Z. Learning graph embeddings for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 953–962. [Google Scholar]
  8. Mancini, M.; Naeem, M.F.; Xian, Y.; Akata, Z. Learning graph embeddings for open world compositional zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 1545–1560. [Google Scholar] [CrossRef] [PubMed]
  9. Karthik, S.; Mancini, M.; Akata, Z. Revisiting visual product for compositional zero-shot learning. In Proceedings of the NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, Virtual, 6–14 December 2021. [Google Scholar]
  10. Kim, S.; Lee, S.; Choi, Y.S. Focusing on valid search space in Open-World Compositional Zero-Shot Learning by leveraging misleading answers. IEEE Access 2024, 12, 165822–165830. [Google Scholar] [CrossRef]
  11. Anwaar, M.U.; Pan, Z.; Kleinsteuber, M. On leveraging variational graph embeddings for open world compositional zero-shot learning. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 4645–4654. [Google Scholar]
  12. Karthik, S.; Mancini, M.; Akata, Z. Kg-sp: Knowledge guided simple primitives for open world compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9336–9345. [Google Scholar]
  13. Li, X.; Yang, X.; Wei, K.; Deng, C.; Yang, M. Siamese contrastive embedding network for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9326–9335. [Google Scholar]
  14. Huang, S.; Gong, B.; Feng, Y.; Zhang, M.; Lv, Y.; Wang, D. Troika: Multi-path cross-modal traction for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24005–24014. [Google Scholar]
  15. Li, Y.; Liu, Z.; Chen, H.; Yao, L. Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17037–17046. [Google Scholar]
  16. Isola, P.; Lim, J.J.; Adelson, E.H. Discovering states and transformations in image collections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1383–1391. [Google Scholar]
  17. Yu, A.; Grauman, K. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 192–199. [Google Scholar]
  18. Hao, S.; Han, K.; Wong, K.Y.K. Learning attention as disentangler for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15315–15324. [Google Scholar]
  19. Mancini, M.; Naeem, M.F.; Xian, Y.; Akata, Z. Open world compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5222–5230. [Google Scholar]
  20. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning PMLR, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  21. Nayak, N.V.; Yu, P.; Bach, S.H. Learning to compose soft prompts for compositional zero-shot learning. arXiv 2022, arXiv:2204.03574. [Google Scholar]
  22. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
  23. Von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  24. Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6700–6709. [Google Scholar]
  25. Zhang, T.; Liang, K.; Du, R.; Sun, X.; Ma, Z.; Guo, J. Learning invariant visual representations for compositional zero-shot learning. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 339–355. [Google Scholar]
  26. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
  27. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  28. Xu, G.; Kordjamshidi, P.; Chai, J. Prompting large pre-trained vision-language models for compositional concept learning. arXiv 2022, arXiv:2211.05077. [Google Scholar]
  29. Wang, H.; Yang, M.; Wei, K.; Deng, C. Hierarchical prompt learning for compositional zero-shot recognition. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 1470–1478. [Google Scholar]
  30. Xu, G.; Chai, J.; Kordjamshidi, P. GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5774–5783. [Google Scholar]
  31. Lu, X.; Guo, S.; Liu, Z.; Guo, J. Decomposed soft prompt guided fusion enhancing for compositional zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 23–24 June 2023; pp. 23560–23569. [Google Scholar]
  32. Bao, W.; Chen, L.; Huang, H.; Kong, Y. Prompting language-informed distribution for compositional zero-shot learning. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025; pp. 107–123. [Google Scholar]
  33. Saini, N.; Pham, K.; Shrivastava, A. Disentangling visual embeddings for attributes and objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13658–13667. [Google Scholar]
Figure 1. Comparison of label feature spaces between conventional multi-label and multi-class strategies and the proposed CoFAD method.
Figure 2. The overall flow of the proposed CoFAD method. Both the visual and text encoders are based on a CLIP-based model. Training pairs are represented as connections between labels, forming a graph structure. Fuzzy spectral clustering is then applied to derive superordinate concept features (Section 4.1). Labels are adjusted to align with superordinate concept features based on their memberships to these concepts, thereby expanding the feature space (Section 4.3). Based on cosine similarity, the label features are trained on potential unseen pair features derived from their respective superordinate concepts (Section 4.4).
Figure 3. Comparison of the efficiency of our model and baseline methods on the C-GQA dataset. The numbers above each data point indicate the HM performance on C-GQA dataset as reported in Table 2.
Figure 4. Qualitative results. We present the top-3 predictions for cases where objects share the same attribute but belong to different superordinate concepts, and where attributes share the same object but belong to different superordinate concepts, comparing the results with and without applying CoFAD. Red and blue indicate the ground truth attributes and objects, respectively.
Table 1. Data splits for the three benchmark datasets used in this work. $|A|$ and $|O|$ represent the number of attributes and objects, respectively. $|Y^S|$ represents the number of seen compositions, and $|Y^U|$ represents the number of unseen compositions. $|X|$ denotes the number of image samples used in each split.
Dataset | $|A|$ | $|O|$ | $|A| \times |O|$ | Train $|Y^S|$ | Train $|X|$ | Val $|Y^S|$ | Val $|Y^U|$ | Val $|X|$ | Test $|Y^S|$ | Test $|Y^U|$ | Test $|X|$
UT-Zappos | 16 | 12 | 192 | 83 | 23 k | 15 | 15 | 3 k | 18 | 18 | 3 k
MIT-States | 115 | 245 | 28,175 | 1262 | 30 k | 300 | 300 | 10 k | 400 | 400 | 13 k
C-GQA | 413 | 674 | 278,362 | 5592 | 27 k | 1040 | 1252 | 7 k | 888 | 923 | 5 k
Table 2. OW-CZSL results on three benchmark datasets. The performance of baseline models is reported from their respective papers. Bold scores indicate the best results, and underlined scores indicate the second-best results.
Models | MIT-States (S / U / HM / AUC) | UT-Zappos (S / U / HM / AUC) | C-GQA (S / U / HM / AUC)
w/o CLIP:
SymNet [6] | 21.4 / 7.0 / 5.8 / 0.8 | 53.3 / 44.6 / 34.5 / 18.5 | 26.7 / 2.2 / 3.3 / 0.43
CGE [7] | 32.4 / 5.1 / 6.0 / 1.0 | 61.7 / 47.7 / 39.0 / 23.1 | 32.7 / 1.8 / 2.9 / 0.47
CompCos [19] | 25.4 / 10 / 8.9 / 1.6 | 59.3 / 46.8 / 36.9 / 21.3 | 28.4 / 1.8 / 2.8 / 0.39
VisProd++ [9] | 28.1 / 7.5 / 7.3 / 1.2 | 62.5 / 51.5 / 41.8 / 26.5 | 28.0 / 2.8 / 4.5 / 0.75
KG-SP [12] | 28.4 / 7.5 / 7.4 / 1.3 | 61.8 / 52.1 / 42.3 / 26.5 | 31.5 / 2.9 / 4.7 / 0.78
Co-CGE [8] | 30.3 / 11.2 / 10.7 / 2.3 | 61.2 / 45.8 / 40.8 / 23.3 | 32.1 / 3.0 / 4.8 / 0.78
CLIP [20] | 30.1 / 14.3 / 12.8 / 3.0 | 15.7 / 20.6 / 11.2 / 2.2 | 7.5 / 4.6 / 4.0 / 0.3
CoOp [27] | 34.6 / 9.3 / 12.3 / 2.8 | 52.1 / 31.5 / 28.9 / 13.2 | 21.0 / 4.6 / 5.5 / 0.7
PromptVL [28] | 48.5 / 16.0 / 17.7 / 6.1 | 64.6 / 44.0 / 37.1 / 21.6 | – / – / – / –
CSP [21] | 46.3 / 15.7 / 17.4 / 5.7 | 64.1 / 44.1 / 38.9 / 22.7 | 28.7 / 5.2 / 6.9 / 1.2
HPL [29] | 46.4 / 18.9 / 19.8 / 6.9 | 63.4 / 48.1 / 40.2 / 24.6 | 30.1 / 5.8 / 7.5 / 1.4
GIPCOL [30] | 48.5 / 16.0 / 17.9 / 6.3 | 65.0 / 45.0 / 40.1 / 23.5 | 31.6 / 5.5 / 7.3 / 1.3
DFSP(i2t) [31] | 47.2 / 18.2 / 19.1 / 6.7 | 64.3 / 53.8 / 41.2 / 26.4 | 35.6 / 6.5 / 9.0 / 2.0
DFSP(BiF) [31] | 47.1 / 18.1 / 19.2 / 6.7 | 63.5 / 57.2 / 42.7 / 27.6 | 36.4 / 7.6 / 10.6 / 2.4
DFSP(t2i) [31] | 47.5 / 18.5 / 19.3 / 6.8 | 66.8 / 60.0 / 44.0 / 30.3 | 38.3 / 7.2 / 10.4 / 2.4
PLID [32] | 49.1 / 18.7 / 20.0 / 7.3 | 67.6 / 55.5 / 46.6 / 30.8 | 39.1 / 7.5 / 10.6 / 2.5
Troika [14] | 48.8 / 18.7 / 20.1 / 7.2 | 66.4 / 61.2 / 47.8 / 33.0 | 40.8 / 7.9 / 10.9 / 2.7
CDS-CZSL [15] | 49.4 / 21.8 / 22.1 / 8.5 | 64.7 / 61.3 / 48.2 / 32.3 | 37.6 / 8.2 / 11.6 / 2.7
CoFAD (ours) | 45.5 / 21.6 / 20.2 / 7.3 | 67.4 / 59.7 / 50.1 / 34.0 | 44.6 / 9.1 / 12.5 / 3.4
Table 3. CW-CZSL results on two benchmark datasets. The performance of baseline models is reported from their respective papers. Bold scores indicate the best results.
Models | UT-Zappos (S / U / HM / AUC) | C-GQA (S / U / HM / AUC)
CSP | 64.2 / 66.2 / 46.6 / 33.0 | 28.8 / 26.8 / 20.5 / 6.2
HPL | 63.0 / 68.8 / 48.2 / 35.0 | 30.8 / 28.4 / 22.4 / 7.2
GIPCOL | 65.0 / 68.5 / 48.8 / 36.2 | 31.9 / 28.4 / 22.5 / 7.1
DFSP(t2i) | 66.7 / 71.7 / 47.2 / 36.0 | 38.2 / 32.0 / 27.1 / 10.5
PLID | 67.3 / 68.8 / 52.4 / 38.7 | 38.8 / 33.0 / 27.9 / 11.0
Troika | 66.8 / 73.8 / 54.6 / 41.7 | 41.0 / 35.7 / 29.4 / 12.4
CDS-CZSL | 63.9 / 74.8 / 52.7 / 39.5 | 38.3 / 34.2 / 28.1 / 11.1
CoFAD | 66.3 / 72.7 / 54.2 / 40.7 | 45.4 / 29.2 / 28.0 / 11.5
CoFAD (masking) | 67.1 / 71.6 / 54.2 / 41.1 | 44.6 / 29.5 / 28.6 / 11.5
CoFAD (discard) | 30.3 / 34.3 / 24.3 / 8.6 | 32.0 / 13.3 / 13.7 / 3.3
CoFAD (w/o Fuzzy Spectral Clustering) | 65.1 / 67.3 / 48.9 / 35.8 | 37.7 / 27.3 / 25.3 / 9.2