In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed SOP framework. We first outline the experimental setup, followed by a quantitative performance comparison with state-of-the-art methods. Subsequently, we perform detailed ablation studies to validate the contribution of each module, specifically focusing on Selective Focus Restoration (SFR), Orthogonal Subspace Projection (OSP), and Geometric Composition and Context Preservation (GCCP). Finally, we provide qualitative analyses to visualize how SOP addresses focus ambiguity and semantic erosion.
4.1. Experimental Settings
Datasets. Following previous works [10, 37, 45], we evaluate SOP on three widely used benchmarks: FashionIQ, Shoes, and CIRR. Notably, the fashion domain serves as a critical analytical benchmark for fine-grained vision–language tasks due to its structured attribute definitions [46]. While FashionIQ and Shoes provide rigorous testbeds for fine-grained attribute decoupling, they are domain-specific and limited in semantic diversity compared to open-world concepts. To mitigate this bias, we include CIRR to verify the model’s generalization capabilities. However, we acknowledge that the current evaluation primarily focuses on object-centric retrieval, and performance on abstract or scene-level retrieval (e.g., complex event reasoning) remains an area for future exploration.
FashionIQ [47]: This represents the baseline for explicit attribute manipulation (e.g., specific changes in color or material). To evaluate retrieval performance within the fashion domain, we employ this large-scale dataset containing 77,684 images. Following standard evaluation protocols, the dataset is segregated into three independent subsets: Dresses, Shirts, and Tops&Tees. The data partition includes approximately 18,000 triplets allocated for training, with an additional 6000 triplets reserved for validation purposes. The fashion domain serves as a crucial diagnostic benchmark for disentanglement. Unlike general open-domain scenes, fashion images contain highly structured attributes. This allows us to rigorously diagnose whether the model successfully decouples the “modification” from the “preservation”, rather than just learning superficial correlations.
Shoes [48]: This represents an intermediate level, testing relative changes (e.g., “shinier” and “higher”) and requiring finer-grained comparative reasoning. This benchmark is specifically curated for attribute-aware retrieval in the footwear domain, sourced from e-commerce platforms. It is characterized by rich, attribute-specific textual feedback that describes fine-grained visual modifications (e.g., changes in color or structural details). In terms of data distribution, it offers 10,000 training samples and 4658 testing samples, facilitating robust model optimization and assessment on dialogue-based queries.
CIRR [36]: This represents the most challenging level, targeting ambiguous real-world scenarios with complex, unconstrained semantic transitions. Distinct from domain-specific datasets, CIRR serves as an open-domain benchmark designed to test the model’s robustness in unconstrained, real-life scenarios. It emphasizes zero-shot generalization and the understanding of complex semantic transitions within scenes. The corpus consists of 36,000 relative caption pairs, which are split into a training set of 28,000 pairs and a validation set of 8000 pairs, challenging the model to interpret high-level visual transformations rather than simple attribute manipulation.
Dataset Characteristics: We clarify the specific evaluation role of each dataset as follows:
FashionIQ: This benchmark primarily tests the model’s ability to execute strict attribute replacement (e.g., changing “white” to “blue”) while maintaining the structural integrity of the reference object.
Shoes: This dataset evaluates the model’s sensitivity to relative feedback, requiring it to understand comparative adjectives (e.g., “more athletic”) rather than just categorical tags.
CIRR: This benchmark serves as a stress test for robustness against ambiguity, assessing whether the model can handle vague instructions and diverse visual distributions in unconstrained open domains.
Implementation Details. We summarize all key implementation details and hyper-parameters in Table 1. To ensure transparency, the core architectural implementation is available in our repository (https://github.com/Cheng-SDU/SOP/, accessed on 3 March 2026).
Evaluation. To ensure a fair comparison with existing state-of-the-art methods, we strictly adhere to the standard evaluation metrics widely adopted within the CIR community. We primarily quantify retrieval performance using Recall@k (R@k), which measures the proportion of queries whose ground-truth target appears among the top k retrieved candidates.
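As an illustration of the metric, the following is a minimal sketch of Recall@k, assuming a query-by-gallery similarity matrix and one ground-truth gallery index per query. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from the SOP repository.

```python
def recall_at_k(similarity, targets, ks=(1, 10, 50)):
    """Percentage of queries whose ground-truth target ranks within the top k.

    similarity: list of rows, one per query, holding a score per gallery image.
    targets:    ground-truth gallery index for each query.
    """
    ranks = []
    for scores, target in zip(similarity, targets):
        # Rank = number of candidates scored strictly higher than the target
        # (0 means the target is the top-1 result).
        ranks.append(sum(1 for s in scores if s > scores[target]))
    return {k: 100.0 * sum(r < k for r in ranks) / len(ranks) for k in ks}


# Toy example with hypothetical scores: 3 queries, 4 gallery images.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked first
    [0.2, 0.8, 0.5, 0.6],  # target 2 is ranked third
    [0.4, 0.7, 0.1, 0.3],  # target 0 is ranked second
]
targets = [0, 2, 0]
scores = recall_at_k(sim, targets, ks=(1, 2, 3))
```

In this toy case one of three queries is correct at k = 1, two at k = 2, and all three at k = 3. Ties are resolved in favor of the target here, which is a simplifying assumption.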
Fashion-Domain Evaluation: For the FashionIQ and Shoes datasets, we focus on the model’s performance in the top-ranked results. On the Shoes dataset, we report R@1, R@10, and R@50 along with their averages. For FashionIQ, we compute R@10 and R@50 separately for three subsets and report the average score across all categories to comprehensively reflect the model’s fine-grained attribute awareness.
Open-Domain Evaluation: For the CIRR dataset, the evaluation criteria are more stringent. Beyond the standard R@k (k ∈ {1, 5, 10, 50}) metrics, we additionally report the subset recall Rsubset@k (k ∈ {1, 2, 3}). This metric is computed only on curated subsets containing high-difficulty negative samples, providing a more precise reflection of the model’s ability to exclude irrelevant items. The final performance is determined by the average of R@5 and Rsubset@1.
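The subset metric and the final CIRR score can be sketched as follows. This is a minimal illustration that assumes each query comes with a list of gallery indices forming its curated subset (target included). The names `subset_recall_at_k` and `cirr_average` and the toy scores are hypothetical.

```python
def subset_recall_at_k(similarity, targets, subsets, ks=(1, 2, 3)):
    """Recall@k where each query is ranked only against its own subset."""
    ranks = []
    for scores, target, subset in zip(similarity, targets, subsets):
        # Only candidates inside the query's subset compete with the target.
        ranks.append(sum(1 for j in subset if scores[j] > scores[target]))
    return {k: 100.0 * sum(r < k for r in ranks) / len(ranks) for k in ks}


def cirr_average(recall_at_5, subset_recall_at_1):
    """Final CIRR score: mean of R@5 and Rsubset@1."""
    return (recall_at_5 + subset_recall_at_1) / 2.0


# Toy example with hypothetical scores: 2 queries, 6 gallery images.
sim = [
    [0.9, 0.8, 0.1, 0.7, 0.2, 0.3],
    [0.3, 0.6, 0.5, 0.2, 0.9, 0.4],
]
targets = [3, 2]
subsets = [[3, 4, 5], [1, 2, 3]]
subset_scores = subset_recall_at_k(sim, targets, subsets)

# Using the SOP values reported in Table 3 (R@5 = 80.81, Rsubset@1 = 77.04).
final_score = cirr_average(80.81, 77.04)
```

For the first toy query the target is only third in the full gallery but first within its subset, which shows why the two metrics can diverge. The final score evaluates to roughly 78.925, in line with the 78.93 reported for SOP in Table 3.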
4.2. Performance Comparison
We compare SOP against two categories of CIR methods: (1) conventional methods (e.g., TIRG [34], ARTEMIS [49], MGUR [45], and IUDC [11]) and (2) VLP-based methods (e.g., CLIP4CIR [37], SSN [33], BLIP4CIR [38], CoVR [50], CoVR-2 [51], and ENCODER [10]).
The quantitative results on FashionIQ, Shoes, and CIRR are summarized in Table 2 and Table 3. From the comparative results, we draw the following observations on the three benchmarks:
Results on FashionIQ. As shown in Table 2, SOP achieves the best average results on the FashionIQ dataset among the compared methods. Specifically, our approach achieves an average R@10 of 56.55% and R@50 of 78.98%. This improvement is particularly pronounced in the Dresses and Shirts categories. We attribute this success to the synergistic interaction between the OSP and GCCP modules. OSP mathematically decouples modification signals from invariant visual context, preventing loss of original structural details (i.e., mitigating semantic erosion). Meanwhile, the GCCP module enforces orthogonal projection consistency constraints, ensuring synthesized features strictly preserve garment identity attributes (e.g., sleeve length and collar structure) while maintaining high contextual fidelity even when color or texture changes occur.
Results on Shoes. As shown in Table 2, performance comparisons on the Shoes dataset further validate SOP’s robustness in domain-specific retrieval scenarios involving relative attribute feedback. SOP achieves 26.17% R@1 and 64.38% R@10, and its average recall (58.73%) exceeds that of the strongest baseline (MGUR, 50.63%) by 8.10 percentage points. Queries in this dataset frequently contain relative descriptions such as “darker”, “higher heels”, or “more athletic”, requiring the model to strictly distinguish between areas requiring modification and those needing preservation. The test results validate the effectiveness of our proposed constant context anchoring mechanism. By explicitly defining constant feature subspaces (e.g., shoe silhouette contours) and enforcing modifications only within the orthogonal complement space, SOP accurately captures subtle relative changes while minimizing background noise interference during retrieval.
Results on CIRR. Table 3 presents results on the challenging open-domain CIRR dataset, highlighting the core strengths of our framework. SOP achieves strong performance on the test set, with R@5 reaching 80.81% and Rsubset@1 reaching 77.04%. Notably, the CIRR dataset exhibits high visual variability and linguistic ambiguity (e.g., vague instructions like “change it to this” or “make it look like…”), characteristics that typically confound traditional VLP-based models. The experimental results validate the effectiveness of the SFR module. By integrating entropy-aware uncertainty modeling with structural distribution alignment techniques, SOP successfully leverages global priors to resolve focus ambiguity, enabling accurate target retrieval even when local visual cues are insufficient or misleading.
Table 2.
Performance comparison with state-of-the-art methods on the FashionIQ and Shoes datasets. We report Recall@K (K = 10 and 50) for three categories (Dresses, Shirts, and Tops&Tees) in FashionIQ and Recall@K (K = 1, 10, and 50) for Shoes. The best and second-best results are highlighted in bold and underlined, respectively. “-” indicates that the result is not available.
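| Method | FashionIQ Dresses R@10 | FashionIQ Dresses R@50 | FashionIQ Shirts R@10 | FashionIQ Shirts R@50 | FashionIQ Tops&Tees R@10 | FashionIQ Tops&Tees R@50 | FashionIQ Avg R@10 | FashionIQ Avg R@50 | Shoes R@1 | Shoes R@10 | Shoes R@50 | Shoes Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|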
| TIRG [34] (CVPR’19) | 14.87 | 34.66 | 18.26 | 37.89 | 19.08 | 39.62 | 17.40 | 37.39 | 12.60 | 45.45 | 69.39 | 42.48 |
| LF-CLIP [32] (CVPR’22) | 31.63 | 56.67 | 36.36 | 58.00 | 38.19 | 62.42 | 35.39 | 59.03 | - | - | - | - |
| ARTEMIS [49] (ICLR’22) | 27.16 | 52.40 | 21.78 | 43.64 | 29.20 | 54.83 | 26.05 | 50.29 | 18.72 | 53.11 | 79.31 | 50.38 |
| CLIP4CIR [37] (TOMM’23) | 33.81 | 59.40 | 39.99 | 60.45 | 41.41 | 65.37 | 38.40 | 61.74 | - | - | - | - |
| MGUR [45] (ICLR’24) | 32.61 | 61.34 | 33.23 | 62.55 | 41.40 | 72.51 | 35.75 | 65.47 | 18.41 | 53.63 | 79.84 | 50.63 |
| SSN [33] (AAAI’24) | 34.36 | 60.78 | 38.13 | 61.83 | 44.26 | 69.05 | 38.92 | 63.89 | - | - | - | - |
| BLIP4CIR [38] (WACV’24) | 40.65 | 66.34 | 40.38 | 64.13 | 46.86 | 69.91 | 42.63 | 66.79 | - | - | - | - |
| CoVR [50] (AAAI’24) | 44.55 | 69.03 | 48.43 | 67.42 | 52.60 | 74.31 | 48.53 | 70.25 | - | - | - | - |
| CoVR-2 [51] (TPAMI’24) | 46.53 | 69.60 | 51.23 | 70.64 | 52.14 | 73.27 | 49.97 | 71.17 | - | - | - | - |
| IUDC [11] (TOIS’25) | 35.22 | 61.90 | 41.86 | 63.52 | 42.19 | 69.23 | 39.76 | 64.88 | - | - | - | - |
| ENCODER [10] (AAAI’25) | 51.51 | 76.95 | 54.86 | 74.93 | 62.01 | 80.88 | 56.13 | 77.59 | - | - | - | - |
| SOP (Ours) | 52.17 | 79.34 | 55.63 | 76.29 | 61.84 | 81.32 | 56.55 | 78.98 | 26.17 | 64.38 | 85.65 | 58.73 |
Table 3.
Quantitative results on the CIRR test set. The evaluation metrics include standard Recall@K (K = 1, 5, 10, and 50) and subset-based Rsubset@K (K = 1, 2, and 3). The last column shows the average of R@5 and Rsubset@1. The best performance is marked in bold, and the second best is underlined.
| Method | R@1 | R@5 | R@10 | R@50 | Rsubset@1 | Rsubset@2 | Rsubset@3 | Avg |
|---|---|---|---|---|---|---|---|---|
| TIRG [34] (CVPR’19) | 14.61 | 48.37 | 64.08 | 90.03 | 22.67 | 44.97 | 65.14 | 35.52 |
| ARTEMIS [49] (ICLR’22) | 16.96 | 46.10 | 61.31 | 87.73 | 39.99 | 62.20 | 75.67 | 43.05 |
| LF-CLIP [32] (CVPR’22) | 33.59 | 65.35 | 77.35 | 95.21 | 62.39 | 81.81 | 92.02 | 63.87 |
| BLIP4CIR [38] (WACV’24) | 40.17 | 71.81 | 83.18 | 95.69 | 72.34 | 88.70 | 95.23 | 72.08 |
| SSN [33] (AAAI’24) | 43.91 | 77.25 | 86.48 | 97.45 | 71.76 | 88.63 | 95.54 | 74.51 |
| CoVR [50] (AAAI’24) | 49.69 | 78.60 | 86.77 | 94.31 | 75.01 | 88.12 | 93.16 | 76.81 |
| CoVR-2 [51] (TPAMI’24) | 50.43 | 81.08 | 88.89 | 98.05 | 76.75 | 90.34 | 95.78 | 78.92 |
| ENCODER [10] (AAAI’25) | 46.10 | 77.98 | 87.16 | 97.64 | 76.92 | 90.41 | 95.95 | 77.45 |
| SOP (Ours) | 50.72 | 80.81 | 88.59 | 98.07 | 77.04 | 90.63 | 96.18 | 78.93 |
4.3. Ablation Study
To systematically analyze the contribution of each component, we conduct ablation studies on the CIRR dataset. We classify the variants into three groups: G1 for SFR, G2 for OSP, and G3 for GCCP.
G1: Ablation on Selective Focus Restoration (SFR). We first investigate the efficacy of the SFR module in handling focus ambiguity and distribution alignment. The variants are defined as follows:
D#(1) w/o SFR: We remove the entire SFR module and directly add the reference image features to the modification text features to form the composed features.
D#(2) w/o Uncertainty Weighting: The SFR structure is retained; however, the entropy-driven adaptive weight in Equation (4) is replaced with a fixed scalar (set to 0.5) to fuse global priors.
D#(3) w/o Structural Alignment: The uncertainty estimation is preserved, but the structural consistency regularization (KL divergence) is removed, relying solely on retrieval loss for training.
G2: Ablation on Orthogonal Subspace Projection (OSP). This group assesses the impact of signal decoupling in terms of preventing semantic erosion.
D#(4) w/o OSP: The orthogonal projection step is discarded, and the non-decoupled query features are directly input into the subsequent composition module.
D#(5) w/o Orthogonality: Simple vector subtraction is employed instead of strict Gram–Schmidt orthogonalization to extract the modification vector.
G3: Ablation on Geometric Composition (GCCP). Finally, we evaluate the contribution of the GCCP module in preserving contextual fidelity.
Analysis of G1: As shown in Table 4, D#(1) exhibits the most significant performance degradation, underscoring the inefficacy of simple feature addition under focus ambiguity. When modification texts are vague (e.g., containing only pronouns), directly added features often fall into low-density regions of the feature space, lacking explicit semantic direction. By introducing global distribution priors, SFR effectively provides a semantic anchor for ambiguous queries, realigning them with a reasonable target distribution. Comparing D#(2) with the full model reveals that fixed weights cannot adapt to the varying degrees of ambiguity in the CIRR dataset. For explicit instructions, excessive prior intervention introduces noise; for ambiguous ones, fixed weights are insufficient to correct deviations. The adaptive weight functions as a confidence gate, activating prior guidance only when the model confronts instruction uncertainty, which is critical for the robustness of SOP. For D#(3), relying solely on a point-to-point retrieval loss tends to lead to local optima, whereas the structural alignment term serves as a regularizer that forces the generated feature distribution to remain consistent with the topological structure of the real image distribution, thereby improving retrieval.
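The confidence-gate behavior described for the adaptive weight can be illustrated with a small sketch. Equation (4) is not reproduced in this section, so the form below is an assumption: the weight is taken to be the normalized entropy of a softmax distribution over some query logits, and the fusion is a convex combination of the query feature and a global prior. All names (`softmax`, `normalized_entropy`, `fuse_with_prior`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def fuse_with_prior(query, prior, logits):
    """ASSUMED gate: high entropy (uncertain instruction) -> lean on the prior."""
    w = normalized_entropy(softmax(logits))
    fused = [(1.0 - w) * q + w * p for q, p in zip(query, prior)]
    return fused, w


query = [1.0, 0.0]
prior = [0.0, 1.0]

# Ambiguous instruction: flat logits give maximal entropy, so the prior dominates.
fused_vague, w_vague = fuse_with_prior(query, prior, [0.0, 0.0, 0.0])

# Explicit instruction: peaked logits give near-zero entropy, so the query is kept.
fused_clear, w_clear = fuse_with_prior(query, prior, [100.0, 0.0, 0.0])
```

Replacing `w` with the constant 0.5 would correspond to the fixed-scalar variant D#(2) under this assumed form.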
Analysis of G2: The experimental results indicate a substantial decline for D#(4) in fine-grained metrics (particularly FashionIQ R@10). This suggests that feature entanglement is the primary cause of semantic erosion. In a non-decoupled state, modification signals are highly coupled with background signals, where strengthening the former often suppresses the latter. Notably, the unsatisfactory performance of D#(5) suggests that simple algebraic subtraction cannot effectively strip semantic correlations, because the difference vector generally retains a component along the reference image features. The Gram–Schmidt orthogonalization employed by SOP constructs a complement space strictly orthogonal to the reference image features from a geometric perspective. This strict orthogonality ensures that the extracted modification vector serves as a pure orthogonal modification increment, effectively mitigating interference with the invariant context.
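The difference between the two extraction strategies can be seen in a minimal sketch, assuming the invariant context is represented by a single reference feature vector and the increment is obtained with one Gram–Schmidt step. The function names and toy vectors are hypothetical and are not taken from the SOP code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_increment(query, reference):
    """One Gram-Schmidt step: remove from `query` its component along `reference`."""
    coeff = dot(query, reference) / dot(reference, reference)
    return [q - coeff * r for q, r in zip(query, reference)]


def subtraction_increment(query, reference):
    """Plain vector subtraction, as in the D#(5) variant."""
    return [q - r for q, r in zip(query, reference)]


reference = [1.0, 1.0, 0.0]  # hypothetical reference image feature
query = [3.0, 1.0, 2.0]      # hypothetical fused query feature

delta_orth = orthogonal_increment(query, reference)
delta_sub = subtraction_increment(query, reference)

# Residual overlap with the reference direction for each strategy.
overlap_orth = dot(delta_orth, reference)
overlap_sub = dot(delta_sub, reference)
```

Here the Gram–Schmidt increment has zero inner product with the reference feature, while the subtraction result still overlaps with it, which is the leakage that the analysis above attributes to D#(5).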
Analysis of G3: The performance decline of D#(6) indicates that the feature composition process requires dynamic adjustment based on instruction reliability. For substantial modifications (high uncertainty), the model relies more on the generated modification vector; for minor adjustments (low uncertainty), it should preserve more of the original features. The adaptive gating controls this balance, preventing over-modification or under-modification. Regarding D#(7), a significant impact is observed on high-precision metrics such as R@1. The orthogonal projection consistency loss essentially introduces a constraint requiring the projection of synthesized features onto the invariant subspace to strictly equal the original visual context. This geometric constraint establishes a final defense against semantic drift, ensuring that invariant features to be preserved are not accidentally lost while altering target attributes.
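The projection consistency constraint can be sketched under simplifying assumptions: the invariant subspace is taken to be the one-dimensional span of the reference feature, the composed feature is the reference plus a gated increment, and the penalty is a squared error. These choices, and the names `compose` and `projection_consistency_loss`, are illustrative assumptions rather than the paper's exact formulation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def compose(reference, increment, gate):
    """ASSUMED composition: reference plus a gated modification increment."""
    return [r + gate * d for r, d in zip(reference, increment)]


def projection_consistency_loss(composed, reference):
    """Squared error between the projection of `composed` onto the span of
    `reference` and `reference` itself (1-D invariant subspace assumed)."""
    coeff = dot(composed, reference) / dot(reference, reference)
    projection = [coeff * r for r in reference]
    return sum((p - r) ** 2 for p, r in zip(projection, reference))


reference = [1.0, 1.0, 0.0]

# Increment orthogonal to the reference: the invariant context is untouched.
loss_orth = projection_consistency_loss(
    compose(reference, [1.0, -1.0, 2.0], gate=0.5), reference
)

# Increment that overlaps with the reference: the context drifts and is penalized.
loss_leaky = projection_consistency_loss(
    compose(reference, [2.0, 0.0, 2.0], gate=0.5), reference
)
```

Under these assumptions, an orthogonal increment yields zero loss for any gate value, whereas an increment with a component along the reference direction produces a positive penalty.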
Additionally, to clarify that the performance gains stem from our SOP design rather than solely from the BLIP-2 backbone or freezing strategies, we performed three sets of additional studies, as follows:
(1) Robustness across Different VLP Backbones: We migrated the SOP framework to the CLIP (ViT-B/32) backbone, a widely used baseline in previous works (e.g., CLIP4CIR [37]), to verify its universality. We replaced the BLIP-2 visual/text encoders with CLIP’s encoders and applied the SOP modules (SFR, OSP, and GCCP) on top of the CLIP features. As shown in Table 5, SOP provides significant improvements (4.5% on average) regardless of whether the backbone is CLIP or BLIP-2. This indicates that SOP serves as a “plug-and-play” geometric regularization module that is effective independently of the underlying VLP architecture.
(2) Analysis of Freezing Strategies and Training Efficiency: To verify whether the performance gain is due to the backbone or our design, and to evaluate the trade-off between efficiency and effectiveness, we compared three training strategies on the FashionIQ dataset: (i) Full Fine-tuning: unfreezing the entire ViT-G/14 visual encoder; (ii) LoRA (low-rank adaptation): applying LoRA to the visual encoder; and (iii) SOP (ours; frozen): keeping the visual encoder frozen and training only the SOP modules (SFR, OSP, and GCCP). As shown in Table 6, our frozen strategy (SOP) achieves a superior balance. Compared to Full Fine-tuning, SOP reduces trainable parameters by 98.5% (1.2B vs. 18M) and training time per epoch by about 80% (52 min vs. 10 min). Surprisingly, SOP (56.55%) performs comparably to LoRA fine-tuning, and in some metrics, SOP even surpasses Full Fine-tuning (57.10%). This suggests that blindly unfreezing the massive backbone on smaller CIR datasets can lead to overfitting, whereas our SOP design acts as an effective geometric regularizer, extracting robust features without the heavy cost of full re-training.
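The efficiency figures quoted above can be checked with a short calculation. This is only an arithmetic sketch using the numbers stated in the text (1.2B vs. 18M trainable parameters, 52 min vs. 10 min per epoch); the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, reduced):
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100.0 * (1.0 - reduced / baseline)


# Trainable parameters: Full Fine-tuning (1.2B) vs. frozen SOP (18M).
param_cut = relative_reduction(1.2e9, 18e6)

# Training time per epoch: 52 min vs. 10 min.
time_cut = relative_reduction(52.0, 10.0)
```

The parameter reduction evaluates to 98.5% and the time reduction to roughly 80.8%, consistent with the figures quoted above.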
(3) Robustness under “Truly Open-Domain” Retrieval (CIRR). The CIRR dataset serves as our primary “open-domain” benchmark due to its unconstrained, real-life images and complex, non-template instructions. To further prove robustness, we evaluated the Generalization Gap by testing our model on unseen concepts within CIRR (using the standard split which ensures disjoint reference images). We also compared the inference latency to ensure it is viable for open-domain retrieval libraries.
As shown in Table 7, SOP achieves a significant boost on CIRR (+4.42% in R@5) compared to the strong BLIP-2 baseline. Crucially, the improvement on the open domain (CIRR) is consistent with the domain-specific (FashionIQ) results, suggesting that SOP’s Orthogonal Subspace Projection (OSP) is a generic geometric solution that handles the high semantic variance of open domains effectively, rather than just memorizing specific attributes. For large-scale retrieval with similarity search libraries such as FAISS, query encoding speed is vital. SOP adds negligible latency (+2 ms) compared to the backbone, making it highly scalable for large-scale retrieval.
In summary, the gains are derived principally from the SOP architecture’s ability to geometrically decouple features, enabling high efficiency (frozen encoder) and strong robustness across both specific (FashionIQ) and open (CIRR) domains.
4.5. Qualitative Experiments
Case Study. To provide an intuitive understanding of how SOP mitigates semantic erosion and preserves visual context, we present qualitative comparisons between the full SOP model and the variant without Orthogonal Subspace Projection (w/o OSP) in Figure 4.
The top row of Figure 4 illustrates a typical case from the FashionIQ dataset. The user instruction is “replace white with blue”, which requires the model to alter the color attribute while strictly preserving the fine-grained structural details. As observed, the variant w/o OSP fails to maintain structural fidelity. Although it successfully retrieves blue dresses, the Top-1 result is a wrap-style dress that completely loses the original ruffled design. This indicates that, without orthogonal decoupling, the strong semantic signal of “blue” erodes the invariant structural features of the reference image. In contrast, SOP effectively disentangles the color modification from the structural context. Its Top-1 retrieval is the exact target image, which adopts the requested blue color while retaining the complex design elements of the reference dress. This supports the claim that OSP locks the invariant visual context in a protected subspace.
The bottom row demonstrates a more challenging scenario from the CIRR dataset, where the instruction “Get close to small dog…” implies a significant change in viewpoint and scale. In this case, the “modification” is the camera movement, while the “invariant context” is the identity of the dog itself. The w/o OSP variant retrieves a dog with similar colors but fails to preserve the specific identity features (e.g., the specific ear shape and facial expression), ranking the ground truth at Top-5. This suggests that the features representing the viewpoint transformation became entangled with the object identity features, causing a drift in the retrieved subject. Conversely, SOP accurately retrieves the ground truth at Top-1. By projecting the “zoom-in” signal onto the orthogonal complement of the identity features, SOP ensures that the geometric transformation does not compromise the discriminative features of the specific dog instance.
Attention Visualization. To empirically validate the geometric interpretability of the OSP module, we visualize the spatial response maps of the decoupled features in Figure 5.
Specifically, we employ Grad-CAM [52] to compute the feature sensitivity for the two orthogonal components: the orthogonal modification increment and the invariant feature.
As shown in row (a), the instruction “…red electric locomotive - now silver” requires changing the color attribute.
Orthogonal Modification Increment: The map (left) exhibits high activation strictly concentrated on the locomotive body, acting as a precise “editing vector” for the target attribute.
Invariant Feature: Crucially, the invariant map (right) highlights the background structures (e.g., railway tracks and wheels), indicating that the invariant feature anchors the retrieval on stable visual cues.
This disentanglement generalizes to additive queries as well. As observed in row (b) (“Add a second lemon slice”), the modification increment focuses on the semantic region of the food and existing lemon to guide the addition, whereas the invariant feature locks onto specific textural details and background to maintain object identity. The consistent spatial separation across different query types provides strong visual evidence that SOP effectively disentangles semantic modification from visual context.