SOP: Selective Orthogonal Projection for Composed Image Retrieval

Cheng, Su; Liu, Guoyang

doi:10.3390/s26051621


by Su Cheng ¹ and Guoyang Liu ¹,²,*

¹ School of Integrated Circuits, Shandong University, Jinan 250101, China
² Shenzhen Research Institute of Shandong University, Shenzhen 518000, China
* Author to whom correspondence should be addressed.

Sensors 2026, 26(5), 1621; https://doi.org/10.3390/s26051621

Submission received: 25 January 2026 / Revised: 2 March 2026 / Accepted: 3 March 2026 / Published: 4 March 2026

(This article belongs to the Section Intelligent Sensors)

Abstract

The proliferation of intelligent sensor networks in urban surveillance and remote sensing has triggered the explosive growth of unstructured visual sensor data. Accurately retrieving targets from these massive streams based on complex cross-modal user intents remains a critical bottleneck for efficient intelligent perception. Composed Image Retrieval (CIR) addresses this by enabling retrieval via a multi-modal query that combines a reference image with semantic control signals. However, existing methods often struggle with abstract instructions in real-world scenarios. Consequently, models often suffer from feature distribution shifts due to focus ambiguity, as well as semantic erosion caused by highly entangled visual and textual features. To address these challenges, we propose a geometry-based Selective Orthogonal Projection Network (SOP). First, the Selective Focus Restoration module quantifies instruction uncertainty via information entropy and calibrates shifted query features to the true target distribution using structural consistency regularization. Second, to ensure data fidelity, we introduce Orthogonal Subspace Projection and Geometric Composition Fidelity. These mechanisms employ Gram–Schmidt orthogonalization to decouple features into a constant visual base and an orthogonal modification increment, restricting semantic modifications to the null space. Extensive experiments on FashionIQ, Shoes, and CIRR datasets demonstrate that SOP significantly outperforms SOTA methods, offering a novel solution for efficient large-scale sensor data retrieval and analysis.

Keywords:

composed image retrieval; cross-modal information retrieval; semantic alignment

1. Introduction

With the rapid advancement of intelligent sensor networks and multimedia surveillance technologies [1,2,3,4], the volume of visual sensor data, such as surveillance video frames and remote sensing images, has grown exponentially. Consequently, accurately retrieving target data from complex streams based on intricate cross-modal intents has become a core challenge in multi-source data fusion and perception. To address this, Composed Image Retrieval (CIR) has emerged [5,6,7,8,9,10,11,12,13]. As illustrated in Figure 1a, CIR enables users to formulate query intents by combining a reference image with a semantic control signal, specifically modification text. Unlike traditional text-based retrieval, CIR requires models to not only comprehend multi-modal inputs but also perform precise visual semantic reasoning within the feature space. This technology holds significant value in applications such as intelligent video surveillance [14,15,16], remote sensing imagery [17,18], and human–computer interactions (HCIs) [19,20].

Recently, breakthroughs in large-scale vision–language pre-training models, such as CLIP [21] and BLIP [22], have significantly advanced CIR performance. Mainstream methods [23,24,25,26,27,28,29,30,31] typically adopt an “Encode–Fuse–Retrieve” paradigm, employing pre-trained encoders for feature extraction and fusing visual–textual information via attention mechanisms or simple Multi-Layer Perceptrons (MLP). While these approaches excel at handling explicit modifications, such as simple color changes, they often struggle with the highly abstract or complex instructions encountered in real-world scenarios. However, addressing these limitations remains non-trivial due to two primary challenges.

C1: Distribution Shift Induced by Focus Ambiguity. As illustrated in Figure 1b, in real-world CIR tasks, reference images often contain complex backgrounds or multiple entities, whereas user modification texts are typically concise and exhibit focus ambiguity, such as referring to specific objects using pronouns in multi-person scenes. When processing such uncertain instructions (e.g., the dilemma of “which one to focus” as shown in the figure), traditional feature fusion paradigms frequently fail to precisely localize modification regions, causing a distribution shift in the generated query features. Specifically, Figure 1b visualizes the divergence between the composed and target distributions. Without explicitly modeling instruction uncertainty, synthesized feature vectors often deviate from the target image’s feature distribution, disrupting intrinsic topological relationships within the feature space. Consequently, the generated query neither remains sufficiently close to the reference image nor aligns accurately with the target image. Thus, the primary challenge lies in quantifying instruction uncertainty amidst focus ambiguity and recalibrating generated query features to the true target feature distribution.

C2: Semantic Erosion Induced by Feature Entanglement. CIR mandates strict preservation of contextual consistency regarding irrelevant backgrounds and subject identities within the reference image while executing semantic modifications. However, as depicted in Figure 1c, visual and textual features often remain highly coupled within the deep feature space. The semantic direction of modification instructions (e.g., “Color”) frequently exhibits non-orthogonal overlap with the intrinsic attribute directions of the reference image (e.g., “Structure”). This feature entanglement causes unavoidable interference with the visual context that should be preserved during the injection of modification signals. This leads to a phenomenon we term semantic erosion, which refers to the accidental loss of invariant visual context (e.g., background structure) caused by the entanglement between modification text and visual features. Unlike existing soft attention or linear addition methods that cannot mathematically guarantee feature separation [32,33], addressing this erosion requires a strict geometric decoupling between modification text and visual features. As is shown in Figure 1c, existing methods relying on simple linear addition fail to effectively distinguish between these dimensions, leading to retrieval results that satisfy the color change but lose the original structural integrity (labeled as “no structure”). Consequently, the second challenge lies in achieving geometric decoupling between the modification increment and the invariant context within the feature space to ensure maximal visual context fidelity during semantic modification.

To address these challenges, we propose a novel Selective Orthogonal Projection Network (SOP). The SOP aims to reconstruct the feature fusion process of multi-modal sensor data from the perspectives of manifold geometry and subspace orthogonality. For Challenge C1, we design the Selective Focus Restoration (SFR) module. This module first utilizes Shannon entropy to explicitly quantify the cognitive uncertainty of instructions, using this as a basis to dynamically adjust optimization strategies. Subsequently, structural consistency regularization is introduced to force the generated query features to mimic the relative similarity distribution of real target images within the batch. This strategy calibrates features shifted due to ambiguity back into the target distribution, ensuring the local manifold topology of query features remains undistorted. For Challenge C2, we propose the Orthogonal Subspace Projection (OSP) and Geometric Composition and Context Preservation (GCCP) modules. Utilizing the Gram–Schmidt orthogonalization principle, we dynamically decompose features into a constant visual base and an orthogonal modification increment. Through an orthogonal projection consistency loss, we then constrain modification operations to occur exclusively within the null space of the constant background. This mathematically freezes irrelevant visual contexts, resolving the problem of semantic erosion.

In summary, the main contributions of this paper are outlined as follows:

We propose the SOP framework. To the best of our knowledge, it is the first approach in CIR tasks to utilize strict orthogonal geometric projection to address feature entanglement, moving beyond traditional soft fusion paradigms.
We design the SFR module, which actively mitigates distribution shifts resulting from focus ambiguity. Unlike passive probabilistic modeling, it introduces entropy-aware uncertainty and structural consistency constraints to actively calibrate feature topology.
We introduce OSP and GCCP mechanisms to realize the strict orthogonal decoupling of modification increments and invariant contexts. Extensive experiments on the FashionIQ, Shoes, and CIRR benchmark datasets demonstrate that SOP significantly outperforms state-of-the-art methods.

2. Related Work

2.1. Composed Image Retrieval

Composed Image Retrieval (CIR) is an emerging multi-modal retrieval paradigm designed to address the limitations of traditional text-to-image retrieval, which can only describe general semantics. In CIR tasks, users provide not only a reference image as visual context but also a modification text expressing their intent to alter that image. The model’s objective is to understand the visual content within the reference image and transform its features according to textual instructions, thereby retrieving target images that meet the specified requirements from a candidate database.

Based on the evolution of technical approaches, existing CIR methods can be broadly categorized into two phases. (1) Early Deep Learning Approaches: Prior to the emergence of large-scale pre-trained models, researchers primarily focused on designing specific network architectures to fuse visual and linguistic features. The seminal work TIRG [34] introduced a residual gating mechanism, treating text modifications as a “displacement vector” injected into the feature space of reference images via additive operations. Subsequently, VAL [35] and CIRPLANT [36] introduced attention mechanisms to enhance inter-modal interactions. However, since these methods were primarily trained from scratch on small-scale datasets, they lacked general semantic knowledge and often struggled with abstract modification instructions in open domains. (2) Visual–Language Pre-training (VLP)-based Approaches: In recent years, CIR performance has achieved a qualitative leap thanks to the powerful alignment capabilities acquired by models like CLIP and BLIP on massive image–text datasets. CLIP4CIR [37] stands as a landmark work in this field, designing a lightweight combiner network to fuse dual-tower features extracted by CLIP. BLIP4CIR [38] leverages BLIP’s generative capabilities, employing instruction tuning to prompt the model to generate representations more closely aligned with the target image.

Although VLP-based methods improve retrieval performance, most employ a “holistic fusion” strategy (such as concatenation or weighted summation). This approach overlooks the “entanglement” issue between multi-modal signals: strong modification signals often overwhelm background information that should be preserved in reference images, leading to “semantic erosion”. That is, retrieved images may match textual descriptions but lose critical identity features of the reference object. Recent advancements in diffusion models have also explored fine-grained conditional visual synthesis. For instance, ImagPose [39] proposes a unified framework for pose-guided person generation, effectively balancing reference identity preservation with pose modification. While generative approaches focus on pixel-level synthesis, Composed Image Retrieval (CIR) aims to perform these modifications within the feature space [8,40,41]. Drawing inspiration from this “compositional” philosophy, our proposed SOP introduces Orthogonal Subspace Projection. This technique forces geometric decoupling between “modification increments” and “invariant context” within the feature space. This ensures high semantic fidelity and prevents the distribution shift common in holistic fusion methods. Furthermore, our work draws inspiration from robust 3D object retrieval methods that can accurately match deformable shapes even when queries contain missing parts [42]. Guided by this philosophy of preserving topological invariants, our SOP framework utilizes geometric decoupling to strictly protect the invariant visual context of the reference image against semantic erosion.

2.2. Uncertainty Modeling in Retrieval

Standard retrieval models typically assume that users’ intentions are clear and unambiguous. However, in practical applications, CIR faces significant challenges due to “focus ambiguity”. For instance, users may employ vague pronouns or general descriptions, making it difficult for the model to pinpoint the specific area being modified. To address this inherent noise in data, probabilistic modeling has been introduced into retrieval tasks. The classic PCME [43] approach no longer represents images and text as deterministic point vectors but models them as probability distributions (e.g., Gaussian distributions). By leveraging the variance of these distributions to cover potential semantic ranges, it attempts to capture the polysemy inherent in ambiguous queries.

However, this general probabilistic approach adopts a “passive adaptation” strategy, primarily “tolerating” ambiguity by increasing the variance of the distributions. In CIR tasks, it fails to actively correct the semantic orientation of features. Instead, features generated by ambiguous queries often land in low-density regions of the feature space, causing a “distribution shift” in which features deviate from the distribution region of the true target image. Recent works have also explored uncertainty in CIR. For instance, Tang et al. [8] introduced probabilistic embeddings to model ambiguous queries as Gaussian distributions, using variance to tolerate semantic uncertainty; this likewise remains a passive strategy and tends to blur feature representations. In contrast, our SOP employs an “active calibration” strategy: we explicitly quantify uncertainty using entropy and realign shifted features to the target distribution. Furthermore, compared to traditional disentanglement methods that rely on linear subtraction or simple attention, our OSP module enforces strict geometric orthogonality by projecting the modification signal onto the subspace orthogonal to the invariant context. This ensures that the modification does not encroach upon the visual background, providing a mathematically guaranteed separation of attributes and higher fidelity than probabilistic blurring.

3. Methodology

As the core innovation of this paper, we propose the Selective Orthogonal Projection Network (SOP), which is designed to address the issues of focus ambiguity and semantic erosion in Composed Image Retrieval (CIR) from the perspectives of feature disentanglement and structural alignment. As illustrated in Figure 2, SOP consists of three tightly coupled modules: (a) Selective Focus Restoration (SFR), which explicitly quantifies instruction uncertainty and calibrates features shifted due to ambiguity to the target feature distribution via structural consistency regularization; (b) Orthogonal Subspace Projection (OSP), which decouples multi-modal features into geometrically orthogonal invariant visual context and orthogonal modification increment through orthogonal decomposition, effectively mitigating the interference of modification signals on the background; and (c) Geometric Composition and Context Preservation (GCCP), which strictly constrains the fidelity of the visual context via uncertainty-aware feature composition and orthogonal projection consistency, thereby preventing semantic drift during feature composition while realizing semantic modification. In this section, we first formulate the CIR task and then elaborate on the components of SOP.

3.1. Problem Formulation

In sensor network applications such as intelligent multimedia surveillance and remote sensing, Composed Image Retrieval (CIR) aims to retrieve specific target sensor data based on multi-modal inputs. Let $T = \{(x_r, x_m, x_t)_n\}_{n=1}^{N}$ denote a set containing $N$ triplets. Here, $x_r$ represents the reference image captured by visual sensors, $x_m$ represents the modification text used for modifying visual content (serving as a semantic control signal), and $x_t$ denotes the target image conforming to user intent.

From the perspective of multi-modal sensor data fusion, our core objective is to construct a robust metric space. In this space, the embedding representation derived from the fusion of the reference observation $x_r$ and the semantic control signal $x_m$ should closely approximate the ground-truth target sensor data $x_t$. This process is formally expressed as

$$\Phi(x_r, x_m) \to \Phi(x_t), \tag{1}$$

where $\Phi$ denotes the embedding function to be optimized, utilized for feature mapping and aligning heterogeneous sensor data.
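As an illustration of what Equation (1) asks for at retrieval time, the following minimal sketch ranks a small gallery of target embeddings by their similarity to a composed query embedding. It does not implement the embedding function itself: the function names (`cosine`, `rank_gallery`) and the toy vectors are hypothetical, and cosine similarity is assumed as the metric because it is the similarity used by the losses in Section 3.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (hypothetical): a composed query and three candidate targets.
composed_query = [0.9, 0.1, 0.4]
gallery = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5], [-1.0, 0.2, 0.0]]
ranking = rank_gallery(composed_query, gallery)
```

A well-trained embedding would place the true target at the head of `ranking`; here the second gallery item is the closest by construction.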

3.2. Selective Focus Restoration (SFR)

In CIR tasks, user modification instructions are often highly concise and suffer from focus ambiguity. When models process such vague instructions, traditional feature fusion is prone to producing an averaging effect, causing the generated latent features to undergo a distribution shift. This results in features that deviate from the reference image and fail to accurately align with the target image. To address this issue, the SFR module first utilizes the robust vision–language alignment capabilities of BLIP-2 to extract multi-modal features. Subsequently, it explicitly quantifies focus uncertainty and pulls the generated features closer to the target feature distribution via Structural Consistency Regularization, thereby achieving accurate focus calibration.

Multi-modal Feature Extraction. To bridge the gap between vision and language, we adopt the pre-trained BLIP-2 [44] model as the backbone for feature extraction. Specifically, given a reference image $x_r$, we first utilize the frozen visual encoder $\Phi_{\mathrm{visual}}$ to extract image embedding features and input them into the Q-Former to obtain corresponding reference visual tokens $F_r \in \mathbb{R}^{L \times D}$, where $L$ is the number of query tokens and $D$ is the feature dimension. Simultaneously, for the modification text $x_m$, we input it into the text encoder $\Phi_{\mathrm{text}}$ and extract its corresponding global text embedding (taking the [CLS] token) $f_m \in \mathbb{R}^{D}$ via the Q-Former. This process is formulated as follows:

$$F_r = \text{Q-Former}(\Phi_{\mathrm{visual}}(x_r)), \quad f_m = \text{Q-Former}(\Phi_{\mathrm{text}}(x_m)). \tag{2}$$

Spatial Anchoring and Ambiguity Quantification. To determine the scope of the modification instruction within the visual space, we establish a fine-grained correlation between the text vector $f_m$ and the visual tokens $F_r$. We map features to a common subspace via learnable projection matrices $W_q, W_k \in \mathbb{R}^{D \times D}$ and compute the attention distribution of the text over visual regions $a = [a_1, \dots, a_i, \dots, a_L] \in \mathbb{R}^{L}$, formulated as

$$a = \mathrm{softmax}\left(\frac{(f_m W_q)(F_r W_k)^{\top}}{\sqrt{D}}\right), \tag{3}$$

where $a_i \in [0, 1]$ is a scalar reflecting the relevance of the $i$-th visual region to the modification instruction. However, simple attention weights cannot directly reflect instruction ambiguity (i.e., whether the instruction points to multiple similar targets simultaneously). Therefore, we introduce Shannon Entropy $H^{(i)}$ to quantify the focus uncertainty for the $i$-th sample:

$$H^{(i)} = -\sum_{j=1}^{L} a_j^{(i)} \log\left(a_j^{(i)}\right), \tag{4}$$

where $L$ is the total number of image tokens. The numerical properties of $H^{(i)}$ keenly capture changes in the distribution shape: when the instruction is clear, attention is concentrated, resulting in a lower entropy value; conversely, when the instruction is ambiguous, leading to dispersed attention, the entropy value increases significantly. This metric is subsequently used in the exponential form $e^{H^{(i)}}$ as an adaptive penalty weight to regulate the intensity of regularization.
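To make the behaviour of Equations (3) and (4) concrete, the sketch below computes a scaled dot-product attention distribution over a handful of visual tokens and then its Shannon entropy. It is a minimal illustration rather than the authors' implementation: the inputs are assumed to be already projected (that is, the products with the learnable matrices have been applied), the natural logarithm is assumed, and all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_tokens(query, keys):
    """Equation (3): scaled dot-product attention of one query over L keys.

    `query` stands for the projected text vector and `keys` for the projected
    visual tokens; both projections are assumed to have been applied already.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    return softmax(scores)

def shannon_entropy(a):
    """Equation (4): H = -sum_j a_j log a_j (terms with a_j = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in a if p > 0.0)

query = [1.0, 0.0]
focused_keys = [[8.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # one region clearly matches
ambiguous_keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # three regions match equally

h_focused = shannon_entropy(attention_over_tokens(query, focused_keys))
h_ambiguous = shannon_entropy(attention_over_tokens(query, ambiguous_keys))
```

In the ambiguous case the attention is uniform over the three tokens, so the entropy reaches its maximum of log 3, whereas the focused case yields a value close to zero. The exponential weight built from the entropy is therefore larger for the ambiguous instruction.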

Latent Draft Construction. This step aims to construct an initial query estimation that integrates visual focus and text intent. Serving as a bridge between the original and target feature spaces, this latent draft explicitly aggregates focal features and injects text semantics. This process provides a semantically rich initialization, although its distribution requires subsequent structural correction. Specifically, we perform a two-stage feature synthesis based on the spatial anchoring distribution $a$. First, we perform weighted aggregation on the reference visual tokens to extract the focal visual feature $f_{\mathrm{focus}} \in \mathbb{R}^{D}$ most relevant to the modification instruction:

$$f_{\mathrm{focus}} = a \cdot F_r. \tag{5}$$

Subsequently, to inject the modification intent into the visual representation, we concatenate the focal feature $f_{\mathrm{focus}}$ with the global text embedding $f_m$ along the channel dimension, yielding a joint feature vector $[f_{\mathrm{focus}} \,\|\, f_m] \in \mathbb{R}^{2D}$. Since direct concatenation leads to feature distribution heterogeneity, we further introduce a Multi-layer Perceptron (MLP) consisting of two fully connected layers and GELU activation functions to map it back to the original latent space $\mathbb{R}^{D}$. The latent query $z_q \in \mathbb{R}^{D}$ generated by this process is formulated as

$$z_q = \mathrm{MLP}([f_{\mathrm{focus}} \,\|\, f_m]). \tag{6}$$

Here, $z_q$ can be regarded as an initial estimation of the target image. Although it integrates visual focus and text semantics, it relies only on linear weighting and simple projection, meaning its distribution often does not fully align with the real target image, thus necessitating subsequent structural regularization for calibration.
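The two-stage synthesis of Equations (5) and (6) can be sketched as follows. This is an illustrative toy under stated assumptions, not the trained model: the hidden width, the placement of a single GELU between the two dense layers, and the randomly initialised weights returned by the hypothetical `init_params` helper are all assumptions, since the text does not specify them.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def focus_feature(a, tokens):
    """Equation (5): attention-weighted sum of the L visual tokens."""
    dim = len(tokens[0])
    return [sum(a_i * tok[d] for a_i, tok in zip(a, tokens)) for d in range(dim)]

def latent_draft(a, tokens, f_m, params):
    """Equation (6): concatenate [f_focus || f_m], then map back to R^D."""
    joint = focus_feature(a, tokens) + f_m  # list concatenation, length 2D
    hidden = [gelu(h) for h in linear(joint, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])

def init_params(dim, hidden, seed=0):
    """Random stand-in weights (hypothetical; the real weights are learned)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(hidden, 2 * dim), "b1": [0.0] * hidden,
            "w2": mat(dim, hidden), "b2": [0.0] * dim}

D = 3
tokens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # L = 2 visual tokens
a = [0.25, 0.75]                             # attention from Equation (3)
f_m = [0.5, -0.5, 1.0]                       # global text embedding
f_focus = focus_feature(a, tokens)
z_q = latent_draft(a, tokens, f_m, init_params(D, hidden=8))
```

The draft has the same dimension as the input features, which is what allows it to be compared directly with target embeddings in the subsequent regularization step.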

Structural Consistency Regularization. To correct the distribution shift of $z_q$ in an unsupervised manner, we propose structural consistency regularization. The core idea is that the relative similarity relationships of the generated query feature with other samples within a batch should remain consistent with those of the real target image. Specifically, for data with a batch size of $B$, we compute the affinity distribution $p_i^{q} \in \mathbb{R}^{B}$ (Student Distribution) between the query feature $z_q^{(i)}$ generated for the $i$-th sample and other samples in the batch; simultaneously, we compute the affinity distribution $p_i^{t} \in \mathbb{R}^{B}$ (Teacher Distribution) for the corresponding ground-truth target image $z_t^{(i)}$:

$$p_i^{q}[j] = \frac{\exp(\mathrm{sim}(z_q^{(i)}, z_q^{(j)}) / \tau)}{\sum_{k=1}^{B} \exp(\mathrm{sim}(z_q^{(i)}, z_q^{(k)}) / \tau)}, \tag{7}$$

where $\mathrm{sim}(u, v) = u^{\top} v / (\|u\| \|v\|)$ is the cosine similarity. The calculation for $p_i^{t}$ follows the same logic. Subsequently, we minimize the difference between these two distributions using Kullback–Leibler (KL) divergence and integrate the ambiguity entropy $H$ as an adaptive weight:

$$\mathcal{L}_{\mathrm{scr}} = \frac{1}{B} \sum_{i=1}^{B} e^{H^{(i)}} \cdot \mathrm{KL}(p_i^{t} \,\|\, p_i^{q}), \tag{8}$$

where $H^{(i)}$ is calculated by Equation (4). The factor $e^{H^{(i)}}$ acts as an adaptive penalty: when instruction ambiguity is high, the model lacks confidence in the structure of the generated features, thus imposing stronger structural constraints to force strict adherence to the geometric distribution patterns of real images.
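The following sketch computes the entropy-weighted loss of Equations (7) and (8) for a toy batch. It is a minimal illustration under assumptions: the temperature value of 0.1 is a placeholder, the self-similarity term is kept in the softmax because the sum in Equation (7) runs over the whole batch, and the function names and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def affinity_rows(z, tau):
    """Equation (7): for each sample i, a softmax over similarities to the batch."""
    rows = []
    for z_i in z:
        logits = [cosine(z_i, z_j) / tau for z_j in z]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(p_j * math.log(p_j / q_j) for p_j, q_j in zip(p, q) if p_j > 0.0)

def scr_loss(z_q, z_t, entropies, tau=0.1):
    """Equation (8): batch mean of exp(H_i) * KL(teacher_i || student_i)."""
    student = affinity_rows(z_q, tau)
    teacher = affinity_rows(z_t, tau)
    weighted = [math.exp(h) * kl_divergence(t, s)
                for h, t, s in zip(entropies, teacher, student)]
    return sum(weighted) / len(z_q)

z_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # target embeddings
z_q_good = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # same batch structure as the targets
z_q_bad = [[1.0, 0.0], [1.0, 0.1], [1.0, 1.0]]   # second query has drifted

loss_good = scr_loss(z_q_good, z_t, entropies=[0.0, 0.0, 0.0])
loss_bad = scr_loss(z_q_bad, z_t, entropies=[0.0, 0.0, 0.0])
loss_bad_ambiguous = scr_loss(z_q_bad, z_t, entropies=[1.0, 1.0, 1.0])
```

When the queries reproduce the neighbourhood structure of the targets the loss vanishes, and for a fixed structural mismatch a higher entropy scales the penalty up, which is the adaptive behaviour described above.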

3.3. Orthogonal Subspace Projection (OSP)

Following query feature calibration to the target distribution via the SFR module, we obtain a latent representation exhibiting distributional robustness, yet its internal semantic components may remain coupled. Standard feature fusion methods, such as concatenation or addition, suffer from non-orthogonal semantic overlap, where strong modification signals often bleed into and erode the preserved context. Soft attention alone is insufficient to prevent this “semantic erosion”. We require a hard geometric constraint, specifically orthogonal projection, to create a “safe” subspace. By mathematically forcing the modification vector to be orthogonal to the invariant context, we guarantee that the injected semantics have zero projection onto the background features, thereby preserving the original identity by design. Accordingly, the OSP module performs orthogonal decomposition within the feature space to ensure that the modification increment and constant background remain mutually non-interfering.

Invariant Context Anchoring. To achieve decoupling, we first identify the constant base within the reference image. Recalling the SFR module, we obtained the spatial attention distribution $a$. Based on the complementarity principle, $(1 - a)$ corresponds to non-modified regions, such as the background or preserved subject. Accordingly, we extract the invariant feature $f_{\mathrm{inv}}$ from the reference image, formulated as follows:

$$f_{\mathrm{inv}} = \sum_{i=1}^{L} (1 - a_i) F_{r,i}, \tag{9}$$

where $f_{\mathrm{inv}} \in \mathbb{R}^{D}$ captures contextual details within the visual scene ignored by the modification instruction. It defines a subspace direction within the feature space that must be strictly protected from encroachment by the modification instruction.

Conflict Quantification and Orthogonal Decoupling. Theoretically, the latent draft $z_q$ generated by SFR represents the ideal target state; however, practically, it often contains redundant information linearly correlated with the background. To quantify this feature conflict, we calculate the cosine similarity $C$ between $z_q$ and $f_{\mathrm{inv}}$ as $C = \frac{f_{\mathrm{inv}}^{\top} z_q}{\|f_{\mathrm{inv}}\| \|z_q\|}$. A high $C$ value implies that the recovered feature perturbs the constant background while attempting to modify the target. To extract a pure modification increment, we employ the Gram–Schmidt orthogonalization principle to project $z_q$ onto the orthogonal complement of $f_{\mathrm{inv}}$. By eliminating the projection component of $z_q$ along the background direction, we derive the orthogonal modification vector $v_{\mathrm{mod}}$ as follows:

$$v_{\mathrm{mod}} = z_q - \left(\frac{z_q^{\top} f_{\mathrm{inv}}}{f_{\mathrm{inv}}^{\top} f_{\mathrm{inv}}}\right) f_{\mathrm{inv}}. \tag{10}$$

Mathematically, this operation guarantees $v_{\mathrm{mod}} \perp f_{\mathrm{inv}}$. Through such enforced decoupling, $v_{\mathrm{mod}}$ encapsulates only pure semantic variations relative to the background, thereby eliminating the potential interference of the modification signal on the background context.
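The decomposition described in Equations (9) and (10) is a single Gram–Schmidt step, and can be checked numerically with the sketch below. The function names and toy values are hypothetical, and the code assumes a non-zero invariant feature so that the division is well defined.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def invariant_context(a, tokens):
    """Equation (9): sum over tokens of (1 - a_i) * F_{r,i}."""
    dim = len(tokens[0])
    return [sum((1.0 - a_i) * tok[d] for a_i, tok in zip(a, tokens)) for d in range(dim)]

def conflict(z_q, f_inv):
    """Cosine similarity C between the latent draft and the invariant context."""
    return dot(f_inv, z_q) / (math.sqrt(dot(f_inv, f_inv)) * math.sqrt(dot(z_q, z_q)))

def orthogonal_increment(z_q, f_inv):
    """Equation (10): remove from z_q its component along f_inv."""
    coeff = dot(z_q, f_inv) / dot(f_inv, f_inv)
    return [z - coeff * f for z, f in zip(z_q, f_inv)]

tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # L = 2 visual tokens
a = [0.75, 0.25]                             # attention from Equation (3)
z_q = [1.0, 1.0, 1.0]                        # latent draft from Equation (6)

f_inv = invariant_context(a, tokens)         # [0.25, 0.75, 0.0]
c_before = conflict(z_q, f_inv)
v_mod = orthogonal_increment(z_q, f_inv)
c_after = conflict(v_mod, f_inv)
```

Before the projection the draft overlaps with the background direction (`c_before` is positive); afterwards the inner product between `v_mod` and `f_inv` is zero up to floating-point error, which is the orthogonality stated in the text.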

3.4. Geometric Composition and Context Preservation (GCCP)

The core objective of this module is to reassemble the decoupled invariant visual context $f_{\mathrm{inv}}$ and the semantic modification increment $v_{\mathrm{mod}}$ into the final query feature $z_{\mathrm{final}}$. To ensure that the modification does not disrupt visual structures within the reference image that are unaffected by the instruction, we formulate the visual preservation goal as a contextual fidelity constraint. This mandates that the projection of the synthesized feature onto the invariant subspace remains anchored to the original visual context, thereby preventing contextual drift induced by semantic modification.

Uncertainty-Aware Geometric Composition. Simple linear superposition overlooks the impact of instruction ambiguity on modification reliability. To robustly embed modification semantics within the geometric space, we construct an adaptive synthesis mapping utilizing the uncertainty entropy $H$ derived from the SFR module (detailed in Equation (4)). Modeling this synthesis process as a controlled translation upon the constant contextual base, and leveraging the robust cross-modal fusion capabilities of the Q-Former as a foundation for contextual interaction, we extract the final composed feature $z_{\mathrm{final}} \in \mathbb{R}^{D}$, formulated as follows:

$$z_{\mathrm{final}} = \text{Q-Former}(f_{\mathrm{inv}} + \phi(H) \cdot v_{\mathrm{mod}}, F_r, F_m), \tag{11}$$

where $\phi(\cdot)$ denotes an MLP-based monotonically non-increasing gating function. When instruction ambiguity $H$ is elevated, implying unreliable modification instructions, $\phi(H) \to 0$, compelling the query vector to revert to the visual context anchor to mitigate high-risk semantic modifications. Conversely, lower ambiguity permits the modification increment to drive substantial feature updates.
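The gating behaviour in Equation (11) can be illustrated with a simple stand-in. The sketch below covers only the first argument passed to the Q-Former, not the Q-Former itself, and it replaces the learned MLP-based gate with a fixed exponential decay; that choice, the `beta` parameter, and all names and values are assumptions made purely for illustration.

```python
import math

def gate(entropy, beta=1.0):
    """Stand-in for the gate: exp(-beta * H), equal to 1 at H = 0 and decaying to 0.

    The paper's gate is a learned MLP; this closed form is an assumption that
    only reproduces its monotonically non-increasing shape.
    """
    return math.exp(-beta * entropy)

def gated_anchor(f_inv, v_mod, entropy, beta=1.0):
    """First Q-Former argument in Equation (11): f_inv + gate(H) * v_mod."""
    g = gate(entropy, beta)
    return [f + g * v for f, v in zip(f_inv, v_mod)]

f_inv = [0.25, 0.75, 0.0]  # invariant context
v_mod = [0.6, -0.2, 1.0]   # orthogonal modification increment

clear = gated_anchor(f_inv, v_mod, entropy=0.0)       # full modification applied
ambiguous = gated_anchor(f_inv, v_mod, entropy=10.0)  # falls back towards f_inv
```

With zero entropy the increment is added in full, while a large entropy leaves the composed vector almost identical to the invariant anchor, matching the fallback behaviour described in the text.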

Contextual Fidelity Constraint via Orthogonal Projection. To ensure that the synthesized feature $z_{\mathrm{final}}$ introduces no noise detrimental to background semantics, we devise a contextual fidelity constraint based on orthogonal projection. Per the definition of the OSP module, ideal modifications should occur exclusively within the orthogonal complement of $f_{\mathrm{inv}}$. This implies that the projection of $z_{\mathrm{final}}$ onto the invariant subspace must align with $f_{\mathrm{inv}}$. Consequently, we propose the orthogonal projection consistency loss, formulated as follows:

$$\mathcal{L}_{\mathrm{ctx}} = \left\| \mathrm{Proj}_{f_{\mathrm{inv}}}(z_{\mathrm{final}}) - f_{\mathrm{inv}} \right\|_2^2 = \left\| \frac{z_{\mathrm{final}}^{\top} f_{\mathrm{inv}}}{\| f_{\mathrm{inv}} \|^2} f_{\mathrm{inv}} - f_{\mathrm{inv}} \right\|_2^2, \tag{12}$$

where $\mathrm{Proj}_{u}(v)$ denotes the orthogonal projection of vector $v$ onto the linear subspace spanned by vector $u$, and $\| \cdot \|_2$ represents the $L_2$ norm. This constraint enforces that the modification increment $v_{\mathrm{mod}}$ operates solely within the null space of $f_{\mathrm{inv}}$, thereby mathematically guaranteeing absolute fidelity of the visual context. Consequently, the model strictly freezes irrelevant background and environmental information while altering target attributes.
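A short numerical check helps show when the loss in Equation (12) vanishes. In the sketch below, the function names and the two-dimensional toy vectors are hypothetical; the code assumes a non-zero invariant feature.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(v, u):
    """Orthogonal projection of v onto the line spanned by u."""
    coeff = dot(v, u) / dot(u, u)
    return [coeff * u_i for u_i in u]

def context_loss(z_final, f_inv):
    """Equation (12): squared L2 distance between Proj_{f_inv}(z_final) and f_inv."""
    proj = project(z_final, f_inv)
    return sum((p - f) ** 2 for p, f in zip(proj, f_inv))

f_inv = [3.0, 4.0]

# f_inv plus an increment orthogonal to it ([2.0, -1.5]): the context is preserved.
preserved = [5.0, 2.5]
# f_inv scaled by 2: the component along the context direction has drifted.
drifted = [6.0, 8.0]

loss_preserved = context_loss(preserved, f_inv)
loss_drifted = context_loss(drifted, f_inv)
```

Adding any vector orthogonal to `f_inv` leaves the loss at zero, whereas changing the component along `f_inv` is penalised; in the drifted example the penalty equals the squared norm of `f_inv`.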

3.5. Optimization Objective

To align the multi-modal query and target image within a unified metric space, we employ the standard Batch-based Classification Loss [34] as the primary optimization objective. For $B$ triplets in a batch, $\mathcal{L}_{\mathrm{retrieval}}$ is defined as

$$\mathcal{L}_{\mathrm{retrieval}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(z_{\mathrm{final}}^{(i)}, z_t^{(i)}) / \tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(z_{\mathrm{final}}^{(i)}, z_t^{(j)}) / \tau)}, \tag{13}$$

where $\mathrm{sim}(\cdot)$ denotes the cosine similarity, $\tau$ represents the temperature parameter, and $z_t$ is the target image feature, computed following Equation (2).
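Equation (13) is a softmax cross-entropy over in-batch targets, and a minimal sketch of it is given below. The temperature of 0.1 is a placeholder rather than a value reported in the text, and the function names and toy batches are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieval_loss(z_final, z_target, tau=0.1):
    """Equation (13): mean cross-entropy of matching query i to target i in the batch."""
    total = 0.0
    for i, query in enumerate(z_final):
        logits = [cosine(query, target) / tau for target in z_target]
        m = max(logits)
        # log-sum-exp of the logits, computed stably
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(z_final)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]    # query i matches target i
swapped_targets = [[0.0, 1.0], [1.0, 0.0]]    # query i matches the wrong target
identical_targets = [[1.0, 0.0], [1.0, 0.0]]  # targets are indistinguishable

loss_aligned = retrieval_loss(queries, aligned_targets)
loss_swapped = retrieval_loss(queries, swapped_targets)
loss_uninformative = retrieval_loss([[1.0, 0.0], [1.0, 0.0]], identical_targets)
```

The loss is close to zero when each query is nearest to its own target, large when it is nearest to another sample's target, and equal to log B when the targets in the batch cannot be told apart.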

Finally, by integrating all aforementioned modules, we formulate the final optimization objective of the SOP network as follows:

$$\Theta^{*} = \underset{\Theta}{\arg\min} \left( \mathcal{L}_{\mathrm{retrieval}} + \lambda_1 \mathcal{L}_{\mathrm{scr}} + \lambda_2 \mathcal{L}_{\mathrm{ctx}} \right), \tag{14}$$

where $\Theta$ represents all learnable parameters in the SOP network, while $\lambda_1$ and $\lambda_2$ act as trade-off hyper-parameters to balance the different constraint terms. Through the synergy of geometric constraints ($\mathcal{L}_{\mathrm{scr}}$ and $\mathcal{L}_{\mathrm{ctx}}$) and discriminative learning ($\mathcal{L}_{\mathrm{retrieval}}$), this objective function achieves the robust decoupling and recomposition of complex visual semantics.

4. Experiment

In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed SOP framework. We first outline the experimental setup, followed by a quantitative performance comparison with state-of-the-art methods. Subsequently, we perform detailed ablation studies to validate the contribution of each module, specifically focusing on Selective Focus Restoration (SFR), Orthogonal Subspace Projection (OSP), and Geometric Composition and Context Preservation (GCCP). Finally, we provide qualitative analyses to visualize how SOP addresses focus ambiguity and semantic erosion.

4.1. Experimental Settings

Datasets. Following previous works [10,37,45], we evaluate SOP on three widely used benchmarks: FashionIQ, Shoes, and CIRR. Notably, the fashion domain serves as a critical analytical benchmark for fine-grained vision–language tasks due to its structured attribute definitions [46]. While FashionIQ and Shoes provide rigorous testbeds for fine-grained attribute decoupling, they are domain specific and limited in terms of semantic diversity compared to open-world concepts. To mitigate this bias, we include CIRR to verify the model’s generalization capabilities. However, we acknowledge that the current evaluation primarily focuses on object-centric retrieval, and performance on abstract or scene-level retrieval (e.g., complex event reasoning) remains an area for future exploration.

FashionIQ [47]: This represents the baseline for explicit attribute manipulation (e.g., specific changes in color or material). To evaluate retrieval performance within the fashion domain, we employ this large-scale dataset containing 77,684 images. Following standard evaluation protocols, the dataset is segregated into three independent subsets: Dresses, Shirts, and Tops&Tees. The data partition includes approximately 18,000 triplets allocated for training, with an additional 6000 triplets reserved for validation purposes. The fashion domain serves as a crucial diagnostic benchmark for disentanglement. Unlike general open-domain scenes, fashion images contain highly structured attributes. This allows us to rigorously diagnose whether the model successfully decouples the "modification" from the "preservation", rather than just learning superficial correlations.
Shoes [48]: This represents an intermediate level, testing relative changes (e.g., “shinier” and “higher”) and requiring finer-grained comparative reasoning. This benchmark is specifically curated for attribute-aware retrieval in the footwear domain, sourced from e-commerce platforms. It is characterized by rich, attribute-specific textual feedback that describes fine-grained visual modifications (e.g., changes in color or structural details). In terms of data distribution, it offers 10,000 training samples and 4658 testing samples, facilitating robust model optimization and assessment on dialogue-based queries.
CIRR [36]: This represents the most challenging level, targeting ambiguous real-world scenarios with complex, unconstrained semantic transitions. Distinct from domain-specific datasets, CIRR serves as an open-domain benchmark designed to test the model’s robustness in unconstrained, real-life scenarios. It emphasizes zero-shot generalization and the understanding of complex semantic transitions within scenes. The corpus consists of 36,000 relative caption pairs, which are split into a training set of 28,000 pairs and a validation set of 8000 pairs, challenging the model to interpret high-level visual transformations rather than simple attribute manipulation.

Dataset Characteristics: We clarify the specific evaluation roles of these datasets as follows:

FashionIQ: This benchmark primarily tests the model’s ability to execute strict attribute replacement (e.g., changing “white” to “blue”) while maintaining the structural integrity of the reference object.
Shoes: This dataset evaluates the model’s sensitivity to relative feedback, requiring it to understand comparative adjectives (e.g., “more athletic”) rather than just categorical tags.
CIRR: This benchmark serves as a stress test for robustness against ambiguity, assessing whether the model can handle vague instructions and diverse visual distributions in unconstrained open domains.

Implementation Details. We summarize all key implementation details and hyper-parameters in Table 1. To ensure transparency, the core architectural implementation is available in our repository (https://github.com/Cheng-SDU/SOP/, accessed on 3 March 2026).

Evaluation. To ensure a fair comparison with existing state-of-the-art methods, we strictly adhere to the standard evaluation metrics widely adopted within the CIR community. We primarily quantify retrieval performance using Recall@k (R@k), which measures the proportion of queries whose ground-truth target appears among the top-k retrieved candidates.

Fashion-Domain Evaluation: For the FashionIQ and Shoes datasets, we focus on the model’s performance in the top-ranked results. On the Shoes dataset, we report R@1, R@10, and R@50 along with their averages. For FashionIQ, we compute R@10 and R@50 separately for three subsets and report the average score across all categories to comprehensively reflect the model’s fine-grained attribute awareness.
Open-Domain Evaluation: For the CIRR dataset, the evaluation criteria are more stringent. Beyond standard R@k ($k \in \{1, 5, 10, 50\}$) metrics, we introduce subset recall $R_{subset}$@k ($k \in \{1, 2, 3\}$). This metric is computed only on curated subsets containing high-difficulty negative samples, providing a more precise reflection of the model’s ability to exclude irrelevant items. The final performance is determined by the average of R@5 and $R_{subset}$@1.
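The metrics described above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the official evaluation script: the function names, the candidate IDs, and the toy rankings are hypothetical, and each curated subset is assumed to contain the query's ground-truth target.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose ground-truth target is in the top-k list."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


def subset_recall_at_k(rankings, targets, subsets, k):
    """R_subset@k: re-rank using only candidates from each query's curated subset."""
    restricted = [
        [candidate for candidate in ranked if candidate in subset]
        for ranked, subset in zip(rankings, subsets)
    ]
    return recall_at_k(restricted, targets, k)


def cirr_average(r_at_5, r_subset_at_1):
    """Final CIRR score: the mean of R@5 and R_subset@1."""
    return (r_at_5 + r_subset_at_1) / 2.0


# Hypothetical ranked candidate IDs for three queries (best match first).
rankings = [["a", "b", "c", "d"], ["c", "a", "d", "b"], ["d", "c", "b", "a"]]
targets = ["a", "d", "a"]
subsets = [{"a", "b"}, {"d", "b"}, {"a", "b"}]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_3 = recall_at_k(rankings, targets, 3)
r_subset_at_1 = subset_recall_at_k(rankings, targets, subsets, 1)
r_subset_at_2 = subset_recall_at_k(rankings, targets, subsets, 2)

# Applying the averaging rule to the R@5 and R_subset@1 values reported for SOP.
sop_cirr_avg = cirr_average(80.81, 77.04)
```

Restricting the ranking to the curated subset removes easy negatives, so $R_{subset}$@k isolates how well the model separates the target from visually similar distractors.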

4.2. Performance Comparison

We compare SOP against two categories of CIR methods: (1) conventional methods (e.g., TIRG [34], ARTEMIS [49], MGUR [45], and IUDC [11]) and (2) VLP-based methods (e.g., CLIP4CIR [37], SSN [33], BLIP4CIR [38], CoVR [50], CoVR-2 [51], and ENCODER [10]).

The quantitative results on FashionIQ, Shoes, and CIRR are summarized in Table 2 and Table 3. From the comparative results, we can draw the following observations on three benchmarks:

Results on FashionIQ. As shown in Table 2, SOP achieves the best average performance on the FashionIQ dataset. Specifically, our approach achieves an average R@10 of 56.55% and R@50 of 78.98%, which is 0.42 and 1.39 points above the strongest baseline, ENCODER. This improvement is particularly pronounced in the Dresses and Shirts categories. We attribute this success to the synergistic interaction between the OSP and GCCP modules. OSP mathematically decouples modification signals from invariant visual context, preventing loss of original structural details (i.e., mitigating semantic erosion). Meanwhile, the GCCP module enforces orthogonal projection consistency constraints, ensuring synthesized features strictly preserve garment identity attributes (e.g., sleeve length and collar structure) while maintaining high contextual fidelity even when color or texture changes occur.
Results on Shoes. As shown in Table 2, performance comparisons on the Shoes dataset further validate SOP’s robustness in domain-specific retrieval scenarios involving relative attribute feedback. SOP achieves 26.17% R@1 and 64.38% R@10, outperforming the strongest reported baseline (MGUR) by 8.10 points in average recall (58.73% vs. 50.63%). Queries in this dataset frequently contain relative descriptions such as “darker”, “higher heels”, or “more athletic”, requiring the model to strictly distinguish between areas requiring modification and those needing preservation. The test results validate the effectiveness of our proposed constant context anchoring mechanism. By explicitly defining constant feature subspaces (e.g., shoe silhouette contours) and enforcing modifications only within the orthogonal complement space, SOP accurately captures subtle relative changes while minimizing background noise interference during retrieval.
Results on CIRR. Table 3 presents results on the challenging open-domain CIRR dataset, highlighting the core strengths of our framework. SOP achieves strong performance on the test set, with R@5 reaching 80.81% and $R_{subset}$@1 at 77.04%. Notably, the CIRR dataset exhibits high visual variability and linguistic ambiguity (e.g., vague instructions like “change it to this” or “make it look like…”), characteristics that typically confound traditional VLP-based models. The experimental results validate the effectiveness of the SFR module. By integrating entropy-aware uncertainty modeling with structural distribution alignment techniques, SOP successfully leverages global priors to resolve focus ambiguity, enabling accurate target retrieval even when local visual cues are insufficient or misleading.

Table 2. Performance comparison with state-of-the-art methods on the FashionIQ and Shoes datasets. We report Recall@K (K = 10 and 50) for three categories (Dresses, Shirts, and Tops&Tees) in FashionIQ and Recall@K (K = 1, 10, and 50) for Shoes. The best and second-best results are highlighted in bold and underlined, respectively. “-” indicates that the result is not available.

Method	Dresses R@10	Dresses R@50	Shirts R@10	Shirts R@50	Tops&Tees R@10	Tops&Tees R@50	FashionIQ Avg R@10	FashionIQ Avg R@50	Shoes R@1	Shoes R@10	Shoes R@50	Shoes Avg
TIRG [34] (CVPR’19)	14.87	34.66	18.26	37.89	19.08	39.62	17.40	37.39	12.60	45.45	69.39	42.48
LF-CLIP [32] (CVPR’22)	31.63	56.67	36.36	58.00	38.19	62.42	35.39	59.03	-	-	-	-
ARTEMIS [49] (ICLR’22)	27.16	52.40	21.78	43.64	29.20	54.83	26.05	50.29	18.72	53.11	79.31	50.38
CLIP4CIR [37] (TOMM’23)	33.81	59.40	39.99	60.45	41.41	65.37	38.40	61.74	-	-	-	-
MGUR [45] (ICLR’24)	32.61	61.34	33.23	62.55	41.40	72.51	35.75	65.47	18.41	53.63	79.84	50.63
SSN [33] (AAAI’24)	34.36	60.78	38.13	61.83	44.26	69.05	38.92	63.89	-	-	-	-
BLIP4CIR [38] (WACV’24)	40.65	66.34	40.38	64.13	46.86	69.91	42.63	66.79	-	-	-	-
CoVR [50] (AAAI’24)	44.55	69.03	48.43	67.42	52.60	74.31	48.53	70.25	-	-	-	-
CoVR-2 [51] (TPAMI’24)	46.53	69.60	51.23	70.64	52.14	73.27	49.97	71.17	-	-	-	-
IUDC [11] (TOIS’25)	35.22	61.90	41.86	63.52	42.19	69.23	39.76	64.88	-	-	-	-
ENCODER [10] (AAAI’25)	51.51	76.95	54.86	74.93	62.01	80.88	56.13	77.59	-	-	-	-
SOP (Ours)	52.17	79.34	55.63	76.29	61.84	81.32	56.55	78.98	26.17	64.38	85.65	58.73

Table 3. Quantitative results on the CIRR test set. The evaluation metrics include standard Recall@K (K = 1, 5, 10, and 50) and subset-based $Recall_{subset}$@K (K = 1, 2, and 3). The last column shows the average performance of R@5 and $R_{subset}$@1. The best performance is marked in bold, and the second best is underlined.

Method	R@1	R@5	R@10	R@50	R_subset@1	R_subset@2	R_subset@3	Avg
TIRG [34] (CVPR’19)	14.61	48.37	64.08	90.03	22.67	44.97	65.14	35.52
ARTEMIS [49] (ICLR’22)	16.96	46.10	61.31	87.73	39.99	62.20	75.67	43.05
LF-CLIP [32] (CVPR’22)	33.59	65.35	77.35	95.21	62.39	81.81	92.02	63.87
BLIP4CIR [38] (WACV’24)	40.17	71.81	83.18	95.69	72.34	88.70	95.23	72.08
SSN [33] (AAAI’24)	43.91	77.25	86.48	97.45	71.76	88.63	95.54	74.51
CoVR [50] (AAAI’24)	49.69	78.60	86.77	94.31	75.01	88.12	93.16	76.81
CoVR-2 [51] (TPAMI’24)	50.43	81.08	88.89	98.05	76.75	90.34	95.78	78.92
ENCODER [10] (AAAI’25)	46.10	77.98	87.16	97.64	76.92	90.41	95.95	77.45
SOP (Ours)	50.72	80.81	88.59	98.07	77.04	90.63	96.18	78.93

4.3. Ablation Study

To systematically analyze the contribution of each component, we conduct ablation studies on the CIRR dataset. We classify the variants into three groups: G1 for SFR, G2 for OSP, and G3 for GCCP.

G1: Ablation on Selective Focus Recovery (SFR). We first investigate the efficacy of the SFR module in handling focus ambiguity and distribution alignment. The variants are defined as follows:

D#(1) w/o SFR: We remove the entire SFR module and directly add the reference image features with the modification text features to form the composed features.
D#(2) w/o Uncertainty Weighting ($H$): The SFR structure is retained; however, the entropy-driven adaptive weight in Equation (4) is replaced with a fixed scalar (set to 0.5) to fuse global priors.
D#(3) w/o Structural Alignment ($L_{scr}$): The uncertainty estimation is preserved, but the structural consistency regularization (KL divergence) is removed, relying solely on retrieval loss for training.

G2: Ablation on Orthogonal Subspace Projection (OSP). This group assesses the impact of signal decoupling in terms of preventing semantic erosion.

D#(4) w/o OSP: The orthogonal projection step is discarded, and the non-decoupled query features $z_{q}$ are directly input into the subsequent composition module.
D#(5) w/o Orthogonality: Simple vector subtraction is employed instead of strict Gram–Schmidt orthogonalization to extract the modification vector.

G3: Ablation on Geometric Composition (GCCP). Finally, we evaluate the contribution of the GCCP module in preserving contextual fidelity.

D#(6) w/o Adaptive Gating ($\phi(H)$): The uncertainty-controlled adaptive gating in the composition (Equation (11)) is replaced with a fixed connection coefficient.
D#(7) w/o Projection Consistency ($L_{ctx}$): The orthogonal projection consistency loss is removed, implying that the projection of synthesized features onto the invariant subspace is no longer constrained to revert to the original context.

Analysis of G1: As shown in Table 4, D#(1) exhibits the most significant performance degradation, underscoring the inefficacy of simple feature addition under focus ambiguity. When modification texts are vague (e.g., containing only pronouns), directly summed features often fall into low-density regions of the feature space, lacking explicit semantic direction. By introducing global distribution priors, SFR effectively provides a semantic anchor for ambiguous queries, realigning them with a reasonable target distribution. Comparing D#(2) with the full model reveals that fixed weights cannot adapt to the varying degrees of ambiguity in the CIRR dataset. For explicit instructions, excessive prior intervention introduces noise; for ambiguous ones, fixed weights are insufficient to correct deviations. The adaptive weight $H$ successfully functions as a confidence gate, activating prior guidance only when the model confronts instruction uncertainty, which is critical for the robustness of SOP. For D#(3), relying solely on the point-to-point retrieval loss tends to lead to local optima, whereas $L_{scr}$ serves as a regularization term that forces the generated feature distribution to remain consistent with the topological structure of the real image distribution, thereby improving retrieval.
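To make the contrast between the adaptive weight and the fixed scalar of D#(2) concrete, the sketch below blends a query feature with a global prior using a normalized-entropy weight. It is a simplified stand-in under explicit assumptions: the exact form of Equation (4) is not reproduced here, and the convex blending rule, the choice of distribution whose entropy is measured, and all feature values are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(n)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def fuse_with_prior(z_q, prior, weight):
    """Assumed convex blend: weight 0 keeps the query, weight 1 returns the prior."""
    return [(1.0 - weight) * q + weight * g for q, g in zip(z_q, prior)]


z_q = [1.0, 0.0]     # hypothetical query feature
prior = [0.0, 1.0]   # hypothetical global prior feature

explicit = [0.97, 0.01, 0.01, 0.01]    # peaked distribution: clear instruction
ambiguous = [0.25, 0.25, 0.25, 0.25]   # flat distribution: vague instruction

h_explicit = normalized_entropy(explicit)
h_ambiguous = normalized_entropy(ambiguous)

adaptive_explicit = fuse_with_prior(z_q, prior, h_explicit)
adaptive_ambiguous = fuse_with_prior(z_q, prior, h_ambiguous)
fixed = fuse_with_prior(z_q, prior, 0.5)  # D#(2)-style fixed scalar
```

Under these assumptions, the adaptive weight leaves a clear query almost untouched and leans on the prior only for the vague one, whereas the fixed scalar applies the same amount of prior to both.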

Analysis of G2: The experimental results indicate a substantial decline for D#(4) in fine-grained metrics (particularly FashionIQ R@10). This confirms that feature entanglement is the primary cause of semantic erosion. In a non-decoupled state, modification signals are highly coupled with background signals, where strengthening the former often suppresses the latter. Notably, the unsatisfactory performance of D#(5) suggests that simple algebraic subtraction cannot effectively strip semantic correlations. Feature vectors in multi-dimensional spaces often exhibit complex non-linear collinearity. The Gram–Schmidt orthogonalization employed by SOP constructs a complement space strictly orthogonal to the reference image features $f_{inv}$ from a geometric perspective. This strict orthogonality ensures that $v_{mod}$ serves as a pure orthogonal modification increment, effectively mitigating interference with the invariant context.
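The difference between plain subtraction (D#(5)) and a Gram–Schmidt step can be seen in a small numerical sketch. The vectors below are made up for illustration, and the helper names are hypothetical; the sketch shows only the single-vector case, in which the component of the query along $f_{inv}$ is removed.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def gram_schmidt_residual(z_q, f_inv):
    """Remove from z_q its component along f_inv (one Gram-Schmidt step)."""
    coeff = dot(z_q, f_inv) / dot(f_inv, f_inv)
    return [q - coeff * f for q, f in zip(z_q, f_inv)]


f_inv = [1.0, 2.0, 0.0]   # hypothetical invariant (reference) feature
z_q = [4.0, 3.0, 2.0]     # hypothetical entangled query feature

v_sub = subtract(z_q, f_inv)                # D#(5)-style plain subtraction
v_mod = gram_schmidt_residual(z_q, f_inv)   # orthogonal modification increment

leak_sub = dot(v_sub, f_inv)   # non-zero: still overlaps the invariant direction
leak_mod = dot(v_mod, f_inv)   # zero: orthogonal to f_inv
```

Plain subtraction removes exactly one copy of $f_{inv}$ regardless of how strongly the query is aligned with it, so a residual component along the invariant direction remains. The Gram–Schmidt step scales the removed component by the projection coefficient and therefore leaves none.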

Analysis of G3: The performance decline of D#(6) indicates that the feature composition process requires dynamic adjustment based on instruction reliability. For substantial modifications (high uncertainty), the model relies more on the generated modification vector $v_{mod}$; for fine-tuning (low uncertainty), it should preserve more of the original features $f_{inv}$. The adaptive gating $\gamma$ precisely controls this balance, preventing over-modification or under-modification. Regarding D#(7), a significant impact is observed in relation to high-precision metrics such as R@1. The orthogonal projection consistency loss essentially introduces a constraint requiring the projection of synthesized features onto the invariant subspace to strictly equal the original visual context. This geometric constraint establishes a final defense against semantic drift, ensuring that invariant features to be preserved are not accidentally lost while altering target attributes.
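The projection consistency constraint can be illustrated with a short sketch. It rests on two assumptions that are not confirmed by the text above, since Equations (11) and (12) are not reproduced here: the composition is taken to be $z_{final} = f_{inv} + \gamma v_{mod}$, and the loss is taken to be the squared distance between the projection of $z_{final}$ onto $f_{inv}$ and $f_{inv}$ itself. All names and values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto(z, basis):
    """Orthogonal projection of z onto the line spanned by basis."""
    coeff = dot(z, basis) / dot(basis, basis)
    return [coeff * b for b in basis]


def compose(f_inv, v_mod, gamma):
    """Assumed composition: invariant base plus gated modification increment."""
    return [f + gamma * v for f, v in zip(f_inv, v_mod)]


def context_loss(z_final, f_inv):
    """Assumed L_ctx: squared distance between proj(z_final) and f_inv."""
    projection = project_onto(z_final, f_inv)
    return sum((p - f) ** 2 for p, f in zip(projection, f_inv))


f_inv = [1.0, 2.0, 0.0]     # hypothetical invariant feature
v_orth = [2.0, -1.0, 2.0]   # increment orthogonal to f_inv
v_leaky = [3.0, 1.0, 2.0]   # increment that overlaps f_inv

loss_orth = context_loss(compose(f_inv, v_orth, gamma=0.8), f_inv)
loss_leaky = context_loss(compose(f_inv, v_leaky, gamma=0.8), f_inv)
```

With an orthogonal increment the projection of the composed feature returns $f_{inv}$ for any gate value, so the assumed loss vanishes. A leaky increment shifts the projection and is penalized, which is the drift that the D#(7) ablation removes the guard against.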

Additionally, to clarify that the performance gains stem from our SOP design rather than solely from the BLIP-2 backbone or freezing strategies, we performed three sets of additional ablation studies, as follows:

(1) Robustness across Different VLP Backbones: To verify its universality, we migrated the SOP framework to the CLIP (ViT-B/32) backbone, a widely used baseline in previous works (e.g., CLIP4CIR [37]). We replaced the BLIP-2 visual/text encoders with CLIP’s encoders and applied the SOP modules (SFR, OSP, and GCCP) on top of the CLIP features. As shown in Table 5, SOP provides significant improvements (4.5% on average) regardless of whether the backbone is CLIP or BLIP-2. This indicates that SOP serves as a “plug-and-play” geometric regularization module that is effective independently of the underlying VLP architecture.

(2) Analysis of Freezing Strategies and Training Efficiency: To verify whether the performance gain is due to the backbone or our design, and to evaluate the trade-off between efficiency and effectiveness, we compared three training strategies on the FashionIQ dataset: (i) Full Fine-tuning: unfreezing the entire ViT-G/14 visual encoder; (ii) LoRA (low-rank adaptation): applying LoRA to the visual encoder; and (iii) SOP (ours; frozen): keeping the visual encoder frozen and training only the SOP modules (SFR, OSP, and GCCP). As shown in Table 6, our frozen strategy (SOP) achieves a superior balance. Compared to Full Fine-tuning, SOP reduces trainable parameters by 98.5% (1.2B vs. 18M) and training time per epoch by about 80% (52 min vs. 10 min). Notably, SOP (56.55%) performs comparably to LoRA fine-tuning and stays close to Full Fine-tuning (57.10%), even surpassing it on some metrics. This suggests that blindly unfreezing the massive backbone on smaller CIR datasets can lead to overfitting, whereas our SOP design acts as an effective geometric regularizer, extracting robust features without the heavy cost of full re-training.

(3) Robustness under “Truly Open-Domain” Retrieval (CIRR). The CIRR dataset serves as our primary “open-domain” benchmark due to its unconstrained, real-life images and complex, non-template instructions. To further prove robustness, we evaluated the Generalization Gap by testing our model on unseen concepts within CIRR (using the standard split which ensures disjoint reference images). We also compared the inference latency to ensure it is viable for open-domain retrieval libraries.

As shown in Table 7, SOP achieves a significant boost on CIRR (+4.42% in R@5) compared to the strong BLIP-2 baseline. Crucially, the improvement on the open domain (CIRR) is consistent with the domain-specific (FashionIQ) results, suggesting that SOP’s Orthogonal Subspace Projection (OSP) is a generic geometric solution that handles the high semantic variance of open domains effectively, rather than just memorizing specific attributes. For large-scale retrieval over indexed galleries (e.g., with similarity-search libraries such as FAISS), query encoding speed is vital. SOP adds negligible latency (+2 ms) compared to the backbone, making it highly scalable for large-scale retrieval.

In summary, the gains are derived principally from the SOP architecture’s ability to geometrically decouple features, enabling high efficiency (frozen encoder) and strong robustness across both specific (FashionIQ) and open (CIRR) domains.

4.4. Parameter Sensitivity

We further investigated the sensitivity of SOP to the weights in the loss function. Figure 3a illustrates the trend of CIRR Avg as the weight $\lambda_{1}$ (in Equation (14)) increases from 0.1 to 1.0. We observe a gradual improvement in performance as $\lambda_{1}$ increases, peaking at $\lambda_{1} = 0.5$. This confirms that appropriately introducing distribution priors effectively regularizes latent features, but overly strong priors may obscure the image’s inherent personalized characteristics. For the projection consistency weight $\lambda_{2}$ (in Equation (14)), as shown in Figure 3b, the optimal value occurs around 0.7. This indicates that while context fidelity is important, overly stringent geometric constraints (excessively large $\lambda_{2}$) may limit the model’s ability to learn substantial visual transformations, necessitating a balanced approach.

Stability under Batch Size Variations. The default setting uses a batch size of 128. We additionally ran experiments on the FashionIQ dataset with batch sizes ranging from 16 to 256 to observe the impact on the SFR module. As shown in Table 8, the performance improves consistently as the batch size increases and peaks at 128. When the batch size is further increased to 256, the performance remains comparable to that of 128 (56.52 vs. 56.55), indicating that the model is insensitive to further increases in batch size once a sufficient number of samples is provided.

4.5. Qualitative Experiments

Case Study. To provide an intuitive understanding of how SOP mitigates semantic erosion and preserves visual context, we present qualitative comparisons between the full SOP model and the variant without Orthogonal Subspace Projection (w/o OSP) in Figure 4.

The top row of Figure 4 illustrates a typical case from the FashionIQ dataset. The user instruction is “replace white with blue”, which requires the model to alter the color attribute while strictly preserving the fine-grained structural details. As observed, the variant w/o OSP fails to maintain structural fidelity. Although it successfully retrieves blue dresses, the Top-1 result is a wrap-style dress that completely loses the original ruffled design. This indicates that without orthogonal decoupling, the strong semantic signal of “blue” eroded the invariant structural features of the reference image. In contrast, SOP effectively disentangles the color modification from the structural context. Its Top-1 retrieval is the exact target image, which adopts the requested blue color while perfectly retaining the complex design elements of the reference dress. This verifies that OSP successfully locks the invariant visual context in a protected subspace.

The bottom row demonstrates a more challenging scenario from the CIRR dataset, where the instruction “Get close to small dog…” implies a significant change in viewpoint and scale. In this case, the “modification” is the camera movement, while the “invariant context” is the identity of the dog itself. The w/o OSP variant retrieves a dog with similar colors but fails to preserve the specific identity features (e.g., the specific ear shape and facial expression), ranking the ground truth at Top-5. This suggests that the features representing the viewpoint transformation became entangled with the object identity features, causing a drift in the retrieved subject. Conversely, SOP accurately retrieves the ground truth at Top-1. By projecting the “zoom-in” signal onto the orthogonal complement of the identity features, SOP ensures that the geometric transformation does not compromise the discriminative features of the specific dog instance.

Attention Visualization. To empirically validate the geometric interpretability of the OSP module, we visualize the spatial response maps of the decoupled features in Figure 5.

Specifically, we employ Grad-CAM [52] to compute the feature sensitivity for the two orthogonal components: the orthogonal modification increment ($v_{mod}$) and the invariant feature ($f_{inv}$).

As shown in row (a), the instruction “…red electric locomotive - now silver” requires changing the color attribute.

Orthogonal Modification Increment: The map (left) exhibits high activation strictly concentrated on the locomotive body, acting as a precise “editing vector” for the target attribute.
Invariant Feature: Crucially, the invariant map (right) highlights the background structures (e.g., railway tracks and wheels), verifying that $f_{inv}$ actively anchors the retrieval on stable visual cues.

This disentanglement generalizes to additive queries as well. As observed in row (b) (“Add a second lemon slice”), $v_{mod}$ accurately focuses on the semantic region of the food and existing lemon to guide the addition, whereas $f_{inv}$ locks onto specific textural details and background to maintain object identity. The consistent spatial separation across different query types provides strong visual evidence that SOP effectively disentangles semantic modification from visual context.

5. Conclusions

To address instruction ambiguity and feature entanglement challenges in cross-modal data retrieval within intelligent sensor networks, we propose the Selective Orthogonal Projection Network (SOP). SOP redefines the feature fusion process of sensor data from the perspective of manifold geometry. By employing entropy-aware calibration within the Selective Focus Recovery (SFR) module, we effectively rectify feature distribution shifts caused by ambiguous instructions. Furthermore, the orthogonal decoupling mechanisms of Orthogonal Subspace Projection (OSP) and Geometric Composition Fidelity (GCCP) ensure the strict preservation of irrelevant visual backgrounds, thereby resolving semantic erosion. The experimental results demonstrate the significant superiority of SOP in handling complex queries. Beyond enhancing Composed Image Retrieval (CIR) accuracy, this work provides an effective geometric solution to the semantic gap in large-scale visual sensor data streams, exhibiting broad potential for next-generation intelligent video surveillance and human–computer interaction systems. Although verified extensively on fashion and object retrieval benchmarks, the core contribution of SOP, namely Geometric Decoupling via Orthogonal Projection, is methodological and domain agnostic. The principle of separating “modification signals” from “invariant contexts” is universally applicable to other sensor data modalities. However, our linear orthogonalization approach may struggle with highly non-linear semantic entanglement, and the current evaluation primarily focuses on object-centric retrieval. 
To address these limitations, future work will explore non-linear manifold projections (e.g., kernelized orthogonalization) and extend this framework to more abstract scene-level retrieval, as well as other non-fashion sensor data modalities, such as remote sensing image retrieval (where landscape context is invariant while seasons change) and medical imaging (where anatomical structure is invariant while distinct anomalies appear), to further validate its versatility.

Author Contributions

Conceptualization, S.C. and G.L.; methodology, S.C.; software, S.C.; validation, S.C.; formal analysis, S.C.; investigation, S.C.; resources, S.C. and G.L.; data curation, S.C.; writing—original draft preparation, S.C.; writing—review and editing, S.C. and G.L.; visualization, S.C.; supervision, G.L.; project administration, G.L.; and funding acquisition, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 62401342), the Natural Science Foundation of Shandong Province (No. ZR2024QF092, ZR2024LZH007, and ZR2025ZD24), the Guangdong Basic and Applied Basic Research Foundation (No. 2025A1515011826 and 2026A1515010768), the Shenzhen Fundamental Research Program (No. JCYJ20250604124702003), the Sichuan Provincial Key Laboratory of Philosophy and Social Science for Language Intelligence in Special Education (No. YYZN-2025-11), the National Key Research and Development Program of China (No. 2024YFC2418300 and 2024YFC2418303), and the Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education (No. SCCl2025YB02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code used in this study is openly available at https://github.com/Cheng-SDU/SOP/, accessed on 3 March 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Hou, K.; Tong, Q.; Yan, N.; Liu, X.; Hou, S. MCFA: Multi-Scale Cascade and Feature Adaptive Alignment Network for Cross-View Geo-Localization. Sensors 2025, 25, 4519.
Hwang, J.; Jeon, M.; Kim, J. Multimodal latent representation learning for video moment retrieval. Sensors 2025, 25, 4528.
Ye, Y.; Sun, Z.; Chen, J. Learning Hierarchically Consistent Disentanglement with Multi-Channel Augmentation for Public Security-Oriented Sketch Person Re-Identification. Sensors 2025, 25, 6155.
Zhong, X.; Liu, G.; Dong, X.; Li, C.; Li, H.; Cui, H.; Zhou, W. Automatic seizure detection based on stockwell transform and transformer. Sensors 2023, 24, 77.
Chen, Z.; Hu, Y.; Fu, Z.; Li, Z.; Huang, J.; Huang, Q.; Wei, Y. INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026.
Kwak, J.; Inhar, R.M.I.; Yun, S.Y.; Lee, S.J. QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval. arXiv 2025, arXiv:2507.12416.
Li, Z.; Hu, Y.; Chen, Z.; Zhang, S.; Huang, Q.; Fu, Z.; Wei, Y. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026.
Tang, H.; Wang, J.; Peng, Y.; Meng, G.; Luo, R.; Chen, B.; Chen, L.; Wang, Y.; Xia, S.T. Modeling uncertainty in Composed Image Retrieval via probabilistic embeddings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2025; pp. 1210–1222.
Wen, H.; Song, X.; Yin, J.; Wu, J.; Guan, W.; Nie, L. Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3665–3678.
Li, Z.; Chen, Z.; Wen, H.; Fu, Z.; Hu, Y.; Guan, W. ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Burlingame, CA, USA, 31 March–2 April 2025.
Ge, H.; Jiang, Y.; Sun, J.; Yuan, K.; Liu, Y. LLM-Enhanced Composed Image Retrieval: An Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model. ACM Trans. Inf. Syst. 2025, 43, 1–30.
Xu, X.; Liu, Y.; Khan, S.; Khan, F.; Zuo, W.; Goh, R.S.M.; Feng, C.M. Sentence-level Prompts Benefit Composed Image Retrieval. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7 May 2024.
Li, Z.; Hu, Y.; Chen, Z.; Huang, Q.; Qiu, G.; Fu, Z.; Liu, M. ReTrack: Evidence Driven Dual Stream Directional Anchor Calibration Network for Composed Video Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026.
Hu, Y.; Li, Z.; Chen, Z.; Huang, Q.; Fu, Z.; Xu, M.; Nie, L. REFINE: Composed Video Retrieval via Shared and Differential Semantics Enhancement. ACM Trans. Multimed. Comput. Commun. Appl. 2026.
Natha, S.; Ahmed, F.; Siraj, M.; Lagari, M.; Altamimi, M.; Chandio, A.A. Deep BiLSTM attention model for spatial and temporal anomaly detection in video surveillance. Sensors 2025, 25, 251.
Jain, V.; Wu, Z.; Zou, Q.; Florentin, L.; Turbell, H.; Siddhartha, S.; Timofte, R.; Gao, Q.; Jiang, L.; Luo, Q.; et al. NTIRE 2025 challenge on video quality enhancement for video conferencing: Datasets, methods and results. In Proceedings of the Computer Vision and Pattern Recognition Conference, Shanghai, China, 15–18 October 2025; pp. 1184–1194.
Liu, Q.; Meng, X.; Zhang, S.; Li, X.; Shao, F. A temporally insensitive spatio-temporal fusion method for remote sensing imagery via semantic prior regularization. Inf. Fusion 2025, 117, 102818.
Wu, K.; Zhang, Y.; Ru, L.; Dang, B.; Lao, J.; Yu, L.; Luo, J.; Zhu, Z.; Sun, Y.; Zhang, J.; et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat. Mach. Intell. 2025, 7, 1235–1249.
Kapania, S.; Wang, R.; Li, T.J.J.; Li, T.; Shen, H. ’I’m Categorizing LLM as a Productivity Tool’: Examining Ethics of LLM Use in HCI Research Practices. Proc. ACM Hum.-Comput. Interact. 2025, 9, 1–26.
Boonprakong, N.; Tag, B.; Goncalves, J.; Dingler, T. How Do HCI Researchers Study Cognitive Biases? A Scoping Review. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–20.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning; PMLR: Breckenridge, CO, USA, 2021; pp. 8748–8763.
Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision–language understanding and generation. In Proceedings of the International Conference on Machine Learning; PMLR: Breckenridge, CO, USA, 2022; pp. 12888–12900.
Wen, H.; Zhang, X.; Song, X.; Wei, Y.; Nie, L. Target-guided Composed Image Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia; ACM: New York, NY, USA, 2023; pp. 915–923.
Li, S.; He, C.; Liu, X.; Zhou, J.T.; Peng, X.; Hu, P. Learning with Noisy Triplet Correspondence for Composed Image Retrieval. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2025; pp. 19628–19637.
Chen, Z.; Hu, Y.; Li, Z.; Fu, Z.; Wen, H.; Guan, W. HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval. In Proceedings of the 33rd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2025; pp. 6143–6152. [Google Scholar]
Tian, L.; Zhao, J.; Hu, Z.; Yang, Z.; Li, H.; Jin, L.; Wang, Z.; Li, X. CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2025; pp. 3974–3983. [Google Scholar]
Chen, Z.; Hu, Y.; Li, Z.; Fu, Z.; Song, X.; Nie, L. OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval. In Proceedings of the 33rd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2025; pp. 6113–6122. [Google Scholar]
Wang, Y.; Huang, W.; Yuan, C. Aligning Composed Query with Image via Discriminative Perception from Negative Correspondences. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 8078–8086. [Google Scholar]
Li, Z.; Fu, Z.; Hu, Y.; Chen, Z.; Wen, H.; Nie, L. FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval. arXiv 2025, arXiv:2503.21309. [Google Scholar]
Fu, Z.; Li, Z.; Chen, Z.; Wang, C.; Song, X.; Hu, Y.; Nie, L. PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval. In ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
Huang, Q.; Chen, Z.; Li, Z.; Wang, C.; Song, X.; Hu, Y.; Nie, L. MEDIAN: Adaptive Intermediate-grained Aggregation Network for Composed Image Retrieval. In ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
Baldrati, A.; Bertini, M.; Uricchio, T.; Del Bimbo, A. Effective conditioned and Composed Image Retrieval combining clip-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21466–21474. [Google Scholar]
Yang, X.; Liu, D.; Zhang, H.; Luo, Y.; Wang, C.; Zhang, J. Decomposing Semantic Shifts for Composed Image Retrieval. In Proceedings of the AAAI, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6576–6584. [Google Scholar]
Vo, N.; Jiang, L.; Sun, C.; Murphy, K.; Li, L.; Fei-Fei, L.; Hays, J. Composing Text and Image for Image Retrieval—An Empirical Odyssey. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2019; pp. 6439–6448. [Google Scholar]
Chen, Y.; Gong, S.; Bazzani, L. Image Search With Text Feedback by Visiolinguistic Attention Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 2998–3008. [Google Scholar]
Liu, Z.; Opazo, C.R.; Teney, D.; Gould, S. Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 2105–2114. [Google Scholar]
Baldrati, A.; Bertini, M.; Uricchio, T.; Del Bimbo, A. Conditioned and Composed Image Retrieval combining and partially fine-tuning clip-based features. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2022; pp. 4959–4968. [Google Scholar]
Liu, Z.; Sun, W.; Hong, Y.; Teney, D.; Gould, S. Bi-Directional Training for Composed Image Retrieval via Text Prompt Learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 5753–5762. [Google Scholar]
Shen, F.; Tang, J. Imagpose: A unified conditional framework for pose-guided person generation. Adv. Neural Inf. Process. Syst. 2024, 37, 6246–6266. [Google Scholar]
Wan, Y.; Zou, G.; Zhang, B. Composed Image Retrieval: A survey on recent research and development. Appl. Intell. 2025, 55, 482. [Google Scholar] [CrossRef]
Jiang, X.; Wang, Y.; Li, M.; Wu, Y.; Hu, B.; Qian, X. Cala: Complementary association learning for augmenting comoposed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM: New York, NY, USA, 2024; pp. 2177–2187. [Google Scholar]
Rodolà, E.; Cosmo, L.; Litany, O.; Bronstein, M.M.; Bronstein, A.M.; Audebert, N.; Ben Hamza, A.; Boulch, A.; Castellani, U.; Do, M.N.; et al. SHREC’17: Deformable shape retrieval with missing parts. In Eurographics Workshop on 3D Object Retrieval; The Eurographics Association: Eindhoven, The Netherlands, 2017. [Google Scholar]
Chun, S.; Oh, S.J.; De Rezende, R.S.; Kalantidis, Y.; Larlus, D. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8415–8424. [Google Scholar]
Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning; PMLR: Breckenridge, CO, USA, 2023; pp. 19730–19742. [Google Scholar]
Chen, Y.; Zheng, Z.; Ji, W.; Qu, L.; Chua, T.S. Composed Image Retrieval with text feedback via multi-grained uncertainty regularization. arXiv 2024, arXiv:2211.07394. [Google Scholar] [CrossRef]
Imtiaz, A.; Pathirana, N.; Saheel, S.; Karunanayaka, K.; Trenado, C. A Review on the Influence of Deep Learning and Generative AI in the Fashion Industry. J. Future Artif. Intell. Technol. 2024, 1, 201–216. [Google Scholar] [CrossRef]
Wu, H.; Gao, Y.; Guo, X.; Al-Halah, Z.; Rennie, S.; Grauman, K.; Feris, R. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 11307–11317. [Google Scholar]
Guo, X.; Wu, H.; Cheng, Y.; Rennie, S.; Tesauro, G.; Feris, R.S. Dialog-based Interactive Image Retrieval. In Proceedings of the NeurIPS, Montreal, QC, Canada, 3–8 December 2018; MIT Press: Cambridge, MA, USA, 2018; pp. 676–686. [Google Scholar]
Delmas, G.; de Rezende, R.S.; Csurka, G.; Larlus, D. ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–12. [Google Scholar]
Ventura, L.; Yang, A.; Schmid, C.; Varol, G. CoVR: Learning composed video retrieval from web video captions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 5270–5279. [Google Scholar]
Ventura, L.; Yang, A.; Schmid, C.; Varol, G. CoVR-2: Automatic Data Construction for Composed Video Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11409–11421. [Google Scholar] [CrossRef] [PubMed]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]

Figure 1. Illustration of the CIR task and its inherent challenges. (a) The workflow of retrieving a target image using a multi-modal query. (b) Focus ambiguity causes the composed features to deviate from the real target distribution (distribution shift). (c) Feature entanglement leads to semantic erosion: applying a modification (e.g., changing the color to blue) incorrectly alters attributes that should be preserved (e.g., the dress structure), so the identity of the reference object is not maintained.

Figure 2. Overall framework of SOP, consisting of (a) Selective Focus Restoration, (b) Orthogonal Subspace Projection, and (c) Geometric Composition and Context Preservation.

Figure 3. Parameter sensitivity analysis of the proposed SOP framework. (a) Impact of the uncertainty weighting factor $λ_{1}$. (b) Impact of the cycle consistency weight $λ_{2}$. The blue solid lines represent R@10 Avg on FashionIQ (left y-axis), while the red dashed lines denote the average recall on CIRR (right y-axis).

Figure 4. Qualitative comparison between the proposed SOP and the baseline w/o OSP on FashionIQ (top row) and CIRR (bottom row). The target images are highlighted with red boxes.

Figure 5. Attention visualization of the decoupled feature subspaces. (a) and (b) present two examples of multimodal queries. The "Orthogonal Modification Increment" attention maps highlight regions relevant to the text-driven changes (e.g., the body of the locomotive for color modification or the area for adding a lemon), while the "Invariant Feature" maps focus on content that remains constant. This demonstrates the model’s ability to effectively decouple features into modification-related and identity-preserving subspaces.
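
The decoupling that Figure 5 visualizes can be illustrated with a single Gram–Schmidt step. The sketch below is not the authors' implementation: the function name `orthogonal_split`, the three-dimensional vectors, and the plain additive composition are assumptions chosen for illustration. It only shows how a modification vector can be split into a part parallel to a visual base and an orthogonal increment, so that adding the increment leaves the projection onto the base unchanged.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_split(base, delta, eps=1e-12):
    """One Gram-Schmidt step: split `delta` into a component parallel to
    `base` and a component orthogonal to it."""
    scale = dot(delta, base) / max(dot(base, base), eps)
    parallel = [scale * b for b in base]
    orthogonal = [d - p for d, p in zip(delta, parallel)]
    return parallel, orthogonal


# Hypothetical toy features (assumptions, not values from the paper).
visual_base = [1.0, 2.0, 2.0]    # stands in for the invariant visual feature
modification = [3.0, 0.0, 1.0]   # stands in for the text-driven update

parallel, increment = orthogonal_split(visual_base, modification)

# Compose the query from the base plus only the orthogonal increment.
composed = [b + o for b, o in zip(visual_base, increment)]

print("increment . base =", dot(increment, visual_base))
print("composed  . base =", dot(composed, visual_base))
print("base      . base =", dot(visual_base, visual_base))
```

Because the increment is orthogonal to the base, the composed vector has the same projection onto the base as the base itself, which is the geometric sense in which the reference content is preserved in this toy setting.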

Table 1. Summary of implementation details and hyper-parameters.

Configuration	Value/Setting
Framework	PyTorch 2.10.0/BLIP-2 Architecture
Visual Encoder	ViT-G/14 (Frozen) from EVA-CLIP
Text Encoder	BERT-base
Trainable Modules	Q-Former, Linear Projections (SFR, OSP, GCCP)
Optimizer	AdamW
Learning Rate	$1 \times 10^{- 5}$
Weight Decay	0.05
Batch Size	128
Epochs	30
Loss Weights	$λ_{1} = 0.5$ (SCR), $λ_{2} = 0.7$ (Ctx)
Temperature ( $τ$ )	0.07
Hardware	1 × NVIDIA A40 (48 GB)
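
For reproduction purposes, the settings in Table 1 could be collected in a single configuration object. The following is a minimal sketch; the class name `SOPTrainConfig` and its field names are hypothetical, and only the values are taken from the table.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SOPTrainConfig:
    """Hyper-parameters from Table 1; the field names are illustrative assumptions."""
    visual_encoder: str = "ViT-G/14 (EVA-CLIP, frozen)"
    text_encoder: str = "BERT-base"
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    weight_decay: float = 0.05
    batch_size: int = 128
    epochs: int = 30
    lambda_scr: float = 0.5   # lambda_1 in Table 1 (SCR)
    lambda_ctx: float = 0.7   # lambda_2 in Table 1 (Ctx)
    temperature: float = 0.07

    def __post_init__(self):
        # Basic sanity checks so that an invalid setting fails early.
        if self.learning_rate <= 0 or self.temperature <= 0:
            raise ValueError("learning_rate and temperature must be positive")
        if self.batch_size < 1 or self.epochs < 1:
            raise ValueError("batch_size and epochs must be at least 1")


config = SOPTrainConfig()
for name, value in asdict(config).items():
    print(f"{name}: {value}")
```

Freezing the dataclass keeps the reported defaults from being changed by accident during an experiment; a variant run would construct a new instance, for example `SOPTrainConfig(batch_size=64)`.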

Table 4. Ablation study of different components on FashionIQ, Shoes, and CIRR datasets. The best results are highlighted in bold.

D#	Models	FashionIQ R@10 (Avg)	FashionIQ R@50 (Avg)	Shoes (Avg)	CIRR (Avg)
(1)	w/o SFR	52.81	75.24	55.62	76.85
(2)	w/o Uncertainty_Weighting	54.84	77.17	56.85	76.26
(3)	w/o Structural Alignment	55.65	78.05	57.90	77.50
(4)	w/o OSP	54.12	76.45	56.20	76.50
(5)	w/o Orthogonality	54.20	76.55	56.40	77.95
(6)	w/o Adaptive_Gating	55.10	77.40	57.15	77.80
(7)	w/o Cycle	55.40	77.85	57.50	78.10
	SOP (Ours)	56.55	78.98	58.73	78.93

Table 5. Performance comparison with different backbones on FashionIQ (Avg R@10).

Backbone	Method	R@10 (Avg)	Gain (absolute)
CLIP (ViT-B/32)	CLIP	38.40	-
CLIP (ViT-B/32)	CLIP + SOP (Ours)	42.95	+4.55%
BLIP-2 (ViT-G/14)	BLIP-2	48.20	-
BLIP-2 (ViT-G/14)	BLIP-2 + SOP (Ours)	56.55	+8.35%
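
The Gain values in Table 5 appear to be absolute differences in R@10 rather than relative improvements. The short check below recomputes both quantities from the reported numbers; the helper name `gains` is hypothetical.

```python
def gains(baseline, improved):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


# R@10 (Avg) values as reported in Table 5.
clip_gain = gains(38.40, 42.95)
blip2_gain = gains(48.20, 56.55)

print("CLIP   (ViT-B/32): absolute, relative =", clip_gain)
print("BLIP-2 (ViT-G/14): absolute, relative =", blip2_gain)
```

The absolute gains match the table (+4.55 and +8.35 points), while the corresponding relative improvements over each baseline are roughly 11.8% and 17.3%.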

Table 6. Comparison of freezing strategies on FashionIQ (efficiency vs. accuracy). The best results are highlighted in bold.

Strategy	Trainable Params	GPU Memory	Training Time/Epoch	R@10 (Avg)	R@50 (Avg)
Full Fine-tuning	∼1.2B (100%)	42 GB	∼52 min	57.10	79.45
LoRA Tuning	∼35M (2.9%)	28 GB	∼18 min	55.90	78.20
SOP (Frozen; Ours)	∼18M (1.5%)	16 GB	∼10 min	56.55	78.98
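
The percentages and the efficiency trade-off in Table 6 can be recomputed from the approximate figures it reports. This is a back-of-the-envelope sketch: the dictionary layout and helper name are assumptions, and the inputs are the rounded values from the table.

```python
TOTAL_PARAMS = 1.2e9  # approximate full model size reported in Table 6 (~1.2B)

# (trainable parameters, GPU memory in GB, training minutes per epoch)
strategies = {
    "Full Fine-tuning": (1.2e9, 42, 52),
    "LoRA Tuning": (35e6, 28, 18),
    "SOP (Frozen)": (18e6, 16, 10),
}


def trainable_percent(trainable, total=TOTAL_PARAMS):
    """Share of parameters that are updated, in percent (one decimal)."""
    return round(100.0 * trainable / total, 1)


percents = {name: trainable_percent(vals[0]) for name, vals in strategies.items()}

full = strategies["Full Fine-tuning"]
sop = strategies["SOP (Frozen)"]
memory_ratio = full[1] / sop[1]   # how much more memory full fine-tuning uses
time_ratio = full[2] / sop[2]     # how much longer an epoch takes

print(percents)
print(f"memory ratio: {memory_ratio:.3f}, time ratio: {time_ratio:.1f}")
```

With these inputs, the frozen setting trains about 1.5% of the parameters and, relative to full fine-tuning, uses roughly 2.6 times less GPU memory and about 5.2 times less time per epoch, at a cost of 0.55 points of R@10 (57.10 vs. 56.55).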

Table 7. Robustness and inference efficiency on open-domain CIRR (test set).

Method	Backbone	R@1	R@5	Latency (ms/query)
Baseline	BLIP-2 (Frozen)	46.10	76.39	18
SOP (Ours)	BLIP-2 (Frozen)	50.72	80.81	20
Improvement	-	+4.62%	+4.42%	+2 (negligible)

Table 8. Sensitivity analysis of batch size on FashionIQ (average R@10/R@50).

Method	Batch Size 16	Batch Size 32	Batch Size 64	Batch Size 128 (Default)	Batch Size 256
Baseline (w/o SFR)	52.81/75.24	53.10/75.45	53.25/75.60	53.40/75.80	53.42/75.82
SOP (Ours)	54.95/77.50	55.82/78.15	56.20/78.65	56.55/78.98	56.52/78.95

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cheng, S.; Liu, G. SOP: Selective Orthogonal Projection for Composed Image Retrieval. Sensors 2026, 26, 1621. https://doi.org/10.3390/s26051621

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
