1. Introduction
Image retrieval (e.g., [1]) aims to retrieve from a database the images that best align with the user’s intent. In Composed Image Retrieval (CIR) [2], the user’s intent is expressed as a multi-modal query consisting of a reference image and a set of required modifications. Composed retrieval facilitates a variety of use cases that demand precise, intent-driven visual exploration. This is particularly evident in the e-commerce and fashion sectors [3,4], where shoppers often refine their searches with specific modifications, e.g., requesting a darker shade of a garment or a shorter hemline. This approach enables a more tailored and iterative shopping experience than traditional category-based searches. In addition to these sectors, CIR is highly effective in contexts where the core identity of a person or object must remain constant while specific visual traits are modified [5].
Before the emergence of natural-language-based CIR, image retrieval with attribute-based modifications represented a major direction for enabling flexible query refinement [6,7]. Current state-of-the-art CIR models leverage free-form natural language to describe required changes, offering an ideal mechanism to capture complex edits. Approaches to the CIR problem can be broadly categorized into supervised methods [2,8,9,10,11,12,13] and zero-shot learning methods [14,15,16]. The former focus on designing optimal fusion mechanisms for visual and textual information to align the composed query with the target image in a shared embedding space. For the latter, several studies have explored reasoning-based, training-free approaches [17,18], driven by advancements in pre-trained language models like BERT [19] and vision-language models like CLIP [20] and BLIP-2 [21].
In a composed image retrieval task, a user’s query imposes two distinct requirements on the retrieved image. First, the retrieved image should remain as similar as possible to the reference image, preserving its overall identity, category, or structure. Second, the retrieved image must satisfy the requirements expressed in the modification text, which describe how the target image should differ from the reference. While both components are critical, they are often not equally weighted from the user’s perspective. In practice, satisfying the modification text is frequently more critical than maximizing similarity to the reference image.
Figure 1a illustrates an example query from the FashionIQ dataset. The image on the left is the reference image. The user seeks a dress similar to this reference, but red in color and longer in length. The image on the right is the ground-truth target. Consider image
in
Figure 1b and image
in
Figure 1c as candidates retrieved according to this query. We observe that
better preserves the overall appearance and the style of the reference image. However, it fails to fully satisfy the requested modifications: although the color is correctly changed to red, the length is not increased. Image
, on the other hand, successfully reflects both the color modification and the appropriate length requested in the text query, despite being visually less similar to the reference than
. When the visual reference and textual modification are jointly optimized as equal components, image
may emerge as the nearest neighbour of the composed query in the latent embedding space, resulting in a high retrieval rank. However, if the requested changes are treated as mandatory constraints, image
is clearly a superior candidate.
Existing CIR approaches often lack an explicit mechanism to guarantee that a user’s modification requirements are fully satisfied. Because the visual reference and textual modifications are typically treated as equal inputs when fused into a shared embedding space, a gap emerges between retrieval similarity and edit faithfulness; high similarity in the embedding space does not necessarily imply satisfaction of the user’s stated constraints. This gap leads to retrieval results that appear close to the reference yet violate one or more requested modifications. For instance, image
in
Figure 1b is ranked first by Pic2Word [
16], a prominent zero-shot CIR model, despite the presence of the fully compliant image
in the gallery.
This asymmetric nature of user queries in CIR was recently addressed by Feng et al. [22]. By reframing the CIR task through the lens of visual question answering, the authors utilize a fine-tuned LLM to generate questions based on requested modifications, which are subsequently used to evaluate and rank the candidate images. While this approach handles absolute attribute values specified in users’ requirements, it fails to capture relative modifications defined with respect to the reference image, such as shorter, which inherently implies shorter than the reference image.
In this work, we propose a semi-automated framework for fashion retrieval that explicitly prioritizes the textual constraints within a query. At the core of our framework is a verifiable logical structure for specifying the user’s intent. A user’s text query is formally expressed as a logical conjunction of atomic constraints according to a domain-specific schema that defines the attributes and their value types.
We distinguish between two types of atomic constraints based on how they are evaluated:
Value Constraints: A constraint such as in red is associated with a value operator to specify requirements on concrete attribute values (e.g., a specific color or style). It can be verified against the candidate image alone, similar to the attribute-specific checks performed in VQA4CIR.
Relative Constraints: A constraint such as darker, longer, or more fitted is associated with a relative operator and must be verified jointly between the candidate image and the reference image.
By integrating both value and relative operators, our logical language significantly expands the expressive power of attribute-based constraints, capturing the nuanced relative changes typical of natural language queries.
We perform verification of the user’s intent during a post-processing stage applied to the top retrievals of an existing CIR model. This workflow preserves the expressive flexibility of the natural language query while enforcing structural correctness.
Our approach is semi-automated: a fashion-specific set of attributes and operators is derived from a statistical analysis of standard benchmarks (FashionIQ [23] and DeepFashion [24]), paired with an LLM-assisted domain abstraction and human-verified error checking. The resulting schema and structural language balance the expressive power of the constraints against the efficiency of the downstream extraction and verification processes.
We leverage prompt engineering to extract formal constraints from user queries according to the defined schema and structural language. The Large Language Model (LLM) GPT-4o-mini is adopted to parse the modification text and map its expressions to the corresponding constraint types, as well as the attributes and attribute values defined by the schema.
A plug-in verification layer is implemented as part of the post-processing pipeline to identify and re-rank candidates that violate the extracted constraints. In doing so, we improve adherence to mandatory requirements while retaining the semantic richness of the underlying retrieval model. The verification process is driven by the state-of-the-art Vision-Language Model (VLM) Qwen2.5-VL. We design specialized prompts to guide the VLM agent to verify constraints either within the candidate image alone or comparatively between the candidate and reference images.
Figure 1b illustrates the top four images ranked by Pic2Word [
16] according to the query in
Figure 1a. Although the modification text explicitly requests both a color change to red and an increase in garment length, all four top-ranked retrievals satisfy the color constraint but fail to reflect the requested change in length.
Figure 1c displays the top four images ranked by Pic2Word after applying our method. The baseline retrievals are re-ranked, ensuring that all top-four results fully satisfy the mandatory text modifications, successfully addressing the query asymmetry.
Through our post-processing pipeline, composed image retrieval is reformulated as a soft-constrained optimization problem, where candidates are ranked based on a combination of latent similarity and explicit adherence to specified visual transformations, rather than treated purely as nearest neighbours in an embedding space. Experimental results on fashion benchmarks show that our approach significantly enhances the recall performance of existing state-of-the-art methods.
In summary, the primary contributions of this work are as follows:
A Verifiable Structural Language for CIR: We introduce a structured language with a domain schema that formalizes user intent into a set of atomic constraints, effectively expanding the expressive capability of attribute-based queries to capture nuanced comparative text (e.g., shorter than reference).
An LLM-VLM Post-Processing Verification Pipeline: We propose a training-free, plug-in verification layer that leverages LLMs for constraint parsing and state-of-the-art VLMs for visual constraint verification, effectively bridging the gap between latent embedding similarity and explicit edit faithfulness.
State-of-the-Art Performance Improvements: We demonstrate through extensive evaluation on standard benchmarks that our re-ranking mechanism consistently boosts the retrieval accuracy and constraint adherence of both supervised and zero-shot CIR baselines.
LLM and VLM Evaluations: To clarify the contribution of each module and enhance the interpretability of our framework, we present detailed ablation studies of the LLM and VLM components.
The remainder of this paper is organized as follows.
Section 2 reviews relevant literature in composed image retrieval and visual constraint verification.
Section 3 introduces our schema and structural language and details the extraction and plug-in verification architecture. Experimental settings, evaluation metrics, and quantitative/qualitative results are presented in
Section 4. Finally,
Section 5 concludes the paper and outlines future research directions.
2. Related Work
In this section, we review prior literature in composed image retrieval and contextualize the proposed framework within the broader landscape of multi-modal search and attribute-guided verification.
2.1. Attribute-Based Image Retrieval
Before the emergence of natural-language-based composed image retrieval, image retrieval with attribute-based modifications represented a dominant paradigm for enabling flexible query refinement. Representative works along this line of research include AMNet by Zhao et al. [7] and ADDE by Hou et al. [6]. In this setting, retrieval is guided by a reference image alongside predefined, discrete attribute alterations. This formulation is primarily suited for domains governed by highly structured attribute vocabularies, such as fashion and face retrieval.
When the modification requirements are provided as free-form natural language rather than structured attribute labels, it is possible to extract the changes from the text and map the task back to an attribute-based CIR framework. Doing so during inference, however, inevitably sacrifices the semantic richness and open-vocabulary flexibility inherent in a user’s natural expression. Although our framework also extracts structured attribute variations from the modification text, it preserves the expressive richness of the user’s original intent: retrieval inference is conducted directly using the multi-modal query with its native natural-language text. Instead of driving the initial retrieval, the structured attribute changes in our work function as explicit constraints deployed during a post-retrieval stage to verify the satisfiability of the user’s intent.
2.2. Supervised CIR
Traditionally, CIR methods have relied on supervised learning paradigms trained on annotated datasets of reference–text–target triplets. The primary objective of these models is to optimize a composition function that merges distinct visual and textual features into a unified query representation. As noted by Baldrati et al. [
25], this composition vector is optimized via metric learning to minimize its distance from the target image in a shared embedding space, typically utilizing triplet or contrastive loss functions.
Figure 2 provides a conceptual overview of a typical supervised CIR pipeline. The reference image and the modification text within the composed query are processed by an image encoder and a text encoder, respectively. Their respective feature maps are combined via a fusion module (combiner) to produce the query embedding vector. During training, this vector is utilized to compute the alignment loss, whereas during inference, it serves as the search query against the image gallery.
Current state-of-the-art supervised methods frequently utilize pre-trained vision-language models as foundational feature extractors to leverage the robust cross-modal alignments established during large-scale pre-training. A notable example is CLIP4CIR [
9], which adapts the CLIP [
20] architecture to map visual and textual inputs into a shared latent space. In the present work, we employ CLIP4CIR as one of our backbone retrieval methods to evaluate the generalizability of our framework.
2.3. Zero-Shot CIR
The scalability of supervised CIR is inherently constrained by the significant data-acquisition bottlenecks and human labor required to annotate reference–text–target triplets. As noted by Gu et al. [
15], these data limitations frequently cause models to overfit to specific training categories, hindering their ability to generalize across broader visual and textual contexts.
To alleviate this constraint, zero-shot CIR leverages the zero-shot capabilities of extensive pre-trained vision-language models like CLIP. A pioneering development in this space is Pic2Word [
16], which maps visual features directly into the linguistic domain by optimizing a mapping module that translates a reference image into a pseudo-word. This virtual token is subsequently concatenated with the modification text for token-based retrieval.
While effective, translating visual concepts into a singular pseudo-token can omit critical fine-grained nuances. SEARLE [
14] improves upon this by introducing a regularization technique designed to mitigate token degradation. It leverages LLM-generated concepts to ensure that the pseudo-tokens remain linguistically coherent and representative of the intended visual semantics.
Figure 3 illustrates a conceptual view of a typical zero-shot CIR pipeline, highlighting how reference images and modification texts are composed differently compared to the supervised learning paradigm. To validate our framework across diverse zero-shot conditions, we integrate both Pic2Word and SEARLE into our evaluation suite.
2.4. LLMs and VLMs
Recent advancements in Multimodal Large Language Models (MLLMs) have paved the way for reasoning-centric, training-free CIR paradigms. An early strategy, proposed in CIReVL [
17], employs a two-stage pipeline where a VLM generates a detailed caption of the reference image, which an LLM subsequently merges with the user’s modification text to construct an expanded textual query. However, as observed by Tang et al. [
18], this sequential process often introduces information bottlenecks, where critical visual nuances are lost during the image-to-text translation. To mitigate this, Reason-Before-Retrieve [
18] introduces a single-stage Chain-of-Thought approach. In their architecture, an MLLM processes the reference image and modification text simultaneously, enabling the system to deduce user intent holistically before triggering a gallery search.
These reasoning-based CIR methods primarily focus on improving the quality of the composed query representation by generating reasoning cues, inferring target descriptions, or dynamically selecting visual features. Consequently, they integrate reasoning directly into representation learning or initial query construction. In contrast, our approach targets explicit constraint verification applied after the initial retrieval. The novelty of our work does not lie in the application of re-ranking itself, but in translating implicit natural-language modifications into explicit, verifiable, reference-dependent logical constraints to enforce edit satisfaction across arbitrary CIR backbones.
2.5. Constraint Consistency and Post-Retrieval Refinement
Composed user queries exhibit an inherent asymmetry where satisfying the textual modifications is often more critical to user satisfaction than maximizing visual similarity to the reference image. Despite recent progress across supervised, zero-shot, and reasoning-based CIR approaches, retrieved images often exhibit high global similarity to the composed query while simultaneously violating specific constraints expressed in the modification text. This occurs because existing methods address composition predominantly at the representation level, evaluating candidate images via a global similarity score that fails to explicitly enforce individual, attribute-level changes. Consequently, candidates that violate one or more requested edits can still achieve a high retrieval rank.
VQA4CIR [
22] explored post-retrieval refinement to mitigate this limitation via a re-ranking strategy rooted in visual question answering. Specifically, an LLM is fine-tuned to generate attribute-level questions from the modification text, and a VLM is trained to answer these questions for each retrieved candidate image. By explicitly verifying whether individual attribute requirements are fulfilled, VQA4CIR effectively prioritizes user intent during re-ranking.
However, this formulation evaluates each retrieved candidate image independently, neglecting the reference image during the verification process. Consequently, relative modifications such as longer, shorter, or darker cannot be evaluated, as their semantics depend entirely on a direct comparison with the reference image. Without incorporating the reference image into the loop, the verification stage lacks the contextual baseline required to assess comparative adjustments. This limitation is particularly pronounced in the fashion domain, where modifications are frequently expressed relative to the reference.
Motivated by this gap, our work introduces a multi-agent post-retrieval module that explicitly models the relationship between the reference image and each retrieved candidate, ensuring that relative, user-specified edits are accurately validated. Our framework differs fundamentally from VQA4CIR [
22] by representing the modification text as structured attribute–operator–value triplet constraints, explicitly supporting reference-dependent relational operators (e.g., longer, less patterned). In this manner, our method goes beyond assessing whether a candidate is broadly consistent with a caption; it systematically verifies whether specific extracted constraints are satisfied with respect to the reference image baseline.
Finally, our method builds upon the broader paradigms of multi-level feature fusion and multi-task learning, which have demonstrated significant foundational value across diverse specialized tasks such as multi-scale target detection [
26], identity recognition systems [
27], and dynamic energy-consumption modeling [
28].
3. Materials and Methods
Figure 4 illustrates the pipeline of the proposed framework. The query input comprises a reference image and a corresponding natural-language modification text. These components are initially processed by an existing baseline CIR model to generate a ranked list of candidate images from a gallery. Crucially, the user’s modification text is explicitly integrated within the baseline CIR model during initial retrieval, ensuring that we preserve the foundational semantic richness of the underlying multi-modal embedding space.
Highlighted in green in
Figure 4 is the structure of our proposed plug-in verification framework, which includes:
A domain-specific, schema-grounded structural language that codifies the fashion attribute ontology and edit structure;
An LLM-based parsing agent tasked with extracting a schema-grounded set of atomic constraints from raw modification texts; and
A VLM-based verification agent that extracts localized visual attribute values from both the reference image and retrieved candidates to perform explicit constraint checking.
By explicitly extracting, formalizing, and verifying user-specified requirements against candidate retrievals, our framework enables deterministic reasoning about whether an image satisfies the intended semantic modifications. This structural transformation from informal natural language into explicit logical constraints is vital because user queries are inherently conversational, ambiguous, and linguistically diverse. For instance, modification texts in standard CIR benchmarks frequently include expressions such as something like this but longer, less busy print, or more formal looking. While these phrases encode critical domain-specific requirements, they do not map directly to discrete attribute fields or relational mathematical operations. Interpreting these modifiers requires context-aware reasoning, comparative synthesis, and structural domain knowledge. We deploy an LLM agent as a structured mapping mechanism to bridge this abstraction gap.
3.1. Schema and Constraints
We conducted a corpus-level analysis of 60,166 modification texts spanning the training, validation, and test splits of the FashionIQ dataset [23]. Our empirical findings demonstrate that the vast majority of modification texts rely on interpretable linguistic structures that can be decomposed into a compact set of attribute-level requirements. This motivates us to formally model user intent as a conjunction of atomic constraints, with each constraint governed by an explicit relational operator.
The design of this constraint language prioritizes both expressive power and structural simplicity, as the latter is vital for optimizing the precision and efficiency of the downstream LLM parser and VLM verifier. To maintain tractability, logical disjunction (∨) is excluded from the primitive language layer. For an or-type expression (e.g., lighter or longer), the LLM parser decomposes the expression into two distinct relative atomic constraints. The retrieval pipeline subsequently prioritizes candidate images that fulfill both conditions, while systematically surfacing candidates that satisfy at least one, providing an operationally efficient proxy for disjunction without discarding relevant candidates.
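The proxy for disjunction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_for_disjunction`, the dictionary-based candidate records, and the numeric attribute encoding are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical candidate record: attribute name -> extracted value.
Candidate = Dict[str, object]
# A check returns True when a candidate satisfies one atomic constraint.
Check = Callable[[Candidate], bool]


def rank_for_disjunction(candidates: List[Candidate], checks: List[Check]) -> List[Candidate]:
    """Order candidates by how many of the decomposed constraints they satisfy.

    Candidates meeting every condition come first, then those meeting fewer.
    sorted() is stable, so the incoming (baseline) order is preserved
    within each group.
    """
    def satisfied(candidate: Candidate) -> int:
        return sum(1 for check in checks if check(candidate))

    return sorted(candidates, key=lambda c: -satisfied(c))


# Example for "lighter or longer", with assumed integer-coded attributes.
reference = {"brightness": 1, "length": 1}
checks = [
    lambda c: c["brightness"] > reference["brightness"],  # lighter
    lambda c: c["length"] > reference["length"],          # longer
]
```

Because the sort is stable, candidates that satisfy the same number of constraints keep the order assigned by the baseline CIR model.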
Both the operator set and the attribute vocabulary are grounded empirically in our corpus-level analysis. The operator set comprises standard inequality and equality relations: greater than (>), less than (<), equal to (=), and not equal to (≠). The attribute vocabulary encompasses prominent categories central to fashion retrieval, including color family, sleeve length, brightness, etc. The selection of these sets achieves a balance between the expressiveness of the underlying constraint logic and the downstream accuracy of its extraction and verification.
Our corpus analysis reveals that a dominant portion of FashionIQ captions rely on comparative or relational language (see
Figure 5). This finding underscores the necessity of supporting both value constraints and relative constraints within our framework. Value constraints enforce absolute targets on a specific attribute (e.g., in red), whereas relative constraints govern the comparative relationship between the attribute values of two distinct images (e.g., shorter in length).
For computational efficiency, we model continuous and high-cardinality attributes via ordinal semantic categories. Numerical concepts such as garment length, sleeve length, and pattern density are cast to ordinal attributes; the VLM maps both the reference and candidate images onto predefined, ordered category lists bounded by closed value ranges. For example, the sleeve length attribute maps to the ordered sequence: [“sleeveless”, “cap/short”, “three-quarter”, “long”]. Comparative modifiers like longer, shorter, increase, or decrease are then deterministically evaluated by assessing the relative index positions of the extracted attribute values within these ordered arrays. Linguistic intensifiers (e.g., much longer) are managed via specialized parsing and comparison thresholds that amplify the relative modification signal.
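A minimal sketch of the index-based comparison follows. The sleeve-length scale is taken from the text; the function name `compare_ordinal`, the ASCII operator spelling `!=`, and the `min_gap` parameter used to model intensifiers are assumptions made for illustration only.

```python
from typing import Sequence

# Ordered scale quoted in the text for the sleeve length attribute.
SLEEVE_LENGTH = ["sleeveless", "cap/short", "three-quarter", "long"]


def compare_ordinal(
    scale: Sequence[str],
    operator: str,
    candidate_value: str,
    reference_value: str,
    min_gap: int = 1,
) -> bool:
    """Evaluate a relative constraint by comparing positions on an ordered scale.

    min_gap is a hypothetical knob for intensifiers: "much longer" could
    require a gap of at least 2 positions instead of 1.
    """
    # list.index raises ValueError if a value is not on the scale.
    cand = list(scale).index(candidate_value)
    ref = list(scale).index(reference_value)
    if operator == ">":
        return cand - ref >= min_gap
    if operator == "<":
        return ref - cand >= min_gap
    if operator == "=":
        return cand == ref
    if operator == "!=":
        return cand != ref
    raise ValueError(f"unsupported operator: {operator}")
```

For instance, a candidate labeled "three-quarter" is longer than a reference labeled "cap/short" under the default gap, but would not count as much longer if that intensifier is mapped to `min_gap=2`.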
3.2. Constraint Extraction
Large Language Models have proven highly effective in assisting image retrieval tasks (e.g., [29]). We leverage an LLM to automatically parse logical constraints from raw modification texts. Given an input caption, the LLM executes a series of controlled reasoning steps conditioned on the predefined domain schema and structural language:
Segmentation: The model scans the full modification caption and segments it into distinct, semantically self-contained edit phrases. This handles multi-attribute queries where several independent constraints must be concurrently satisfied.
Attribute Grounding: For each parsed phrase, the LLM maps the edit to a canonical attribute defined in the schema ontology. If a phrase falls outside the schema’s domain, it remains unmapped rather than forcing an incorrect alignment, preventing hallucination and maintaining system consistency.
Operator Assignment: The model maps the phrase to its appropriate relational operator based on linguistic syntax and the attribute’s semantic type (nominal or ordinal). When a relative modifier is identified (e.g., longer), the target value is assigned the special token , indicating that evaluation requires direct comparison with the reference image.
Constraint extraction is executed via a two-stage LLM pipeline configured to balance high-throughput efficiency with parsing fidelity (
Figure 6). In the first stage, GPT-4o-mini is deployed across the entire corpus to perform schema-grounded constraint extraction. Models within the GPT family have demonstrated exceptional performance across contextual language understanding and structured reasoning tasks [
30]. To ensure structural reliability, we introduce a validation gate governed by a confidence threshold of
. Inputs that fall below this threshold or violate formatting checks are automatically routed to a second-stage verification layer powered by GPT-4. Rather than re-parsing the source text entirely, the second-stage verifier is instructed to repair and refine the initial structured outputs, preserving computational resources while resolving complex or ambiguous phrasing.
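The routing logic of the validation gate can be sketched as below. The two stages are passed in as plain callables so that the sketch stays independent of any particular LLM client; their signatures, the field names in `REQUIRED_KEYS`, and the way confidence is obtained are assumptions, and the threshold is left as a required argument rather than a guessed default.

```python
from typing import Callable, Dict, List, Tuple

Constraint = Dict[str, str]
# Hypothetical first-stage parser: returns (constraints, confidence in [0, 1]).
FirstStage = Callable[[str], Tuple[List[Constraint], float]]
# Hypothetical second-stage repairer: receives the text and the draft output.
SecondStage = Callable[[str, List[Constraint]], List[Constraint]]

# Assumed field names for the triplet representation.
REQUIRED_KEYS = {"attribute", "operator", "value"}


def is_well_formed(constraints: List[Constraint]) -> bool:
    """Formatting check: every constraint carries exactly the three triplet fields."""
    return all(set(c) == REQUIRED_KEYS for c in constraints)


def extract_with_gate(
    text: str,
    first_stage: FirstStage,
    second_stage: SecondStage,
    threshold: float,
) -> List[Constraint]:
    """Accept the first-stage draft only if it is confident and well formed."""
    draft, confidence = first_stage(text)
    if confidence >= threshold and is_well_formed(draft):
        return draft
    # Repair the existing draft instead of re-parsing the text from scratch.
    return second_stage(text, draft)
```

Passing the draft to the second stage mirrors the repair-and-refine behavior described above, so the costlier model only has to correct the first-stage output.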
The extraction pipeline outputs a set of structured atomic constraints formatted as triplets, where a represents the canonical attribute name, o denotes the relational operator, and v signifies the target value (or for reference-dependent operations). This structural transformation yields a deterministic output that reduces the inherent ambiguity of free-form text and interfaces seamlessly with downstream visual reasoning modules.
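One way to represent these triplets in code is sketched below. The sentinel string `<REF>`, the JSON field names, the attribute names in the example, and the ASCII spelling `!=` for the not-equal operator are placeholders chosen for illustration; they are not taken from the paper.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical sentinel marking a reference-dependent target value.
REF = "<REF>"
# The four relational operators, with "!=" standing in for the not-equal sign.
OPERATORS = {">", "<", "=", "!="}


@dataclass(frozen=True)
class Constraint:
    attribute: str  # canonical attribute name a
    operator: str   # relational operator o
    value: str      # target value v, or REF for relative constraints

    def __post_init__(self) -> None:
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")

    @property
    def is_relative(self) -> bool:
        """True when verification needs the reference image."""
        return self.value == REF


def parse_constraint_set(payload: str) -> List[Constraint]:
    """Turn a JSON list of triplet objects into a constraint set."""
    return [
        Constraint(item["attribute"], item["operator"], item["value"])
        for item in json.loads(payload)
    ]
```

Under these assumptions, a query such as "in red and longer" would yield one value constraint and one relative constraint, and only the latter requires attribute extraction from the reference image.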
Table 1 showcases representative modification phrases alongside their corresponding structured triplet representations generated by the LLM agent. When a query contains multiple independent edits, the agent aggregates the individual triplets into a unified constraint set.
In certain scenarios, the extracted constraint set may be empty. This typically stems from two operational conditions: (1) the modification text contains no explicit, schema-representable visual edits, or (2) the parsing agent fails to recognize a valid constraint. Rather than explicitly decoupling these cases, our pipeline adopts a conservative fallback strategy: when the constraint set is empty, the system bypasses the re-ranking layer and defaults to the original similarity-based ranking produced by the baseline CIR model.
3.3. Prompt Engineering and Schema Conditioning
To ensure that the parser generates text representations that are both semantically faithful and schema-adherent, the LLM is guided via schema-conditioned prompts. The prompt design is partitioned into five distinct logical blocks: role definition, task specification, rule encoding, few-shot examples, and output formatting constraints.
3.3.1. Role and Task Definition
The system prompt establishes a strict functional role, configuring the LLM as a specialized fashion constraint extraction agent. The task execution flow is explicitly bounded as follows:
Segment the text to isolate all attribute-level modifications.
Map each segmented modification to a valid schema attribute a.
Assign the appropriate relational operator o based on the semantic shift.
Isolate and normalize the target value v, substituting for comparative edits.
Output exclusively a structured JSON object representing the constraint set C.
3.3.2. Schema Conditioning
The active domain schema ontology is dynamically injected into the system prompt context. This structural conditioning strictly constrains the token space of the LLM, forcing it to choose exclusively from valid canonical attributes and admissible categorical values. This mechanism drastically curtails attribute hallucination and ensures syntactic alignment with downstream modules.
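Schema injection can be sketched as follows. The schema fragment, the prompt wording, and both function names are hypothetical stand-ins rather than the prompts used in this work. Because a prompt steers the model rather than enforcing its output, the sketch also includes a post-hoc filter; that filter is an addition for illustration and is not described in the text.

```python
import json
from typing import Dict, List

# Hypothetical schema fragment; the full ontology is not reproduced here.
SCHEMA = {
    "color_family": {"type": "nominal", "values": ["red", "blue", "black", "white"]},
    "sleeve_length": {
        "type": "ordinal",
        "values": ["sleeveless", "cap/short", "three-quarter", "long"],
    },
}


def build_system_prompt(schema: Dict[str, dict]) -> str:
    """Inject the schema into the system prompt as a JSON block."""
    schema_block = json.dumps(schema, indent=2)
    return (
        "You are a fashion constraint extraction agent.\n"
        "Use only the attributes and values listed in the schema below.\n"
        "Leave phrases that match no schema attribute unmapped.\n"
        "Return only a JSON list of {attribute, operator, value} objects.\n\n"
        f"Schema:\n{schema_block}"
    )


def keep_schema_valid(
    constraints: List[Dict[str, str]],
    schema: Dict[str, dict],
    ref_token: str = "<REF>",  # assumed sentinel for relative constraints
) -> List[Dict[str, str]]:
    """Drop parsed constraints whose attribute or value is outside the schema."""
    valid = []
    for c in constraints:
        spec = schema.get(c.get("attribute"))
        if spec is None:
            continue
        if c.get("value") == ref_token or c.get("value") in spec["values"]:
            valid.append(c)
    return valid
```

Reusing the same `SCHEMA` object when building the visual verification prompt would keep the parser and the verifier on one shared vocabulary.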
3.3.3. Few-Shot Examples
To implement the linguistic-to-operational mapping rules, the prompt includes multiple diversified few-shot examples illustrating edge cases and multi-constraint text inputs. Furthermore, we enforce a low decoding temperature () to maximize structural determinism and minimize generation variance.
3.4. VLM Agent for Post-Retrieval Attribute Extraction
The structured constraint set generated by the parser defines the precise subset of visual attributes that must be inspected and verified. By dynamically focusing the visual verification phase only on attributes explicitly invoked by the user’s query, our framework avoids the computational overhead and noise accumulation of exhaustive, full-attribute visual parsing.
Our framework embeds a VLM agent to perform target attribute extraction from both the reference image and the top candidate images. To ensure cross-module consistency, the VLM is conditioned using the exact same schema ontology. Because individual queries address distinct attributes, the visual verification prompt is constructed dynamically for each query. This dynamic target specification forces the VLM to attend only to the relevant visual regions, lowering prediction noise.
For baseline visual parsing, Qwen2.5-VL is used to generate candidate attribute labels restricted to the valid schema-defined options. Predictions that are flagged as missing, empty, invalid, or associated with high visual ambiguity are automatically routed to a second-stage verification layer powered by GPT-4, mirroring the cascading refinement strategy employed in our text parsing pipeline.
For each query instance, the VLM agent generates two distinct outputs: an attribute vector for each candidate image within the initial top-K pool, and a baseline attribute vector for the reference image, which provides the contextual anchor for relative comparisons.
Equipped with these visual extractions, the verification module can systematically identify and filter candidates that violate one or more atomic constraints. Images that pass this filtering step are retained and sorted according to their initial baseline retrieval order, yielding a verified top-K list. Subject to the bounded error rates of the text parser and visual extractor, every candidate in this verified list is guaranteed to comply with the user’s explicit textual modifications.
While binary filtering is effective when the candidate pool is densely populated with compliant images, this assumption does not always hold in sparse or highly constrained gallery spaces. To ensure robust retrieval under data scarcity, candidates that violate constraints are de-ranked rather than discarded. The VLM verifier counts the atomic constraint violations for each candidate image, and this count determines the severity of the candidate’s positional penalty.
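The two verification modes discussed above, hard filtering and violation counting, can be illustrated with a short sketch. It assumes constraints are `(attribute, operator, value)` triplets over categorical labels and supports only equality and inequality; relative operators that compare against the reference image are omitted. The helper names are hypothetical.

```python
import operator
from typing import Dict, List, Tuple

Constraint = Tuple[str, str, str]  # (attribute, operator, value)

# Only categorical (in)equality is covered in this sketch.
OPS = {"==": operator.eq, "!=": operator.ne}


def count_violations(attrs: Dict[str, str], constraints: List[Constraint]) -> int:
    """Number of atomic constraints that the extracted attributes violate."""
    # A missing attribute yields None, which fails "==" and passes "!=".
    return sum(1 for a, op, v in constraints if not OPS[op](attrs.get(a), v))


def filter_candidates(ranked_ids: List[str],
                      attrs_by_id: Dict[str, Dict[str, str]],
                      constraints: List[Constraint]) -> List[str]:
    """Keep fully compliant candidates, preserving the baseline order."""
    return [cid for cid in ranked_ids
            if count_violations(attrs_by_id[cid], constraints) == 0]
```

In a sparse gallery `filter_candidates` may return very few images, which is the situation where the violation count is better used as a penalty than as a hard gate.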
We formally frame composed image retrieval with explicit constraint verification as a soft-constrained optimization problem. Given a multi-modal query $q = (I_r, t)$, where $I_r$ is the reference image and $t$ is the modification text, the text parser extracts a set of structured constraints $C = \{c_1, \dots, c_{|C|}\}$. Each individual constraint is defined as a triplet $c_j = (a_j, o_j, v_j)$, where $a_j$ denotes the target attribute, $o_j$ is the relational operator, and $v_j$ is the target value. The underlying baseline CIR model provides a continuous semantic similarity score $s(q, I_c)$ between the composed query $q$ and an arbitrary candidate image $I_c$. For each extracted constraint, we define a soft satisfaction function $\phi(c_j, I_c) \in [0, 1]$, where 1 denotes complete fulfillment and 0 denotes an explicit constraint violation. The aggregate constraint adherence score is computed as the mean satisfaction across the complete constraint set, $\Phi(I_c) = \frac{1}{|C|} \sum_{j=1}^{|C|} \phi(c_j, I_c)$. Finally, we apply a soft relaxation that linearly integrates semantic similarity and explicit constraint satisfaction:
$$S(q, I_c) = s(q, I_c) + \lambda \, \Phi(I_c)$$
where $\lambda$ is a hyperparameter controlling the regularization weight assigned to constraint compliance. All candidates within the baseline top-$K$ retrieval pool are subsequently re-ranked in descending order according to their joint optimization score $S(q, I_c)$.
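The re-ranking rule can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: the joint score is taken to be the baseline similarity plus a weight `lam` times the mean constraint satisfaction, and the names `rerank` and `satisfaction` are hypothetical. The satisfaction function is supplied by the caller and is expected to return a value in [0, 1].

```python
from typing import Callable, List, Sequence, Tuple

Constraint = Tuple[str, str, str]  # (attribute, operator, value)


def rerank(candidates: Sequence[Tuple[str, float]],
           constraints: Sequence[Constraint],
           satisfaction: Callable[[Constraint, str], float],
           lam: float) -> List[Tuple[str, float]]:
    """Re-rank (candidate_id, similarity) pairs by a joint score.

    Assumed form: joint = similarity + lam * mean satisfaction.
    """
    scored = []
    for cid, sim in candidates:
        if constraints:
            adherence = sum(satisfaction(c, cid) for c in constraints) / len(constraints)
        else:
            # No parsed constraints: fall back to the baseline similarity.
            adherence = 0.0
        scored.append((cid, sim + lam * adherence))
    # sorted() is stable, so ties keep the baseline retrieval order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With `lam = 0` the baseline order is returned unchanged; larger values let a compliant candidate overtake a more similar but non-compliant one.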
4. Results
In this section, we present comprehensive experimental results and quantitative/qualitative analyses demonstrating the effectiveness of the proposed post-processing verification framework across both supervised and zero-shot CIR backbones.
4.1. Dataset Refinement
In the standard FashionIQ dataset [
23], modification texts are written by human annotators to describe the fine-grained visual differences between a reference image and a target image. While this design is intended to capture real-world user behavior, the resulting natural-language descriptions frequently exhibit semantic ambiguity, underspecified instructions, lexical repetitions, and occasional typographical or factual errors. In more severe cases, the text directly contradicts the explicit visual attributes of the annotated target image, introducing substantial label noise into the evaluation pipeline.
Several recent studies have attempted to mitigate this bottleneck by constructing alternative datasets where modification texts are synthesized or automatically reformulated using large language models (e.g., FaCap [
11]). Although these approaches successfully maximize linguistic consistency, they decouple evaluation from the human-written queries encountered in practical deployment scenarios. Because the primary objective of this work is to accurately model and verify user intent as expressed in natural language, we deliberately retain the original human-written modification texts provided by FashionIQ. Rather than fully regenerating captions, we execute a targeted, conservative refinement restricted solely to modification texts that are demonstrably incorrect or explicitly inconsistent with their annotated target images.
Our refinement process is intentionally conservative: no extraneous semantics are introduced, and adjustments are confined to rectifying clear, verifiable errors. Specifically, text corrections are applied only when the described attributes fail to align with the visual content of the ground-truth target.
Figure 7 illustrates a representative instance requiring dataset correction, where the target garment is visually gray, yet the original modification text incorrectly designates it as orange.
Because our proposed plug-in pipeline operates entirely as a training-free framework, only the validation splits are utilized for evaluation. Consequently, the training splits are excluded from the refinement process; corrections are applied exclusively to the validation sets of the Dress, Shirt, and Top-Tee categories. In doing so, minor corrections were performed on 58.67% of dress samples, 32.95% of shirt samples, and 19.73% of top-tee samples within the validation set.
4.2. Recall Improvement
Table 2 presents a comprehensive performance evaluation of our post-processing pipeline across the three foundational FashionIQ categories. We evaluate our framework using two prominent zero-shot baselines (Pic2Word [16] and SEARLE [14]) alongside a highly competitive supervised CIR backbone (CLIP4CIR [9]), quantifying performance shifts via Recall@10 (R@10) and Recall@50 (R@50).
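For reference, the two quantities used throughout this section can be computed as in the sketch below. It assumes one annotated target per query, as in FashionIQ; the function names are hypothetical, and the sketch does not specify how per-category gains are averaged.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]],
                targets: Dict[str, str], k: int) -> float:
    """Percentage of queries whose annotated target appears in the top k."""
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return 100.0 * hits / len(rankings)


def relative_gain(before: float, after: float) -> float:
    """Relative change in percent, e.g. 20.0 -> 25.0 gives 25.0."""
    return 100.0 * (after - before) / before
```

A relative gain is therefore larger for the same absolute improvement when the starting recall is lower.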
To establish an accurate baseline, all underlying retrieval backbones were re-evaluated on the refined validation splits. The rows designated as P-Base, S-Base, and C-Base denote baseline performance on the original, unrefined validation splits. Conversely, P-Fine, S-Fine, and C-Fine show baseline model metrics on the refined validation set. As illustrated in
Table 2, all models exhibit noticeable performance gains upon dataset refinement alone. For Pic2Word, evaluating on the refined split yields an average relative increase of 25.1% in R@10 and 19.6% in R@50. For SEARLE, refinement leads to an average relative boost of 11.5% in R@10 and 8.5% in R@50. Similarly, CLIP4CIR demonstrates an average increase of 25.8% in R@10 and 16.9% in R@50. These initial shifts confirm that dataset annotation errors artificially suppress measured performance metrics, validating the critical role of our preprocessing curation step.
The rows labeled P-Ours, S-Ours, and C-Ours display the results of integrating our constraint-verification pipeline into Pic2Word, SEARLE, and CLIP4CIR, respectively, evaluated directly on the refined validation split. Compared with the unmodified baselines on the same refined split, our post-processing framework yields consistent improvements across all categories and evaluation metrics. Specifically, for Pic2Word, our method achieves an average relative improvement of 53.6% in R@10 and 14.2% in R@50. For SEARLE, our approach provides an average relative improvement of 30.9% in R@10 and 12.4% in R@50. For the supervised CLIP4CIR backbone, our framework yields an average relative increase of 4.7% in R@10 and 3.6% in R@50.
Analyzing the granular metrics reveals that the performance gains are significantly more pronounced at Recall@10 than at Recall@50. This trend arises because baseline Recall@10 values are inherently lower and more sensitive to ranking correctness, making the localized re-ranking adjustments exerted by our constraint verifier highly visible.
Furthermore, the empirical results show that the relative performance improvement is substantially more pronounced when applied to zero-shot backbones. Because zero-shot models are not fine-tuned on target fashion triplets, they rely entirely on general visual-language representations acquired during broad pre-training. While this yields exceptional open-domain generalization, it frequently leads to sub-optimal initial recall within narrow target vocabularies. Under these conditions, our post-processing framework effectively optimizes the internal ranking order within the initial candidate pool, translating semantic filters into large retrieval performance gains. In contrast, supervised baselines like CLIP4CIR are optimized directly on domain-specific triplets, yielding highly accurate initial embeddings. Consequently, the post-processing improvement is naturally smaller. Nevertheless, our framework provides consistent performance gains even for the strongest supervised baseline, demonstrating its capacity to refine candidate ordering beyond the limits of standard latent space alignment.
Figure 8 tracks the average performance of the models across three distinct experimental settings: (1) the baseline results originally reported in the literature [9,14,16]; (2) the baseline performance measured on the refined validation set; and (3) the final performance achieved after deploying our post-processing framework. The metrics are reported as Recall@10 and Recall@50 averaged across all three fashion categories, illustrating the compounding benefits of dataset curation and explicit constraint enforcement.
4.3. Impact of Regularization Weight
We performed a comprehensive hyperparameter sensitivity analysis across multiple values of the constraint regularization weight $\lambda$. Conceptually, lower values of $\lambda$ favor the initial latent similarity scores computed by the baseline retrieval model, maximizing baseline recall retention. Conversely, higher values of $\lambda$ aggressively prioritize explicit constraint fulfillment by penalizing candidates that violate parsed edits.
Table 3 details the trade-offs governed by this parameter. As $\lambda$ increases from 0.1 to 0.5, the Normalized Violated Constraints@10 decreases across all models, confirming an increase in explicit edit faithfulness. This improvement is accompanied by a minor, controlled reduction in raw Recall@10. However, scaling $\lambda$ further to 0.9 yields diminishing returns regarding constraint satisfaction; the normalized violation metrics plateau, remaining nearly unchanged relative to the 0.5 setting.
In our primary evaluation suite, we set $\lambda$ to heavily emphasize the constraint-verification objective of our framework. Because our fundamental objective is to evaluate whether explicit post-retrieval verification can robustly enforce user-specified modifications, this conservative parameterization prioritizes structural edit compliance over unverified latent space similarity.
4.4. Candidate Pool Size Analysis
The post-processing re-ranking layer operates directly on the top-N candidates returned by the backbone model. A larger candidate pool size N expands the search space but introduces a higher volume of images into the visual verification stage. To determine an optimal operational threshold for N, we analyze its downstream influence on the Recall@10 metric across all three baseline models.
Figure 9 shows the result of this evaluation across the validation categories. As N expands initially, more candidate images are captured within the post-processing window, making a larger number of compliant images available for constraint verification. This behavior is clearly visible in the sharp upward trajectories observed at lower values of N, with performance gains peaking and plateauing between
and
.
Crucially, the performance curves are not strictly monotonic and exhibit occasional localized drops at higher pool sizes. Our structural auditing confirms that these localized drops are inherently dataset-driven. In fashion retrieval, although the ground-truth benchmark contains a single annotated target image, that target is rarely the sole valid response within a large gallery; numerous alternative images may equally fulfill the requested text modification. As the pool size N broadens, the re-ranking layer frequently discovers alternative images that satisfy the extracted logical constraints more comprehensively than the designated ground truth. When these fully compliant alternative candidates are promoted into the top ranks, the unique labeled target is displaced downward, causing a minor drop in the strict benchmark recall metric despite maintaining high semantic accuracy.
4.5. Ablation Analysis: Component-Level Performance
To rigorously verify the accuracy of the constraint extraction module and account for potential parsing errors, we constructed an empirical evaluation set by randomly sampling 200 modification texts and manually annotating their ground-truth logical triplets. We report performance at two granularities in Table 4: attribute-level agreement, which tests whether the agent identified the correct fashion attribute regardless of operator or target value, and constraint-level agreement, which demands an exact match across the entire (attribute, operator, value) triplet. With precision, recall, and F1-scores all exceeding 95%, the parser proves reliable on this sample, indicating that residual parsing errors are rare in the extraction task.
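The difference between the two granularities can be made concrete with a small sketch. It scores a single modification text by comparing predicted and gold triplets as sets; how scores are pooled over the 200 sampled texts is not specified here, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (attribute, operator, value)


def prf(pred: Set, gold: Set) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def agreement(pred: Iterable[Triplet],
              gold: Iterable[Triplet]) -> Dict[str, Tuple[float, float, float]]:
    """Score one text at attribute level and at full-triplet level."""
    pred, gold = list(pred), list(gold)
    return {
        # Attribute level: only the attribute name must match.
        "attribute": prf({t[0] for t in pred}, {t[0] for t in gold}),
        # Constraint level: the whole triplet must match exactly.
        "constraint": prf(set(pred), set(gold)),
    }
```

A parser that finds the right attribute but the wrong value therefore scores perfectly at the attribute level while losing credit at the constraint level.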
The empirical efficacy of our re-ranking mechanism depends directly on the reliability of the visual attribute extractor. We evaluated the VLM’s annotation accuracy against a manually verified gold standard across 300 distinct images, covering a total of 1303 individual attribute decisions. The VLM agent achieved an exact-match micro-accuracy of 97.39% (matching the gold labels in 1269 cases) and a macro-accuracy of 97.42% when averaging performance independently across attribute classes. The per-attribute performance breakdown is provided in Table 5.
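The distinction between the micro- and macro-averaged figures can be sketched as follows. The input format, a list of `(attribute, predicted, gold)` decisions, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def micro_macro_accuracy(decisions: Iterable[Tuple[str, str, str]]) -> Tuple[float, float]:
    """Return (micro, macro) exact-match accuracy over attribute decisions."""
    per_attr = defaultdict(lambda: [0, 0])  # attribute -> [correct, total]
    for attr, pred, gold in decisions:
        per_attr[attr][0] += int(pred == gold)
        per_attr[attr][1] += 1
    correct = sum(c for c, _ in per_attr.values())
    total = sum(n for _, n in per_attr.values())
    micro = correct / total                       # every decision weighs equally
    macro = sum(c / n for c, n in per_attr.values()) / len(per_attr)  # every attribute weighs equally
    return micro, macro


if __name__ == "__main__":
    # Micro-accuracy from the counts reported in the text.
    print(round(100 * 1269 / 1303, 2))
```

Micro-accuracy is dominated by frequently queried attributes, whereas macro-accuracy exposes a weak attribute even when it is rarely evaluated.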
To inspect structural failures, we visualize the confusion matrices for the two most visually complex attributes: neckline and sleeves (Figure 10). In both matrices, rows denote human-verified gold labels and columns represent VLM predictions; diagonal entries indicate correct matches. The neckline confusion matrix reveals that the sparse off-diagonal errors are heavily concentrated among visually continuous categories, such as distinguishing v-neck from plunge, or separating scoop necklines from adjacent rounded styles. Similarly, the sleeves confusion matrix shows strong overall alignment, with minor confusions limited to boundary conditions like three-quarter versus long sleeves, or sleeveless versus cap variations.
4.6. Qualitative Error Analysis
While successful re-ranking behavior was established in the Introduction, we analyze two distinct failure modes here to highlight operational boundary conditions. Figure 11 and Figure 12 present a fine-grained feature failure case and an inherent textual contradiction case, respectively.
In the failure case illustrated in Figure 11, the target garment has buttons along the front placket. However, the visual contrast of these buttons is extremely low, rendering them highly ambiguous at standard image resolutions. Consequently, the VLM extraction agent classifies the embellishment attribute as none, mistakenly identifying a constraint violation. As a result, the valid target image is penalized and fails to surface within the top-5 results. Concurrently, a false-positive candidate that is not sleeveless but exhibits high initial cosine similarity is incorrectly promoted.
Figure 12 showcases a failure mode triggered by contradictory relational cues within the human-written modification text, which simultaneously demands a garment that is both shorter and longer. This conflicting language cannot be reconciled with the target image. During the text parsing phase, the LLM gives priority to the explicit comparative term shorter, generating logical triplets that favor a short garment, whereas the annotated target image shows a maxi-length garment.
4.7. Computational Complexity and Algorithmic Cost
The proposed post-processing pipeline consists of three sequential computational stages: (1) text-to-constraint parsing via the LLM; (2) visual attribute extraction across the top candidate pool via the VLM; and (3) deterministic constraint scoring and re-ranking. Let N represent the total number of evaluation queries (not the candidate pool size of Section 4.4), K denote the candidate pool size selected for re-ranking, and M signify the average number of extracted atomic constraints per query.
The initial text parsing stage requires exactly one LLM call per input query, bounding the text overhead to $O(N)$ API invocations. The visual attribute extraction stage must process the candidate pool considered during re-ranking, scaling to $O(NK)$ VLM evaluations in an uncached environment. The final scoring phase evaluates M extracted constraints against K active candidates, scaling to $O(MK)$ operations, while the sorting mechanism requires $O(K \log K)$, both per query. In practical deployment settings, M remains small because human queries rarely specify more than a few attribute modifications simultaneously, and K is set to 100 candidates. Consequently, the computational overhead is heavily dominated by the visual VLM extraction phase, whereas the downstream logical scoring and sorting operations require negligible resources compared to standard model inference.
Regarding monetary API costs, GPT-4o-mini is restricted exclusively to text-based constraint extraction. This requires a single invocation per query caption and is completely independent of the gallery or candidate pool size. Based on standard commercial pricing for GPT-4o-mini ($0.15 per 1M input tokens and $0.60 per 1M output tokens), and assuming a deliberately generous estimate of 3000 input tokens and 150 output tokens per caption across the 5966 validation samples, the total text extraction cost is:
$$5966 \times \frac{3000 \times 0.15 + 150 \times 0.60}{10^{6}} \approx 3.22 \ \text{USD}$$
This confirms that the text extraction phase is highly cost-effective, scaling linearly with the number of unique queries rather than candidate image variations.
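The cost estimate can be reproduced with a few lines of arithmetic. The sketch assumes the quoted prices are per one million tokens; the function name and parameter names are hypothetical.

```python
def extraction_cost_usd(n_queries: int, in_tokens: int, out_tokens: int,
                        in_price_per_m: float = 0.15,
                        out_price_per_m: float = 0.60) -> float:
    """Total cost in USD for one LLM call per query.

    Prices are assumed to be quoted per one million tokens.
    """
    per_query = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return n_queries * per_query


if __name__ == "__main__":
    # 5966 captions, 3000 input and 150 output tokens each.
    print(f"{extraction_cost_usd(5966, 3000, 150):.2f} USD")
```

Under these assumptions the per-query cost is about 0.054 cents, so the total grows linearly with the number of queries and is unaffected by K.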
The visual processing step deploys Qwen2.5-VL locally through an optimized execution framework, introducing no per-call commercial API charges. Its primary operational footprint is computational rather than monetary, requiring $O(NK)$ local visual evaluations in the worst-case uncached scenario. GPT-4 is selectively deployed as a high-fidelity verifier rather than a primary extractor. Specifically, it is invoked only when the local VLM outputs a structural formatting error, an empty field, or an ambiguous prediction. Because GPT-4 is restricted to resolving these low-confidence boundary conditions, its API cost scales with label uncertainty rather than with the full set of candidate images and attributes.
5. Discussion
Rather than altering the internal architecture or loss formulations of existing CIR models, this work introduces a plug-in, post-processing verification framework that seamlessly interfaces with the outputs of diverse baseline retrieval backbones. By explicitly verifying and refining the ranking order of candidate images based on logical compliance, our framework systematically improves downstream retrieval accuracy. The core mechanism hinges on leveraging structured, attribute-based constraints extracted directly from user modification texts and verifying these logical conditions against the top retrieved candidate pool.
A primary operational limitation of the current framework is its inherent inability to recover relevant target images that are completely excluded from the initial baseline candidate pool. If a backbone retrieval model fails to surface the true target image within its top-N gallery selections, no downstream post-processing or re-ranking mechanism can restore it. Empirically, however, this candidate exclusion issue proves to be minor because modern CIR backbones typically capture the ground-truth targets within their broader retrieval windows. This behavior is quantitatively verified by the pool-size analysis illustrated in Figure 9. Across all three evaluated retrieval backbones, Recall@10 improves significantly as the candidate pool expands from N = 10 to larger sets, confirming that relevant target images are frequently present just beyond the initial top-10 ranks and can be successfully promoted via constraint-aware re-ranking.
Our current empirical evaluations are intentionally focused on the fashion domain. The FashionIQ benchmark [
23] was selected because it represents the standard, widely accepted benchmark specifically engineered for CIR, providing explicitly curated reference-text-target triplets. While evaluating our framework on alternative large-scale fashion repositories such as DeepFashion [
24] would be highly valuable to verify broader domain consistency, DeepFashion is not naturally structured to support composed image retrieval tasks. Because it primarily offers fashion images with isolated attribute and category annotations rather than directional, text-guided retrieval triplets, adapting it for CIR benchmarks would require extensive preprocessing and triplet construction.
Integrating semantic constraints directly into initial multi-modal retrieval representations remains a compelling direction for future work. We plan to investigate constraint-aware training strategies and supervision mechanisms that insert structural verification directly into the embedding space. This line of inquiry can draw valuable inspiration from recent domain-specific neural architectures and artificial immune-based optimization algorithms, which explicitly integrate internal structural constraints or adaptive architectural blocks to regularize representation learning. Such paradigms have successfully addressed challenging visual and sensory tasks, including automated surface defect parsing [
31], neural-immune pipeline anomaly diagnosis [
32], coordinate-driven positioning metrics [
33], and fine-grained blurry image classification [
34].
Furthermore, we aim to explore more advanced multi-modal reasoning architectures to drive the attribute extraction and verification phases. While our current framework successfully deploys state-of-the-art vision-language models to verify whether candidate images satisfy parsed atomic constraints, the rapid evolution of Multimodal Large Language Models (MLLMs) opens up deeper avenues. Incorporating more sophisticated, high-capacity reasoning mechanisms could substantially enhance the system’s contextual understanding of complex modifications, while simultaneously boosting the precision of fine-grained, cross-modal attribute validation.
Finally, extending the proposed framework beyond the fashion domain represents a vital horizon for future research. While the FashionIQ dataset provides a structured environment with well-defined attributes, many real-world applications involve more diverse and complex visual concepts. Applying the proposed approach to other domains could therefore provide valuable insights into its generalizability and reveal new challenges for compositional image retrieval.