InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation
Abstract
1. Introduction
- Instruction-aware dynamic query generation: We propose a novel mechanism that adaptively determines the number and semantic composition of Q-Former queries based on the complexity of the input instruction.
- Integration of LLM reasoning: The framework leverages the semantic reasoning capabilities of LLMs to enhance instruction interpretation and context-aware visual–textual alignment.
- Iterative refinement via user feedback: We introduce a user-in-the-loop query refinement process that enables the model to progressively adapt to implicit retrieval intentions through multiple rounds of interaction.
- Lightweight and generalizable design: Our architecture maintains compatibility with BLIP2-based retrieval pipelines, introducing minimal computational overhead while demonstrating strong generalization across diverse retrieval tasks and datasets.
2. Related Work
2.1. Multimodal Retrieval with Transformers
2.2. Instruction-Aware and Feedback-Driven Vision–Language Models
2.3. Semantic Reasoning with Large Language Models
3. Methodology
3.1. Fixed-Length Query Design in BLIP2
- For simple instructions, excess query tokens introduce representational redundancy.
- For complex instructions, the 32-token cap constrains the model's capacity to capture the fine-grained visual details relevant to user intent (see the sketch after this list).
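As a schematic illustration (not BLIP2's actual code; the hidden size and variable names are illustrative assumptions), the fixed-length design amounts to expanding the same bank of 32 learnable query tokens for every input, regardless of instruction complexity:

```python
import torch
import torch.nn as nn

hidden_dim = 768  # illustrative Q-Former hidden size

# BLIP2-style fixed query bank: 32 learnable tokens shared by all inputs.
query_tokens = nn.Parameter(torch.zeros(1, 32, hidden_dim))

batch_size = 4
queries = query_tokens.expand(batch_size, -1, -1)  # (B, 32, d): always 32 queries
```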
3.2. Instruction-Adaptive Token Scaling via QScaler
Algorithm 1: Overall Pipeline of BLIP2-IDQ-Former
Input: Instruction I, image V, base query length L_base, maximum query length L_max
Output: Task-specific output O
Step 1: Instruction complexity estimation
- Tokenize I and encode it with the frozen language encoder to obtain a contextual embedding;
- Predict the normalized complexity score r from this embedding using the learnable QScaler module;
- During training, apply Gaussian noise injection to r for robustness;
Step 2: Dynamic query allocation
- Compute the query length L from r, bounded between L_base and L_max;
- Select the first L learnable tokens from the token bank;
- Retrieve the first L positional encodings from the pretrained positional-encoding matrix;
- Add the positional encodings to the selected tokens to form the query sequence;
Step 3: Cross-modal encoding
- Extract visual embeddings of V from the frozen image encoder;
- Fuse the query sequence with the visual embeddings via Q-Former cross-attention;
- Project the multimodal features into the LLM input space via the projection head;
Step 4: Output generation
- Feed the projected features into the frozen LLM;
- Obtain the task-specific output O;
return O
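To make Steps 1 and 2 of Algorithm 1 concrete, the following is a minimal PyTorch sketch assuming a pooled instruction embedding as input. The module layout, hidden sizes, noise level, and the linear interpolation rule used to turn r into a query length are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class QScaler(nn.Module):
    """Sketch of the QScaler head: maps a pooled instruction embedding to a
    normalized complexity score r in [0, 1] (hypothetical architecture)."""
    def __init__(self, text_dim=768, hidden_dim=256, noise_std=0.05):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )
        self.noise_std = noise_std

    def forward(self, instr_emb):                              # instr_emb: (B, text_dim)
        r = torch.sigmoid(self.head(instr_emb)).squeeze(-1)    # (B,) in [0, 1]
        if self.training:                                      # Step 1: Gaussian noise injection
            r = (r + self.noise_std * torch.randn_like(r)).clamp(0.0, 1.0)
        return r

def allocate_queries(query_bank, pos_enc, r, l_base=8, l_max=32):
    """Step 2 sketch: pick the first L tokens and positional encodings, with L
    interpolated between l_base and l_max by the mean complexity score."""
    L = int(round(l_base + float(r.mean()) * (l_max - l_base)))
    return query_bank[:L] + pos_enc[:L]                        # (L, d) query sequence
```

In this sketch a single query length is chosen per batch for simplicity; a per-sample allocation rule would follow the same pattern.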
3.3. Instruction-Conditioned Cross-Attention
3.4. Training Strategy
- The Q-Former transformer layers responsible for cross-modal feature fusion;
- The learnable query token matrix, which provides the dynamically selected token subsets;
- The QScaler regression head, which predicts the instruction-conditioned complexity scores.
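The three trainable components listed above sit alongside frozen backbones. A minimal sketch of this parameter partitioning, assuming PyTorch modules named vision_encoder, llm, qformer, query_tokens, and qscaler (the names are illustrative), is:

```python
def configure_trainable_parameters(vision_encoder, llm, qformer, query_tokens, qscaler):
    """Freeze the heavy pretrained backbones; train only the lightweight components."""
    for p in vision_encoder.parameters():
        p.requires_grad = False        # frozen image encoder
    for p in llm.parameters():
        p.requires_grad = False        # frozen LLM
    for p in qformer.parameters():
        p.requires_grad = True         # Q-Former cross-modal fusion layers
    query_tokens.requires_grad = True  # learnable query token matrix (nn.Parameter)
    for p in qscaler.parameters():
        p.requires_grad = True         # QScaler regression head
```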
4. System Overview
4.1. Task Modes and Input Interface
- Visual Question Answering (VQA): Given an input image and a user-issued question (e.g., “What is the person doing?”), the system directly generates an answer using its multimodal encoder-decoder, without a ranking process.
- Image-to-Image Retrieval (I2I): Provided with a reference image, the system computes cross-modal similarity scores against a gallery of pre-encoded images to retrieve visually or semantically similar samples.
- Text-to-Image Retrieval (T2I): Given a natural language description, the system encodes the textual query and ranks gallery images according to semantic similarity.
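Both retrieval modes reduce to ranking a pre-encoded gallery by embedding similarity. The sketch below illustrates this with cosine similarity; the function name and top-k interface are illustrative, not the system's actual API.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_emb, gallery_embs, top_k=5):
    """Rank pre-encoded gallery images by cosine similarity to a query embedding.

    query_emb:    (d,)   embedding of the text query (T2I) or reference image (I2I)
    gallery_embs: (N, d) pre-encoded gallery image embeddings
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), gallery_embs, dim=-1)  # (N,)
    scores, indices = torch.topk(sims, k=top_k)
    return indices.tolist(), scores.tolist()
```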
4.2. User Feedback and Interactive Refinement
- Score Refreshing: The system updates similarity scores by integrating the embeddings of the original query and the PS/NS samples, allowing local re-ranking without altering the semantic composition of the query.
- LLM-Guided Semantic Refinement: For more nuanced adjustment, the system employs an LLM interaction module [12], with the following features:
  - Clarifying single-choice questions are generated based on the original query and the PS/NS captions, aiming to uncover latent intent dimensions (e.g., action, object attributes, and scene context).
  - User-selected answers are incorporated to synthesize a semantically refined query.
  - The refined query is embedded and used to re-rank the gallery (a sketch of this re-ranking follows this list).
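The paper's exact re-ranking formula is not reproduced here; the sketch below illustrates one plausible, Rocchio-style way to fold PS/NS feedback and a (possibly LLM-refined) query embedding into the gallery scores. The weights alpha and beta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feedback_rerank(query_emb, gallery_embs, pos_embs=None, neg_embs=None,
                    alpha=0.5, beta=0.5, top_k=5):
    """Re-rank a pre-encoded gallery given a query embedding and optional
    positive/negative feedback embeddings (PS/NS).

    query_emb:    (d,)   original or LLM-refined query embedding
    gallery_embs: (N, d) gallery image embeddings
    pos_embs:     (P, d) user-labeled positive samples, or None
    neg_embs:     (M, d) user-labeled negative samples, or None
    """
    scores = F.cosine_similarity(query_emb.unsqueeze(0), gallery_embs, dim=-1)  # (N,)
    if pos_embs is not None and len(pos_embs) > 0:
        scores = scores + alpha * F.cosine_similarity(
            pos_embs.mean(0, keepdim=True), gallery_embs, dim=-1)
    if neg_embs is not None and len(neg_embs) > 0:
        scores = scores - beta * F.cosine_similarity(
            neg_embs.mean(0, keepdim=True), gallery_embs, dim=-1)
    top_scores, top_idx = torch.topk(scores, k=top_k)
    return top_idx.tolist(), top_scores.tolist()
```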
4.3. Prompt Design for Instruction Interpretation
- Question Generation Prompt: This template incorporates the original query along with natural language captions of the user-labeled positive and negative samples. The LLM is instructed to generate two to three clarifying, single-choice questions that contrast salient semantic attributes (e.g., action, background, clothing), distinguishing relevant from irrelevant results [16].
- Query Update Prompt: This template extends the interaction by embedding the original instruction, the generated questions, the user-selected answers, and sample captions. The LLM uses this information to synthesize a refined, semantically enriched query that more accurately reflects the clarified retrieval objective.
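A possible shape for the two templates is sketched below; the exact wording, field names, and formatting used by the system are not reproduced here, so everything in these strings is illustrative.

```python
# Hypothetical prompt templates; the wording used in the actual system may differ.
QUESTION_GENERATION_PROMPT = """\
Original query: {query}
Captions of results the user marked as relevant: {positive_captions}
Captions of results the user marked as irrelevant: {negative_captions}

Generate 2-3 single-choice clarifying questions that contrast salient semantic
attributes (e.g., action, background, clothing) distinguishing the relevant
results from the irrelevant ones. List the answer options for each question.
"""

QUERY_UPDATE_PROMPT = """\
Original query: {query}
Clarifying questions: {questions}
User-selected answers: {answers}
Captions of relevant results: {positive_captions}
Captions of irrelevant results: {negative_captions}

Rewrite the original query into a single refined retrieval query that reflects
the clarified intent. Return only the refined query.
"""

def build_prompt(template, **fields):
    """Fill a template with the current interaction state."""
    return template.format(**fields)
```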
5. Results
5.1. Experimental Setup
5.1.1. Datasets
5.1.2. Evaluation Strategy
5.1.3. Baselines
- CLIP [23]: a dual-encoder architecture trained on large-scale image–text pairs, widely adopted for zero-shot image–text retrieval;
- BLIP2 [1]: employs a fixed-length Q-Former with 32 learnable query tokens for vision–language alignment and serves as the primary backbone for our approach;
- DynamicViT [24]: adapts token usage dynamically based on visual input, reducing computational redundancy in vision transformers;
- CrossGET [2]: a cross-guided token-ensemble method that accelerates vision–language transformers while preserving cross-modal matching quality.
5.1.4. Implementation Details
5.2. Quantitative Results
5.2.1. Evaluation Metrics
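The quantitative and ablation results report an embedding similarity score (Sim), Recall@5 (R@5), and mean reciprocal rank (MRR). For reference, with $\mathrm{rank}_i$ denoting the rank of the ground-truth item for query $i$ among $N$ evaluation queries, the standard definitions of the latter two are:

$$
\mathrm{R@}K = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\mathrm{rank}_i \le K\right],
\qquad
\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i}.
$$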
5.2.2. Performance Comparison
5.2.3. Practical Feedback Protocol and Convergence Analysis
- Score Refreshing: The system updates similarity scores by integrating the embeddings of the original query with those of positive/negative samples (PS/NS), enabling local re-ranking while preserving the semantic composition of the query.
- LLM-Guided Semantic Refinement: The system employs an LLM interaction module to generate semantically refined query variants, which are re-encoded and re-ranked to better capture nuanced retrieval intents.
5.3. Ablation Study
5.3.1. Effect of QScaler-Guided Query Length Adaptation
5.3.2. Effect of Training the Q-Former
6. Discussion
6.1. Instruction-Aware Query Allocation Improves Semantic Alignment
6.2. Joint Optimization of Q-Former and QScaler Is Essential
6.3. Token Efficiency and Inference Trade-Offs
6.4. Extensibility Beyond Retrieval
6.5. Ethical Considerations and Robustness
6.6. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
2. Shi, D.; Tao, C.; Rao, A.; Yang, Z.; Yuan, C.; Wang, J. Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers. arXiv 2023, arXiv:2305.17455.
3. Gu, G.; Wu, Z.; He, J.; Song, L.; Wang, Z.; Liang, C. Talksee: Interactive video retrieval engine using large language model. In Proceedings of the International Conference on Multimedia Modeling, Amsterdam, The Netherlands, 29 January–2 February 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 387–393.
4. Chen, W.; Shi, C.; Ma, C.; Li, W.; Dong, S. DepthBLIP-2: Leveraging Language to Guide BLIP-2 in Understanding Depth Information. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 2939–2953.
5. Paul, D.; Parvez, M.R.; Mohammed, N.; Rahman, S. VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval. arXiv 2024, arXiv:2412.01558.
6. Jiang, Z.; Meng, R.; Yang, X.; Yavuz, S.; Zhou, Y.; Chen, W. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv 2024, arXiv:2410.05160.
7. Zhang, L.; Wu, H.; Chen, Q.; Deng, Y.; Siebert, J.; Li, Z.; Han, Y.; Kong, D.; Cao, Z. VLDeformer: Vision–language decomposed transformer for fast cross-modal retrieval. Knowl.-Based Syst. 2022, 252, 109316.
8. Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv 2023, arXiv:2305.06500.
9. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485.
10. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592.
11. Liu, Y.; Zhang, Y.; Cai, J.; Jiang, X.; Hu, Y.; Yao, J.; Wang, Y.; Xie, W. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 4015–4025.
12. Fu, T.J.; Hu, W.; Du, X.; Wang, W.Y.; Yang, Y.; Gan, Z. Guiding instruction-based image editing via multimodal large language models. arXiv 2023, arXiv:2309.17102.
13. Zhu, Y.; Zhang, P.; Zhang, C.; Chen, Y.; Xie, B.; Liu, Z.; Wen, J.R.; Dou, Z. INTERS: Unlocking the power of large language models in search with instruction tuning. arXiv 2024, arXiv:2401.06532.
14. Weller, O.; Van Durme, B.; Lawrie, D.; Paranjape, A.; Zhang, Y.; Hessel, J. Promptriever: Instruction-trained retrievers can be prompted like language models. arXiv 2024, arXiv:2409.11136.
15. Hu, Z.; Wang, C.; Shu, Y.; Paik, H.Y.; Zhu, L. Prompt perturbation in retrieval-augmented generation based large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1119–1130.
16. Xiao, Z.; Chen, Y.; Zhang, L.; Yao, J.; Wu, Z.; Yu, X.; Pan, Y.; Zhao, L.; Ma, C.; Liu, X.; et al. Instruction-vit: Multi-modal prompts for instruction learning in vit. arXiv 2023, arXiv:2305.00201.
17. Bao, Y. Prompt Tuning Empowering Downstream Tasks in Multimodal Federated Learning. In Proceedings of the 2024 11th International Conference on Behavioural and Social Computing (BESC), Harbin, China, 16–18 August 2024; pp. 1–7.
18. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
19. Siragusa, I.; Contino, S.; La Ciura, M.; Alicata, R.; Pirrone, R. MedPix 2.0: A Comprehensive Multimodal Biomedical Dataset for Advanced AI Applications. arXiv 2024, arXiv:2407.02994.
20. Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. Rsgpt: A remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 224, 272–286.
21. Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv 2015, arXiv:1412.2306.
22. Zhou, J.; Zheng, Y.; Chen, W.; Zheng, Q.; Su, H.; Zhang, W.; Meng, R.; Shen, X. Beyond content relevance: Evaluating instruction following in retrieval models. arXiv 2024, arXiv:2410.23841.
23. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
24. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. arXiv 2021, arXiv:2106.02034.
Implementation configuration (cf. Section 5.1.4):

Configuration | Value |
---|---|
Vision Encoder Initialization | Pretrained BLIP2 ViT-L/14 |
LLM Initialization | Pretrained OPT-2.7B (frozen) |
Q-Former Initialization | BLIP2 Q-Former weights |
QScaler Initialization | Random |
Learning Rate | |
Optimizer | AdamW |
Epochs | 1 |
GPU Hardware | 1 × NVIDIA RTX 4090 (24 GB) |
Performance comparison on COCO, MedPix, and RSICAP, with memory footprint and throughput (cf. Section 5.2.2):

Model | COCO | MedPix | RSICAP | Memory (MB) | Throughput (Images/s) |
---|---|---|---|---|---|
CLIP | 0.2572 | 0.2531 | 0.2726 | 2880 | 250.00 |
DynamicViT | 0.7982 | - | 0.5501 | 14,561 | 29.41 |
CrossGET | 0.9387 | 0.3478 | 0.4981 | 16,648 | 33.33 |
BLIP2 | 0.8569 | 0.3473 | 0.5394 | 7213 | 52.63 |
Ours | 0.9411 | 0.3231 | 0.4946 | 15,149 | 29.41 |
Ablation on QScaler-guided query length adaptation (cf. Section 5.3.1):

Setting | Sim↑ | R@5↑ | MRR↑ | Q-Len | FLOPs↓ (G) | Thr↑ (img/s) |
---|---|---|---|---|---|---|
With QScaler | 0.94 | 0.34 | 0.25 | 17.80 | 258.29 | 21.85 |
Without QScaler | 0.93 | 0.35 | 0.26 | 32.00 | 259.63 | 21.66 |
Ablation on training versus freezing the Q-Former (cf. Section 5.3.2):

Setting | Sim↑ | R@5↑ | MRR↑ | Q-Len | FLOPs↓ (G) | Thr↑ (img/s) |
---|---|---|---|---|---|---|
Trainable Q-Former | 0.94 | 0.34 | 0.25 | 17.80 | 258.29 | 21.85 |
Frozen Q-Former | 0.93 | 0.27 | 0.20 | 16.42 | 258.10 | 21.89 |