A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts
Abstract
1. Introduction
- it proposes a lightweight dual-stream VLM architecture that integrates both global and local visual features;
- it introduces a dynamic cross-modal alignment mechanism that overcomes the representation limitations of traditional PEFT methods;
- it provides a practical technical framework for efficient multimodal applications in resource-constrained environments.
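The dual-stream idea above (global and local visual features combined before the language model) can be sketched minimally. Everything below is an illustrative assumption, not the paper's exact design: the dimensions, the query-token counts, and the concatenate-then-project fusion are placeholders chosen to mirror the symbols listed in Appendix A (CLIP/SAM token features, learnable queries, linear projections, fused features).

```python
import numpy as np

# Hypothetical dimensions: 196 image tokens, CLIP width 1024, SAM width 256,
# LLM width 4096, 16 learnable query tokens per stream. All assumed.
n_img, d_clip, d_sam, d_model = 196, 1024, 256, 4096
n_q = 16

rng = np.random.default_rng(0)
F_clip = rng.standard_normal((n_img, d_clip))  # global features (CLIP stream)
F_sam = rng.standard_normal((n_img, d_sam))    # local features (SAM stream)
Q_c = rng.standard_normal((n_q, d_clip))       # learnable queries, CLIP stream
Q_s = rng.standard_normal((n_q, d_sam))        # learnable queries, SAM stream

# Concatenate queries with each token stream, project each stream to the
# LLM width with a linear map, then fuse by elementwise addition (assumed).
P_c = rng.standard_normal((d_clip, d_model))
P_s = rng.standard_normal((d_sam, d_model))
Z_c = np.concatenate([Q_c, F_clip]) @ P_c
Z_s = np.concatenate([Q_s, F_sam]) @ P_s
Z = Z_c + Z_s                                  # fused visual features

print(Z.shape)  # (212, 4096): 16 queries + 196 image tokens, at LLM width
```

The point of the sketch is only the data flow: two heterogeneous visual encoders are brought to a common width so their token sequences can be fused and fed to the frozen LLM.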
2. Related Work
2.1. Vision–Language Models
2.2. Adapter-Based PEFT
2.3. Prompt Learning
3. Method
3.1. Model Architecture
3.2. Image Encoder
3.3. Dynamic Prompt Adapter
3.4. LLaMA-7B Model Fine-Tuning
3.5. Training Method and Loss Function
4. Experiment and Result Analysis
4.1. Experimental Dataset Preparation
4.2. Experimental Environment
4.3. Model Performance Evaluation Method
- (1) Evaluation Methods for Model Intrinsic Performance Metrics
- (2) Evaluation Methods for Text Descriptions After Fine-Tuning
4.4. Experimental Parameter Configuration
4.5. Ablation Studies and Analysis
4.5.1. Ablation Study on the Encoding Component
4.5.2. Ablation Study on the DP-Adapter
4.5.3. Ablation Study on Partial Parameter Fine-Tuning of LLaMA
4.6. Comparative Experiments
4.6.1. Comparison on the MME Benchmark
4.6.2. Comparative Analysis on the MSCOCO Dataset
4.6.3. Comparison of Training Datasets and Trainable Parameter Scales
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
| Symbol | Meaning | Shape/Dimension |
|---|---|---|
| | input image batch | |
| | CLIP visual token features | |
| | SAM visual token features | |
| | learnable query tokens for CLIP stream | |
| | learnable query tokens for SAM stream | |
| | concatenated CLIP tokens with queries | |
| | concatenated SAM tokens with queries | |
| | projected CLIP features after Linear | |
| | projected SAM features after Linear | |
| | fused visual features | |
| | embedded prompt representation at layer l | |
| | transformed prompt after MLP | |
| | LoRA low-rank matrix A | |
| | LoRA low-rank matrix B | |
| | LoRA weight update | |
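The last three Appendix A rows (LoRA low-rank matrices and the LoRA weight update) follow the standard LoRA formulation, where the frozen weight W0 is adapted through a low-rank product ΔW = B·A. A minimal NumPy sketch, with illustrative dimensions d (model width) and r (rank) that are not taken from the paper:

```python
import numpy as np

d, r = 16, 4  # assumed model width and LoRA rank, for illustration only
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))        # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # low-rank matrix A (trainable)
B = np.zeros((d, r))                    # low-rank matrix B, zero-initialized

delta_W = B @ A   # LoRA weight update: ΔW = B·A, rank at most r
W = W0 + delta_W  # effective weight used during fine-tuning

# Because B starts at zero, the adapted model is identical to the
# pretrained model at initialization; only A and B receive gradients.
print(np.allclose(W, W0))
```

Only the d·r + r·d adapter entries are trainable, which is the source of the parameter-efficiency that Section 3.4 relies on.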
| Configuration | Version |
|---|---|
| Operating System | Ubuntu 22.04 |
| CPU | AMD EPYC 9754 128-Core |
| GPU | NVIDIA RTX 4090 (24 GB) × 8 |
| CUDA | 12.1 |
| Python Version | 3.10 |
| Framework | PyTorch 2.1.0 |
| Model | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| MFDIT(CLIP) | 80.6 | 64.9 | 49.5 | 38.9 | 30.9 | 59.6 | 123.7 | 22.5 |
| MFDIT(SAM) | 79.9 | 64.5 | 48.7 | 38.4 | 30.5 | 59.3 | 123.1 | 23.0 |
| MFDIT(C+S) | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9 |
| Model | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| MFDIT(base) | 80.1 | 65.0 | 49.6 | 39.1 | 30.9 | 59.2 | 125.4 | 22.0 |
| MFDIT(DP) | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9 |
| Model | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| MFDIT(f) | 78.0 | 62.2 | 47.9 | 38.1 | 29.3 | 58.0 | 124.9 | 21.8 |
| MFDIT(Uf) | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9 |
| Task | MFDIT | L-A V2 | BLIVA | GIT2 | BLIP-2 |
|---|---|---|---|---|---|
| existence | 190.0 | 185.0 | 180.0 | 190.0 | 160.0 |
| count | 138.3 | 133.3 | 138.3 | 118.3 | 135.0 |
| position | 63.3 | 56.7 | 81.7 | 96.7 | 73.3 |
| color | 118.3 | 118.3 | 180.0 | 158.3 | 148.3 |
| posters | 150.7 | 148.0 | 155.1 | 112.6 | 141.8 |
| celebrity | 135.3 | 134.7 | 140.9 | 145.9 | 105.6 |
| scene | 155.5 | 156.3 | 151.5 | 158.5 | 145.3 |
| landmark | 166.0 | 167.8 | 89.5 | 140.5 | 138.0 |
| artwork | 123.5 | 123.5 | 133.3 | 146.3 | 136.5 |
| OCR | 110.0 | 102.5 | 87.5 | 65.0 | 110.0 |
| total | 1350.9 | 1326.1 | 1337.8 | 1332.1 | 1293.8 |
| Task | MFDIT | L-A V2 | BLIVA | GIT2 | BLIP-2 |
|---|---|---|---|---|---|
| commonsense_reasoning | 110.0 | 106.4 | 136.4 | 99.2 | 110.0 |
| numerical_calculation | 50.0 | 47.5 | 57.5 | 50.0 | 40.0 |
| text_translation | 112.5 | 112.5 | 77.5 | 67.5 | 65.0 |
| code_reasoning | 87.5 | 90.0 | 60.0 | 45.0 | 75.0 |
| total | 360.0 | 356.4 | 331.4 | 261.7 | 290.0 |
| Image | Captions |
|---|---|
| (image) | GT: 1. A table with wine glasses and bottles of wine 2. There are bottles of wine and wine glasses arranged on a table Baseline: Wine Tasting at the Wine Cellar Ours: Bottles of wine and wine glasses at the Wine Cellar |
| (image) | GT: 1. Mushrooms are used in many variety of dishes 2. A chicken and rice dish being cooked. Baseline: Collage of images showing how to make a chicken and vegetable stew Ours: A chicken and rice dish is being cooked in a pan along with vegetables and mushrooms. |
| (image) | GT: 1. A wooden table filled with wine glasses and a bowl of salad. 2. A table has three wine glasses, several bottles of wine, and a bowl of salad. Baseline: Wine and food pairing Ours: A table with wine glasses, bottles, and salad. |
| (image) | GT: 1. A person is eating at a table with plates, a fork, knife, spoon, cup, and cell phone on it. 2. Overhead view of a table with a log and food on it. Baseline: A woman eats a sandwich while sitting on a couch with a coffee and a book. Ours: A woman enjoys a sandwich with coffee and a cell phone beside her. |
| (image) | GT: 1. A plate of beans and eggs with a knife and fork on it. 2. A plate of beans, eggs, and toast is on the table. Baseline: Fried Eggs on Toast Ours: There are beans and eggs with a knife and fork on a plate. |
| Model | LID | ITPD | VID | FTP | MME Eval |
|---|---|---|---|---|---|
| MiniGPT-4 | 70 K | 134 M | 5 K | 13 B | 725.95 |
| LLaVA | 70 K | 595 K | 158 K | 13 B | 1826.67 |
| L-A V2 | 52 K | 567 K | 0 K | 14 M | 1682.5 |
| MFDIT | 52 K | 567 K | 0 K | 20 M | 1710.9 |
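The parameter-scale comparison above can be made concrete with a back-of-envelope calculation. Assuming the FTP column lists trainable fine-tuned parameters and LLaMA-7B has roughly 7 B total parameters (both readings are assumptions about the table, not stated in it):

```python
# Rough trainable-parameter fraction: PEFT-style MFDIT vs. full fine-tuning.
llama_7b_total = 7_000_000_000   # approximate LLaMA-7B parameter count
mfdit_trainable = 20_000_000     # FTP column for MFDIT (20 M, per the table)

fraction = mfdit_trainable / llama_7b_total
print(f"{fraction:.2%}")  # well under 1% of the full model's parameters
```

Even against the 14 M of L-A V2, MFDIT's 20 M remains three orders of magnitude below the 13 B of fully fine-tuned baselines, which is the efficiency argument the comparison is making.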
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Yin, Y.; Cao, H.; Zhang, C.; Jin, F.; Liu, X.; Lin, J. A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts. Appl. Sci. 2026, 16, 1857. https://doi.org/10.3390/app16041857