Article

A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts

1 School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
2 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221008, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(4), 1857; https://doi.org/10.3390/app16041857
Submission received: 8 January 2026 / Revised: 3 February 2026 / Accepted: 9 February 2026 / Published: 12 February 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

To address the trade-off between parameter scale and generation quality in Vision-Language Models (VLMs), this study proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA. By integrating CLIP-based global features with SAM-derived local features, the model constructs a multi-level visual representation. Additionally, a Dynamic Prompt Adapter is designed to enable cross-modal semantic alignment with adaptive flexibility. Combined with a Low-Rank Adaptation (LoRA) fine-tuning strategy, the proposed method enhances the model’s capability in describing diverse images while training only 20 million parameters, accounting for merely 0.05% of the total parameter volume. Experimental results demonstrate that the model achieves a CIDEr score of 126.7 on the MSCOCO dataset, surpassing traditional adapter-based approaches by 3.0 points. Moreover, in the MME Benchmark evaluation, the proposed model outperforms the mainstream LLaMA-Adapter V2 by 7.3% and 3.8% in OCR and object counting tasks, respectively. Ablation studies further validate the synergistic effects of multi-feature fusion and dynamic instruction optimization. This research provides an efficient solution for parameter-efficient multimodal model training and potential deployment in resource-constrained environments.

1. Introduction

In recent years, Vision-Language Models (VLMs), which integrate visual encoders with Large Language Models (LLMs), have demonstrated remarkable performance in multimodal understanding and generation tasks. However, current approaches face two major limitations. First, mainstream VLMs such as GPT-4V and BLIP-2 rely on large-scale parameter models, typically at the tens-of-billions level, to achieve effective multimodal alignment (as illustrated in Figure 1), resulting in extremely high training costs and limited deployability. Second, conventional single-encoder architectures, such as CLIP, suffer from restricted feature representation, making it challenging to simultaneously capture global scene semantics and local object relationships in a coherent manner. Therefore, developing parameter-efficient and feature-rich lightweight VLMs has emerged as a critical challenge for advancing the practical application of multimodal technologies.
Although current Parameter-Efficient Fine-Tuning (PEFT) techniques can alleviate the high training cost of VLMs, existing methods still exhibit significant limitations. Specifically, fixed prompts result in rigid cross-modal attention, which fails to adapt to the dynamic semantic demands of diverse images. Additionally, the absence of hierarchical structure in single-modality feature inputs hampers the model’s ability to achieve fine-grained understanding of complex scenes.
Our work is positioned at the intersection of dual-encoder visual representation and parameter-efficient adaptation for generative captioning. Different from prior dual-encoder VLMs that mainly focus on designing visual encoders or fusion for general vision–language understanding, we combine complementary global and instance-level visual cues specifically for LLaMA-based caption generation. Compared with prompt-learning methods that primarily optimize prompts in CLIP-style discriminative settings, our method targets a generative decoder and addresses cross-modal alignment via a lightweight adapter with gated modulation. Compared with adapter-based PEFT approaches that mostly study efficient adaptation of language models, we highlight how PEFT can be coupled with multi-feature visual conditioning for captioning without updating the frozen visual encoders.
To address these challenges, this paper proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning method, which introduces three core innovations:
Multi-Scale Feature Fusion Architecture: By integrating global semantic features extracted from CLIP (ViT-L/14) with local mask features provided by SAM, the model employs a gated attention mechanism to achieve collaborative representation of both scene-level and object-level information. Experimental results show that this dual-encoder fusion framework improves the CIDEr score by 3.0 points, outperforming single-encoder baselines.
Dynamic Prompt Adapter (DP-Adapter): A learnable prompt is embedded into the final Transformer layers of LLaMA, where a zero-initialized gating factor is introduced to dynamically adjust cross-modal attention weights. This design effectively addresses the semantic misalignment issue caused by static prompts in conventional approaches.
Hierarchical Parameter Tuning Strategy: Leveraging Low-Rank Adaptation (LoRA) to selectively fine-tune specific linear layers of LLaMA, combined with partial layer normalization parameter updates, the model enables efficient multimodal knowledge transfer while training only 20 million parameters, which accounts for merely 0.05% of the total model size.
In this study, we validate the proposed method on the MSCOCO benchmark and further evaluate its multimodal understanding ability on the MME benchmark. Cross-domain captioning evaluation on specialized datasets is beyond the scope of this work and is left for future study.
Experimental evaluations on the MSCOCO dataset demonstrate that MFDIT achieves BLEU-4 and CIDEr scores of 39.5 and 126.7, respectively, surpassing comparison models such as LLaMA-Adapter V2. Further assessment on the MME Benchmark confirms the model’s fine-grained reasoning advantages, with performance improvements of 7.3% in positional recognition and 3.8% in object counting tasks. Ablation studies and attention visualization analyses verify that the synergy between multi-feature fusion and dynamic instruction optimization is the key factor driving the performance gains.
The main contributions of this study are as follows:
  • it proposes a lightweight dual-stream VLM architecture that integrates both global and local visual features;
  • it introduces a dynamic cross-modal alignment mechanism that overcomes the representation limitations of traditional PEFT methods;
  • it provides a practical technical framework for efficient multimodal applications in resource-constrained environments.

2. Related Work

2.1. Vision–Language Models

Vision-Language Models (VLMs), as pre-trained multimodal frameworks, are designed to embed images and text into a unified semantic space through large-scale training on paired image-text datasets, enabling effective cross-modal information integration. A milestone in this field is CLIP [6], which adopts a contrastive learning strategy to jointly train image and text encoders on approximately 400 million image-text pairs, ensuring that matched images and texts are positioned closely in the embedding space. This characteristic endows CLIP with powerful zero-shot transfer capabilities, allowing it to achieve competitive or even superior performance compared to traditional supervised models such as ResNet50 [7] on various visual tasks without requiring additional task-specific training.
Building upon CLIP, subsequent studies have introduced several improved models. For instance, CoCa [8] integrates image captioning capabilities by jointly optimizing contrastive loss and caption generation loss, aiming to obtain more fine-grained embeddings. ALBEF [9] introduces an image-text contrastive (ITC) loss prior to cross-modal fusion to better align unimodal representations, thereby enhancing performance on complex tasks. Notably, ALBEF adopts an encoder-only architecture, making it primarily suitable for vision-language understanding tasks. Models such as SimVLM [10] and BLIP [11], based on encoder-decoder architectures, further extend generative capabilities, particularly in image captioning.
Despite the remarkable ability of VLMs to establish strong visual-textual associations, existing studies have pointed out that these models often focus predominantly on recognizing “what” is present in an image while paying insufficient attention to the spatial relationships among objects [12,13]. Consequently, fine-tuning is generally required when adapting VLMs to specific tasks. Given CLIP’s open-source accessibility, extensively validated effectiveness, and universality as a foundational model, it is selected as the core starting point for the subsequent research in this study.

2.2. Adapter-Based PEFT

To address the substantial resource consumption associated with fine-tuning large-scale pre-trained models, PEFT techniques have emerged as a promising solution [14,15]. The core principle of PEFT lies in freezing the majority of the pre-trained model’s parameters while introducing a small number of additional trainable parameters—commonly referred to as adapters—to perform task-specific adaptation. This strategy eliminates the need to update the entire, often enormous, parameter set, thereby reducing the computational and memory requirements during fine-tuning.
Among various PEFT methods, LoRA [16,17,18] has emerged as one of the most widely validated and effective techniques. The core idea of LoRA is to inject trainable low-rank matrices into critical layers of models such as Transformers, particularly within attention modules. Consequently, adapter-based PEFT approaches have become the mainstream solution for efficiently transferring pre-trained model knowledge in resource-constrained scenarios.

2.3. Prompt Learning

Prompt Learning originated in the field of natural language processing, with its core principle centered on guiding pretrained large language models to adapt to downstream tasks by incorporating specific prefixes into input text. This approach enables models to achieve superior performance in few-shot and zero-shot scenarios. Subsequent research has focused on prompt optimization, exploring techniques such as text mining [19], prompt rewriting [20,21], and gradient-based optimization methods [14,22,23,24,25,26]. The paradigm has since been successfully extended to VLMs. CLIP pioneered the use of predefined hard prompts for zero-shot evaluation, while later approaches [27,28] introduced globally learnable soft prompts. Further advancements include the development of soft prompt sets trained per category or task [29,30] or the generation of input-dependent dynamic prompts [31]. Depending on the target of prompt application, methods can be categorized into those acting on the text encoder (e.g., CoOp [32], CoCoOp [33]) or the image encoder (e.g., MaPLe [34], VPT [35], PromptSRC [36]). Collectively, these developments underscore the immense potential of prompt learning in facilitating efficient and robust transfer learning.

3. Method

3.1. Model Architecture

The MFDIT image captioning model based on LLaMA proposed in this section primarily consists of three components: an image encoder, a Dynamic Prompt Adapter, and a fine-tuned LLaMA-7B model, as illustrated in Figure 2.
The MFDIT model employs CLIP and SAM, with frozen pretrained parameters, as the two feature extractors of the image encoder. Image features are processed through a trainable dynamic query mechanism and a linear layer for dimensional mapping, and the Dynamic Prompt Adapter itself is trainable. Within the LLaMA-7B model, most parameters are frozen, and only a small subset is designated as trainable to build a short-text image description model. To improve the diversity of image descriptions, the LoRA method is further used to fine-tune the LLaMA-7B model, aligning it with image features and enhancing its capability for long-text description generation.

3.2. Image Encoder

This study employs a multi-feature encoding approach to extract complementary visual cues for image captioning. We combine a pretrained CLIP image feature extractor to represent global scene information with a pretrained SAM mask feature extractor to provide local instance-level cues. These local cues help describe fine-grained details and support more accurate grounding of objects and regions. As illustrated in Figure 2, the image encoder integrates CLIP and SAM to leverage their complementary strengths in feature extraction. Figure 3 demonstrates the distinct attention patterns exhibited by CLIP and SAM when processing the same image.
Given $N$ images, global image features $F_{clip} = \{g_1, g_2, g_3, \ldots, g_N\}$ are extracted using the CLIP (ViT-L/14) image feature extractor. Let the image batch be $X \in \mathbb{R}^{N \times C \times H \times W}$, where $C$ denotes the number of image channels and $H$ and $W$ represent the image height and width, respectively. CLIP feature extraction is formulated as:

$$F_{clip} = \mathrm{CLIP}(X) \in \mathbb{R}^{N \times F \times D_1}$$

Here, $F$ denotes the number of tokens produced by CLIP, and $D_1$ is the feature dimension. To align image and text features for the subsequent computations in the Adapter component, and inspired by common multimodal information interaction methods, the same trainable query tokens $Q_{clip} \in \mathbb{R}^{N \times q \times D_1}$ are added for each image as dynamic image prompts. The queries are then concatenated with the CLIP features as follows:

$$F'_{clip} = \mathrm{Concat}(Q_{clip}, F_{clip}) \in \mathbb{R}^{N \times (F + q) \times D_1}$$

Here, $q$ is a hyperparameter, set to match the prompt length used in the subsequent Adapter. Since this operation yields a feature dimension that does not meet the requirements of the downstream method, the features are mapped to the required dimension $D_0$. A simple yet effective trainable linear layer is used for this transformation:

$$Z_{clip} = \mathrm{Linear}(F'_{clip}) \in \mathbb{R}^{N \times (F + q) \times D_0}$$

In a similar manner, local image features $F_{SAM} = \{l_1, l_2, l_3, \ldots, l_N\}$ are extracted using the SAM-B (ViT-B/16) image feature extractor. In the pre-trained SAM implementation, multiple annotation points are specified to generate high-probability tokens and their corresponding predicted category mask features. The subsequent operations mirror those of the CLIP stream, using stream-specific query tokens $Q_{SAM} \in \mathbb{R}^{N \times q \times D_2}$ (see Table A1):

$$F_{SAM} = \mathrm{SAM}(X) \in \mathbb{R}^{N \times F \times D_2}$$

$$F'_{SAM} = \mathrm{Concat}(Q_{SAM}, F_{SAM}) \in \mathbb{R}^{N \times (F + q) \times D_2}$$

$$Z_{SAM} = \mathrm{Linear}(F'_{SAM}) \in \mathbb{R}^{N \times (F + q) \times D_0}$$

The linear layer $\mathrm{Linear}(\cdot)$ of each stream remains trainable.
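To make the per-stream processing concrete, the following PyTorch sketch shows one way to implement a branch: trainable query tokens are prepended to the frozen backbone's token features, projected to the shared dimension $D_0$, and normalized (as required in the fusion step below). The module name, initialization scale, and backbone widths are illustrative assumptions rather than values specified in this paper.

```python
import torch
import torch.nn as nn

class BranchProjector(nn.Module):
    """Sketch of one encoder stream: prepend trainable query tokens to frozen
    backbone features, project to the shared dimension D0, then normalize."""
    def __init__(self, d_in: int, d_out: int, num_queries: int = 20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, d_in) * 0.02)
        self.proj = nn.Linear(d_in, d_out)   # the trainable Linear(.)
        self.norm = nn.LayerNorm(d_out)      # normalization after projection

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, F, d_in) token features from the frozen encoder
        q = self.queries.expand(feats.size(0), -1, -1)
        x = torch.cat([q, feats], dim=1)     # (N, F+q, d_in), cf. Concat(Q, F)
        return self.norm(self.proj(x))       # (N, F+q, d_out)

# Assumed widths: 1024 for the CLIP stream, 256 for the SAM-B mask features,
# projected to 4096 (LLaMA-7B's hidden size); all three are our assumptions.
clip_branch = BranchProjector(d_in=1024, d_out=4096)
sam_branch = BranchProjector(d_in=256, d_out=4096)
```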
To facilitate understanding of the proposed framework, Figure 2, Figure 3 and Figure 4 provide an overview of the overall processing pipeline. Given an input image, feature extraction is first performed using two frozen visual encoders in parallel. The CLIP image encoder is responsible for capturing global, scene-level semantic information, whereas the SAM encoder focuses on instance-level local features, highlighting object boundaries and spatial structures.
The resulting global and local visual representations are then projected into a unified embedding space using lightweight trainable linear layers. These features are subsequently fused to form a set of visual prompt tokens, allowing global contextual information and local object cues to be jointly preserved in a compact representation.
The visual prompt tokens are integrated into the LLaMA-based language model through the proposed Dynamic Prompt Adapter. By introducing a zero-initialized gating mechanism, the DP-Adapter enables a gradual modulation of attention in the higher Transformer layers, which helps stabilize training while supporting effective multimodal adaptation.
Based on the adapted representations, the LLaMA model finally generates image captions conditioned on both the visual prompts and the textual context, achieving coherent description generation with only a small number of trainable parameters.
In addition, we propose a method that fuses global feature queries and local feature queries to effectively generate the image feature query $Q_Z$ for subsequent image-text alignment tasks. A direct additive fusion strategy is adopted to simplify computation while maintaining consistent feature dimensions, thereby providing a convenient framework for subsequent interactive learning with the dynamic prompts.
Specifically, $Z_{clip}$ is used as the global feature query to capture global scene-level information, while $Z_{SAM}$ is used as the local feature query to provide instance-level cues for region-aware feature fusion. The image feature query $Q_Z$ is obtained by directly summing the global and local features, that is:

$$Q_Z = Z_{clip} + Z_{SAM}$$
We adopt additive fusion because it is parameter-free and dimension-preserving, which avoids introducing extra trainable parameters in the fusion stage and keeps the trainable budget concentrated in the PEFT modules. This simple fusion also provides stable integration of complementary global CLIP cues and instance-level SAM cues within the same embedding space.
It is worth noting that each feature, after the linear transformation, is passed through a normalization layer. This step is crucial for training stability and convergence speed: keeping the feature distributions consistent effectively prevents exploding or vanishing gradients, and adjusting the feature mean and variance accelerates training and improves convergence.
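Continuing the sketch above, the two projected streams can then be fused by direct addition as in $Q_Z = Z_{clip} + Z_{SAM}$; this usage fragment assumes both frozen encoders emit the same number of tokens $F$ (the token count and batch size below are placeholders), reusing the hypothetical `BranchProjector` modules defined earlier.

```python
import torch

# Placeholder inputs: (N, F, D1) CLIP tokens and (N, F, D2) SAM tokens.
clip_tokens = torch.randn(4, 257, 1024)
sam_tokens = torch.randn(4, 257, 256)

# Parameter-free, dimension-preserving additive fusion: Q_Z = Z_clip + Z_SAM.
q_z = clip_branch(clip_tokens) + sam_branch(sam_tokens)  # (N, F+q, 4096)
```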

3.3. Dynamic Prompt Adapter

Traditional instruction-tuning models are highly sensitive to the quality of input prompts; even with identical model parameters, the final generation results can vary depending solely on the input prompt. Inspired by conventional PEFT methods, this section introduces a DP-Adapter designed for multimodal semantic alignment scenarios. Unlike fixed prompts, the proposed DP-Adapter dynamically adjusts prompts based on the training data, allowing the model to learn optimal prompts. The specific method of the DP-Adapter is illustrated in Figure 4.
Given a pre-trained LLaMA with $N$ Transformer layers, a set of learnable dynamic prompts $\{P_l\}_{l=1}^{L}$, where $P_l \in \mathbb{R}^{p}$ and $p$ denotes the dynamic prompt length, is first transformed through a word embedding layer to obtain a learnable dynamic word embedding representation. This process is formulated as:

$$P_l^{e} = \mathrm{Embedding}(P_l), \quad P_l^{e} \in \mathbb{R}^{p \times C}$$

Subsequently, the prompts are passed through a learnable MLP to perform linear and non-linear transformations, yielding $P'_l \in \mathbb{R}^{p \times C}$. This process is defined as follows:

$$P'_l = \mathrm{MLP}(P_l^{e})$$

Here, $L$ denotes the number of final Transformer layers of LLaMA in which the prompts are inserted ($L \le N$), and $C$ is the feature dimension of the Transformer. The MLP aims to enhance the alignment capability of the prompts between the visual and textual modalities. Placing the DP-Adapter exclusively in the final $L$ layers allows better adjustment of language representations carrying higher-level semantics.
This placement keeps earlier layers unchanged, helping preserve the pretrained language modeling capability. It also concentrates adaptation in the high-level layers that more directly affect how visual semantics are integrated into generation. Moreover, limiting the adapter to the final layers keeps the added trainable footprint small compared with inserting adapters throughout the full stack.
The resulting $P'_l$ is then integrated with the frozen LLaMA model for collaborative adjustment. Taking the $l$-th layer as an example, let $T_l$ denote the prompt instruction tokens and $t_l$ the currently generated token, i.e., the response produced after fusion with image features. The attention scores are computed as:

$$Q_l = \mathrm{Linear}_q(t_l)$$

$$K_l = \mathrm{Linear}_K([P'_l; T_l; t_l])$$

$$V_l = \mathrm{Linear}_V([P'_l; T_l; t_l])$$

$$S_l = \frac{Q_l K_l^{T}}{\sqrt{C}} \in \mathbb{R}^{1 \times (p + M + 1)}$$

Here, $M + 1$ denotes the number of word tokens, i.e., the instruction tokens $T_l$ together with the current generated token $t_l$.
Since the early training stages may introduce noise that affects the stability of the model’s output, this study adopts the zero-initialized attention mechanism proposed by Gao et al. to mitigate potential disturbances when the model adapts to dynamic prompts. In this approach, the attention score S l is decomposed into two components:
$$S_l = [S_l^{p}; S_l^{M+1}]^{T}$$

$S_l^{p}$ represents the attention score for the dynamic prompt, and $S_l^{M+1}$ denotes the attention score for the generated description. To control the perturbations introduced during the early training stages, a zero-initialized gating factor $g_l$ is introduced:

$$S_l^{g} = [\mathrm{softmax}(S_l^{p}) \cdot \tanh(g_l); \mathrm{softmax}(S_l^{M+1})]^{T}$$

where $\tanh(\cdot)$ constrains $g_l$ within the range $[-1, 1]$. An independent gating factor $g_l$ is applied in each Transformer layer so that its influence on the model's pre-existing knowledge can increase gradually, layer by layer. The output after computing the attention-weighted sum is:

$$t_l^{o} = \mathrm{Linear}_o(S_l^{g} V_l)$$
The DP-Adapter inserts dynamic prompts within the Transformer layers and introduces zero-initialized gating factors to control the influence of the prompts. This mechanism gradually adjusts the instruction information for new tasks during training, thereby enhancing the stability of fine-tuning and improving the model’s generalization capability.
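As a compact illustration of this mechanism, the single-head sketch below implements the gated attention equations for one layer; the projection layers stand in for LLaMA's frozen attention weights, and all names and shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ZeroGatedPromptAttention(nn.Module):
    """Single-head sketch of the DP-Adapter's zero-initialized gated attention."""
    def __init__(self, dim: int, prompt_len: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # stands in for LLaMA's frozen W_q
        self.k = nn.Linear(dim, dim)   # stands in for LLaMA's frozen W_k
        self.v = nn.Linear(dim, dim)   # stands in for LLaMA's frozen W_v
        self.o = nn.Linear(dim, dim)   # stands in for LLaMA's frozen W_o
        self.gate = nn.Parameter(torch.zeros(1))  # g_l, zero-initialized
        self.p = prompt_len

    def forward(self, t_l, prompt, instr):
        # t_l: (B, 1, C) current token; prompt: (B, p, C) = P'_l; instr: (B, M, C) = T_l
        ctx = torch.cat([prompt, instr, t_l], dim=1)          # [P'_l; T_l; t_l]
        Q, K, V = self.q(t_l), self.k(ctx), self.v(ctx)
        S = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (B, 1, p+M+1)
        S_p, S_w = S[..., :self.p], S[..., self.p:]           # prompt vs word scores
        S_g = torch.cat([S_p.softmax(-1) * torch.tanh(self.gate),
                         S_w.softmax(-1)], dim=-1)            # gated scores
        return self.o(S_g @ V)                                # t_l^o, shape (B, 1, C)
```

Because the gate starts at zero, the prompt branch contributes nothing at initialization and its influence grows only as training updates $g_l$, matching the stability argument above.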

3.4. LLaMA-7B Model Fine-Tuning

To align the multimodal information between images and text, the DP-Adapter alone is not sufficient to fully fuse multimodal features; therefore, the LLaMA-7B model itself must be fine-tuned effectively. In this section, a small set of LLaMA-7B parameters is selectively unlocked to enhance the performance of the LLaMA-based Multi-Feature Dynamic Instruction Tuning image captioning model. The unlocked parameters are limited to the normalization layers and, for each linear layer, a newly introduced bias and scaling factor. By increasing the model's capacity for adjustment, knowledge can propagate throughout the entire LLM. Notably, these trainable parameters account for only about 0.04% of the total model, ensuring that the proposed method remains a PEFT approach.
Each normalization layer in every Transformer block is unfrozen. Additionally, for each linear layer in the Transformer, a bias $b$ and a scaling factor $w$ are introduced as two learnable parameters. Let $x$ denote the input to a linear layer and $W$ its pre-trained weight. The corresponding formulation is:

$$y = w \cdot (W x + b)$$

Here, $W$ denotes the original frozen weight of the linear layer in LLaMA, with the scaling factor $w$ initialized to 1 and the bias $b$ initialized to 0. As with the zero-initialized attention mechanism, these initializations help stabilize the early stages of training.
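A minimal sketch of this reparameterization is given below, assuming a frozen pretrained `nn.Linear`; because $w$ starts at 1 and $b$ at 0, the wrapped layer initially reproduces the pretrained output exactly.

```python
import torch
import torch.nn as nn

class BiasScaleLinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a learnable bias b (init 0)
    and scaling factor w (init 1), computing y = w * (W x + b)."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.inner = pretrained
        for p in self.inner.parameters():            # keep the pretrained W frozen
            p.requires_grad_(False)
        out = pretrained.out_features
        self.scale = nn.Parameter(torch.ones(out))   # w, initialized to 1
        self.bias = nn.Parameter(torch.zeros(out))   # b, initialized to 0

    def forward(self, x):
        return self.scale * (self.inner(x) + self.bias)
```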
When generating long-text image descriptions, it is necessary to strengthen the alignment between the large language model and image features. To this end, this study also introduces LoRA for parameter-efficient fine-tuning. LoRA adjusts weights by adding low-rank matrices to the linear layers in LLaMA’s Transformer blocks, and the specific formulation is as follows:
$$W' = W_0 + \Delta W$$

$$\Delta W = A \cdot B$$

This is a common and effective fine-tuning method for LLMs: it reduces the number of trainable parameters by adding trainable low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ while keeping most of the LLaMA weights frozen.
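The sketch below shows one way to wrap a frozen linear layer with such a low-rank update, following the $\Delta W = A \cdot B$ shape convention of Table A1; the rank and the initialization scale of $A$ are illustrative choices, not values reported in this paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Adds a trainable low-rank update dW = A @ B to a frozen linear layer,
    so the effective weight is W' = W0 + dW."""
    def __init__(self, pretrained: nn.Linear, rank: int = 8):
        super().__init__()
        self.inner = pretrained
        for p in self.inner.parameters():            # W0 stays frozen
            p.requires_grad_(False)
        d, k = pretrained.out_features, pretrained.in_features
        self.A = nn.Parameter(torch.randn(d, rank) * 0.01)  # A in R^{d x r}
        self.B = nn.Parameter(torch.zeros(rank, k))  # B in R^{r x k}; zero-init so dW = 0 at start

    def forward(self, x):
        delta = self.A @ self.B            # dW in R^{d x k}
        return self.inner(x) + x @ delta.T
```

Zero-initializing $B$ makes the update vanish at the start of training, so the fine-tuned model begins exactly at the pretrained solution, mirroring the stability rationale of the zero-initialized gates.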

3.5. Training Method and Loss Function

In this experiment, when fine-tuning the model to generate short-text descriptions, the extracted visual features are incorporated into the dynamic adaptive prompts as modality inputs, using only the MSCOCO image captioning dataset. The core objective of this fine-tuning approach is to equip the LLaMA-based MFDIT image captioning model with the ability to generate concise textual descriptions. The loss function employed is the cross-entropy loss, which is defined as follows:
$$\mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i)$$

Here, $y_i$ denotes the ground-truth distribution of the target token, and $\hat{y}_i$ represents the probability distribution predicted by the model. This loss function measures the alignment between the model's predicted text and the reference text, encouraging the generation of more accurate and natural textual descriptions.
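In implementation terms, this token-level cross-entropy can be computed as in the short sketch below; the tensor names and the padding token id are assumptions for illustration.

```python
import torch.nn.functional as F

def caption_loss(logits, targets, pad_id=0):
    # logits: (B, T, vocab) model predictions; targets: (B, T) token ids.
    # Flatten to token level and mask out padding positions.
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                           ignore_index=pad_id)
```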
When training a long-text image captioning model, this study adopts a joint training and visual knowledge integration strategy to further enhance multi-modal information fusion and reduce the interference of image understanding tasks with long-text generation capabilities. The model is trained on both image-text pair data and instruction-following data, while LoRA-based fine-tuning is applied only to the large language model during the final training epochs. Visual tokens are injected into the higher Transformer layers of LLaMA so that visual information is encoded at a high semantic level, ensuring that the model can fully utilize image features during language generation without substantially affecting the underlying language modeling capacity. Compared to conventional approaches that fuse visual features at lower Transformer layers, this strategy more effectively preserves language generation ability while strengthening the influence of visual information, thereby improving performance on complex image captioning and reasoning tasks.
The combination of these three techniques enables the LLaMA-based multi-feature dynamic instruction-tuned image captioning model to achieve efficient parameter utilization and strong multimodal reasoning capability. Unlike traditional large-scale pre-training methods, this approach does not rely on massive amounts of high-quality multimodal instruction data, yet it still achieves efficient and accurate image captioning with excellent multimodal instruction-following performance, providing a novel solution for cost-effective and versatile vision-language model development. The abbreviations and tensor notations are provided in Appendix A.

4. Experiment and Result Analysis

4.1. Experimental Dataset Preparation

Fine-tuning the MFDIT image captioning model based on LLaMA aims to achieve two objectives: image-text alignment and verbal instruction following. To simultaneously optimize these dual objectives, this experiment adopts the parameter decoupling joint training method proposed by Gao et al. This approach not only mitigates the issue of scale disparities between image-text alignment and verbal instruction following datasets but also effectively reduces training computational costs. For the image-text dataset, this experiment utilizes the MSCOCO Caption dataset, which comprises 567 K image-text pairs.

4.2. Experimental Environment

Regarding computational efficiency, training the MFDIT model requires approximately 1.5 h per epoch on a single NVIDIA RTX 4090 GPU, and the full training process converges within 10 epochs. For inference, the model achieves an average latency of 45 ms per caption, making it suitable for real-time applications. The experiments were conducted on a server running Ubuntu 22.04 with NVIDIA RTX 4090 GPUs and Python 3.10. Since CLIP and SAM are kept frozen in our PEFT setting, training remains parameter-efficient, although enabling the SAM branch increases inference computation. The detailed experimental environment configuration is presented in Table 1.

4.3. Model Performance Evaluation Method

To evaluate the performance of the MFDIT model, this experiment employs two categories of assessment methods: (1) evaluation methods targeting the model’s intrinsic performance metrics, and (2) evaluation methods for text descriptions after fine-tuning.
(1) Evaluation of the model's intrinsic performance. To comprehensively assess the capabilities of multimodal large models across various visual and language tasks, numerous benchmarks have emerged in recent years, including the MME Benchmark, VQAv2, OK-VQA, MathVista, and MMBench. This experiment adopts the MME Benchmark for model performance evaluation.
(2) Evaluation of text descriptions after fine-tuning. The model is fine-tuned on the MSCOCO Caption dataset and evaluated with the mainstream metrics BLEU, ROUGE, SPICE, METEOR, and CIDEr.

4.4. Experimental Parameter Configuration

The experimental settings employed in this study are as follows: For image feature encoding in the proposed MFDIT framework, global features are extracted using CLIP (ViT-L/14), while local features are extracted using SAM-base. The length of the Dynamic Prompt (DP) is set to 20, and the hidden layer dimension of the MLP is configured to 256. Visual tokens and dynamic prompts are inserted into the final 30 layers of the network. The training phase consists of 5 epochs, with a warm-up period of 2 epochs, and an additional LoRA fine-tuning phase spanning 10 epochs. The LLaMA model utilized in the experiments contains 7 billion parameters and adopts a 32-layer Transformer architecture.
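For reference, these settings can be collected into a single configuration sketch; the field names are our own, the values are taken from the text, and unstated options (e.g., the LoRA rank and learning rates) are deliberately omitted rather than guessed.

```python
from dataclasses import dataclass

@dataclass
class MFDITConfig:
    clip_backbone: str = "ViT-L/14"  # global feature extractor
    sam_backbone: str = "sam-base"   # local feature extractor
    prompt_len: int = 20             # dynamic prompt (DP) length
    mlp_hidden: int = 256            # DP-Adapter MLP hidden dimension
    adapter_layers: int = 30         # final LLaMA layers receiving prompts
    llama_layers: int = 32           # LLaMA-7B Transformer depth
    epochs: int = 5                  # main training phase
    warmup_epochs: int = 2
    lora_epochs: int = 10            # additional LoRA fine-tuning phase
```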

4.5. Ablation Studies and Analysis

To comprehensively verify the effectiveness of the proposed LLaMA-based MFDIT approach for image description generation, extensive ablation studies are conducted on the MSCOCO dataset. The MFDIT framework is fine-tuned on MSCOCO and evaluated on the 5K test set to systematically analyze the impact of each component on short text generation performance.
The stability of the DP-Adapter stems from its inherent hierarchical and adaptive design. Specifically, the dynamic prompts transition from focusing on spatial-structural grounding (via SAM features) in the initial layers to semantic-contextual synthesis (via CLIP features) in the deeper layers of the LLaMA backbone. Furthermore, the adapter exhibits input-dependent modulation: it generates concentrated feature weights for simple scenes to maintain precision while adopting a more distributed representation for complex, multi-object images to ensure comprehensive coverage. This mechanism allows the model to achieve a robust balance between local detail and global coherence without additional parameter overhead.

4.5.1. Ablation Study on the Encoding Component

The interaction between the DP-Adapter and LoRA is crucial for the model’s performance. While the DP-Adapter is responsible for the dynamic extraction and alignment of multi-scale visual features, LoRA facilitates the efficient task-specific adaptation of the LLaMA backbone. The synergy between these two components allows the MFDIT model to bridge the modality gap effectively: the adapter provides flexible visual prompts that prevent information loss, while LoRA ensures these prompts are accurately interpreted by the frozen language model for high-quality caption generation.
As shown in Table 2, the ablation experiments focus on the encoding stage. Specifically, MFDIT(CLIP) denotes the configuration where global image features are extracted using the CLIP image encoder. MFDIT(SAM) refers to using the SAM image encoder to extract local image features. MFDIT(C+S) indicates the combined use of CLIP for global feature extraction and SAM for local feature extraction. All configurations employ the DP-Adapter mechanism, partially unfreeze the LLM parameters at the decoder side, and utilize the LLaMA model without LoRA fine-tuning. The results are obtained after training, and the best-performing results are highlighted in bold.
The results presented in Table 2 indicate that when the model employs CLIP as the image encoder, it demonstrates superior capability in capturing global image features, outperforming the model that uses SAM as the encoder across most evaluation metrics. However, the SAM encoder also exhibits strong image feature extraction abilities, with a particular focus on capturing local features, such as semantic relationships between objects, showing heightened sensitivity to complex scenes. Consequently, the SAM-based model achieves better performance on the SPICE metric. To fully leverage the advantages of both encoders, this study further integrates the feature extraction capabilities of CLIP and SAM, while introducing learnable query tokens into the visual features. This combination leads to improved overall model performance, achieving a 3-point increase in the CIDEr score.

4.5.2. Ablation Study on the DP-Adapter

As shown in Table 3, to validate the effectiveness of the proposed DP-Adapter mechanism, ablation experiments are conducted specifically on the DP-Adapter component of the model.
In all configurations, the encoding component employs the collaborative use of both CLIP and SAM. The LLaMA backbone parameters are kept frozen, and LoRA is not applied, ensuring that the observed performance gain is solely due to the DP-Adapter. MFDIT(base) refers to the model that does not utilize the DP-Adapter and instead relies solely on the Adapter proposed in the baseline for multimodal alignment. MFDIT(DP) indicates the model that adopts the DP-Adapter method for multimodal alignment. The results in Table 3 demonstrate that the proposed DP-Adapter consistently outperforms the baseline model across all evaluation metrics. These experimental findings confirm that enhancing the feature representation capability of the word embedding layer by introducing a lightweight MLP with a small number of trainable parameters is an effective strategy. This approach improves model performance, achieving comprehensive improvements in BLEU scores and a 1.3-point increase in the CIDEr score, thereby supporting the effectiveness of the DP-Adapter in multimodal feature extraction and alignment.

4.5.3. Ablation Study on Partial Parameter Fine-Tuning of LLaMA

As shown in Table 4, to verify the effectiveness of fine-tuning the specific LLaMA parameters described in this study, additional ablation experiments are conducted. In all configurations, the encoding component integrates both CLIP and SAM, and the DP-Adapter is employed to align the image and text modalities. To isolate the effect of partial parameter fine-tuning, MFDIT(f) denotes the configuration in which the specified LLaMA parameters remain frozen, while MFDIT(Uf) denotes the configuration in which they are unfrozen and fine-tuned for task-specific adaptation. The final results are obtained after training, with the best-performing results highlighted in bold.
The experimental results in Table 4 indicate that the proposed method has a positive impact on improving model performance. The strategy of unfreezing specific parameters of the large language model leads to a significant performance gain, with the CIDEr score increasing by 1.8 points compared to the base configuration. By introducing additional trainable parameters, the multimodal alignment capability of the MFDIT model is enhanced, allowing the LLaMA model to better fit the data distribution of the dataset. These results further validate the effectiveness of the proposed method.

4.6. Comparative Experiments

To verify the performance of the proposed MFDIT model, this section presents extensive comparative experiments conducted on both the MME Benchmark and the MSCOCO dataset. The comparison includes several mainstream and representative methods, such as LLaMA-Adapter V2(L-A V2), BLIVA, GIT2, and BLIP-2, aiming to comprehensively evaluate the model’s performance across various scenarios. The experimental results are presented in the relevant tables and figures within this section.

4.6.1. Comparison on the MME Benchmark

In this experiment, the image long-text description model is trained using the LLaMA-Adapter V2 training paradigm, with further fine-tuning based on the LoRA approach. Specifically, this study improves upon the original pre-trained LLaMA-Adapter V2 model by modifying the multi-feature input mechanism of CLIP and SAM and integrating the DP-Adapter to enhance the model’s performance in multimodal tasks. This method retains the strong instruction-following ability of the original model while optimizing the feature fusion strategy, enabling the model to more accurately comprehend and generate responses under diverse input conditions. To validate the effectiveness of the proposed model, a comprehensive evaluation is conducted on the MME Benchmark dataset, and the results are compared with those of the current state-of-the-art methods. The experimental results are shown in Table 5 and Table 6.
The analysis of the experimental results in Table 5 reveals that the MFDIT model outperforms the baseline across most evaluation metrics, with an overall score that surpasses several mainstream models. Notably, the MFDIT model demonstrates a significant advantage in the OCR and Count metrics. This advantage is primarily attributed to the incorporation of the SAM model as the image feature extractor during the feature extraction stage. The SAM model possesses strong object segmentation and localization capabilities, enabling it to accurately capture the spatial positions and quantitative relationships of objects within an image. Consequently, this greatly enhances the model’s performance on tasks that rely on precise object recognition and counting, thereby contributing to the superior results observed in these specific metrics.
A further comparison illustrated in Figure 5 provides a more intuitive visualization of the performance differences among the various models across the relevant tasks, which further indicates improved performance of the MFDIT model in multimodal perception and reasoning capabilities.
The analysis of the experimental results in Table 6 indicates that the MFDIT model outperforms the baseline across most evaluation metrics, further confirming its effectiveness in multimodal tasks. However, it is noteworthy that the MFDIT model shows slightly inferior performance on the code_reasoning metric compared to the baseline. This may be due to the introduction of a larger number of trainable parameters during the fine-tuning process, which may have caused the model to focus more on the understanding and processing of image features, potentially affecting its performance in pure text reasoning tasks such as code reasoning. Despite this minor limitation in a single metric, the MFDIT model still demonstrates significant advantages in all other evaluation areas, further validating its overall performance improvement in multimodal reasoning tasks.

4.6.2. Comparative Analysis on the MSCOCO Dataset

To more directly illustrate the experimental results, this section selects images from the MSCOCO test set and compares the short-text descriptions generated by LLaMA-Adapter V2 (used as the baseline) and the MFDIT model (short-text description version) proposed in this study. The generated descriptions from both models are compared with the annotated ground-truth labels. In Table 7, Ours represents the descriptions generated by the proposed model, Baseline represents the descriptions generated by LLaMA-Adapter V2, and GT1 and GT2 are the ground-truth descriptions.
For example, in the first image, the proposed model outputs “Bottles”, capturing the bottle details in the image, while the baseline model fails to adequately capture this information, demonstrating the proposed model's advantage in fine-grained detail representation. In the second image, the baseline model, benefiting from the prior knowledge of the LLaMA model, can broadly summarize the image, but the MFDIT model, refined with the DP-Adapter, focuses more on detailed elements in the image, such as “mushrooms”. In the third image, the proposed model performs better, generating a more complete and accurate description, while the baseline model outputs “Wine and food pairing”, which lacks comprehensive detail due to insufficient image-text alignment and incomplete image feature extraction.
Despite the overall improvements, MFDIT still fails in challenging cases. Typical errors include missing small or heavily occluded objects, confusing fine-grained attributes (e.g., color or count), and occasionally hallucinating relations when the visual evidence is ambiguous.

4.6.3. Comparison of Training Datasets and Trainable Parameter Scales

Table 8 presents a comparative analysis of various multimodal vision-language models across five dimensions: language instruction dataset (LID), image-text pair dataset (ITPD), visual instruction dataset (VID), fine-tuned parameters (FTPs), and MME evaluation (MME Eva). Compared to models like MiniGPT-4 and LLaVA, MFDIT achieves a superior balance between efficiency and effectiveness, maintaining a high MME score with a significantly smaller data and parameter budget. To clarify the contributions under this PEFT setting, we further analyze the distribution of the 20 million trainable parameters. The majority of this budget is allocated to the LoRA modules integrated into the LLaMA backbone’s self-attention layers to facilitate linguistic adaptation. The DP-Adapter occupies a smaller fraction, focusing on the dynamic projection and alignment of multi-modal features, while a minimal set of selectively unfrozen layers ensures smooth cross-modal information flow. This strategic parameter allocation allows MFDIT to maximize representation gains while keeping the backbone encoders and the remaining LLaMA parameters strictly frozen.
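As a practical aside, the trainable-parameter budget described above can be audited with a short helper such as the sketch below, where `model` stands for any assembled MFDIT-style module.

```python
def count_trainable(model):
    """Returns the trainable parameter count and its share of all parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, trainable / total  # reported setup: ~20M and ~0.05%
```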

5. Discussion

To address the trade-off between parameter scale and generation quality in vision-language models, this study proposes a parameter-efficient image captioning framework, termed MFDIT. By integrating CLIP global semantic features with SAM local instance segmentation features, a dual-stream visual encoder is constructed that enhances fine-grained representation capacity while adding only 0.05% additional parameters, yielding a 3.0-point improvement in the CIDEr score on the MSCOCO dataset. The SAM branch provides complementary instance-level local cues, though it also introduces additional inference computation compared with a single global encoder. The DP-Adapter with zero-initialized gates adjusts cross-modal attention weights adaptively via learnable gating factors, achieving a 7.3% accuracy gain on the location recognition task of the MME Benchmark and validating the effectiveness of the dynamic semantic routing mechanism. Combined with LoRA-based low-rank adaptation and normalization-layer unfreezing, the model generates diverse image descriptions with only 20M trainable parameters, supporting parameter-efficient deployment in resource-constrained scenarios. Experimental results demonstrate that MFDIT exhibits significant advantages in complex scene understanding tasks, and visualized examples show that it integrates fine-grained prior knowledge to generate more detailed and informative descriptions.
Nevertheless, the current study still faces limitations such as strong dependence on pre-trained models and insufficient coherence in long-text generation. While freezing the CLIP/SAM visual encoders and the LLaMA language backbone reduces training costs, it restricts deep cross-modal co-adaptation, since the visual features cannot be iteratively refined to suit specific linguistic nuances during training, and the transferability to specialized domains remains to be validated. The dynamic instruction tuning mechanism is better suited to short-text descriptions, whereas paragraph-level generation tends to suffer from degraded semantic coherence. Furthermore, in tasks requiring multi-step logical reasoning, the model still underperforms compared to billion-parameter VLMs, highlighting the challenge of modeling higher-order cognitive abilities within a small parameter regime. Future work could explore partial fine-tuning of the visual encoders or the integration of trainable bridging layers to further enhance the synergy between visual perception and text generation.
Future research will focus on enhancing multi-modal joint representation learning and dynamic routing. Plans include exploring joint fine-tuning strategies for lightweight vision-language encoders, compressing CLIP/SAM models through knowledge distillation, and building an end-to-end trainable multimodal backbone. The dynamic routing mechanism will be extended by evolving layer-wise independent gating into a cross-layer attention routing network and incorporating reinforcement learning for task-adaptive dynamic prompt generation. To improve causal reasoning capabilities, causal graph convolution modules will be injected into LoRA fine-tuning, combined with chain-of-thought prompt engineering to enhance complex visual reasoning. In addition, a contrastive learning-based self-supervised pre-training framework will be developed to reduce reliance on human-annotated data and advance general vision-language understanding in more diverse scenarios. The proposed framework will be further evaluated in industrial quality inspection systems and extended to other application domains such as intelligent education and unmanned inspection, accelerating the practical application of lightweight multimodal models in edge computing environments.

Author Contributions

Conceptualization, Y.Y. and H.C.; methodology, H.C.; software, Y.Y.; validation, Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; resources, Y.Y.; data curation, F.J.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y.; visualization, X.L.; supervision, C.Z.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jiangsu Province Major Scientific Project, grant number BG2024032.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We would like to thank Bo Wen for his technical support, and Nanjing Panda Electronics Co., Ltd., for providing the necessary resources for this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Tensor notation and dimensions.
Symbol | Meaning | Shape/Dimension
$X$ | input image batch | $\mathbb{R}^{B \times C \times H \times W}$
$F_{clip}$ | CLIP visual token features | $\mathbb{R}^{B \times F \times D_1}$
$F_{SAM}$ | SAM visual token features | $\mathbb{R}^{B \times F \times D_2}$
$Q_{clip}$ | learnable query tokens for CLIP stream | $\mathbb{R}^{B \times q \times D_1}$
$Q_{SAM}$ | learnable query tokens for SAM stream | $\mathbb{R}^{B \times q \times D_2}$
$F'_{clip}$ | concatenated CLIP tokens with queries | $\mathbb{R}^{B \times (F+q) \times D_1}$
$F'_{SAM}$ | concatenated SAM tokens with queries | $\mathbb{R}^{B \times (F+q) \times D_2}$
$Z_{clip}$ | projected CLIP features after Linear | $\mathbb{R}^{B \times (F+q) \times D_0}$
$Z_{SAM}$ | projected SAM features after Linear | $\mathbb{R}^{B \times (F+q) \times D_0}$
$Q_Z$ | fused visual features | $\mathbb{R}^{B \times (F+q) \times D_0}$
$P_l^{e}$ | embedded prompt representation at layer $l$ | $\mathbb{R}^{p \times C_{LM}}$
$P'_l$ | transformed prompt after MLP | $\mathbb{R}^{p \times C_{LM}}$
$A$ | LoRA low-rank matrix | $\mathbb{R}^{d \times r}$
$B$ | LoRA low-rank matrix | $\mathbb{R}^{r \times k}$
$\Delta W$ | LoRA weight update | $\mathbb{R}^{d \times k}$
Dimension symbols (not listed as table rows): $B$ batch size; $C$ image channels; $H$, $W$ image resolution; $F$ number of visual tokens; $q$ number of learnable query tokens; $D_1$ CLIP embedding dimension; $D_2$ SAM embedding dimension; $D_0$ projected fusion dimension; $p$ prompt length.

References

  1. OpenAI. GPT-4V(ision) System Card. 25 September 2023. Available online: https://openai.com/index/gpt-4v-system-card/ (accessed on 7 January 2026).
  2. Li, J.; Li, D.; Savarese, S.; Hoi, S.C.H. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
  3. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023.
  4. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. In Proceedings of the Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 23716–23736.
  5. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 8469–8488.
  6. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  8. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are image-text foundation models. arXiv 2022, arXiv:2205.01917.
  9. Li, J.; Selvaraju, R.; Gotmare, A. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705.
  10. Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. SimVLM: Simple visual language model pretraining with weak supervision. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
  11. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
  12. Adeniji, A.; Xie, A.; Sferrazza, C. Language reward modulation for pretraining reinforcement learning. arXiv 2023, arXiv:2308.12270.
  13. Mahmoudieh, P.; Pathak, D.; Darrell, T. Zero-shot reward specification via grounded natural language. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 14743–14752.
  14. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 4582–4597.
  15. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799.
  16. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
  17. Hayou, S.; Ghosh, N.; Yu, B. LoRA+: Efficient low rank adaptation of large models. arXiv 2024, arXiv:2402.12354.
  18. Zi, B.; Qi, X.; Wang, L.; Wang, J.; Wong, K.-F.; Zhang, L. Delta-LoRA: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv 2023, arXiv:2309.02411.
  19. Jiang, Z.; Xu, F.F.; Araki, J.; Neubig, G. How can we know what language models know? Trans. Assoc. Comput. Linguist. 2020, 8, 423–438.
  20. Haviv, A.; Berant, J.; Globerson, A. BERTese: Learning to speak to BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Virtual, 19–23 April 2021; pp. 3618–3623.
  21. Yuan, W.; Neubig, G.; Liu, P. BARTScore: Evaluating generated text as text generation. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 27263–27277.
  22. Shin, T.; Razeghi, Y.; Logan, R.L., IV; Wallace, E.; Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Virtual, 16–20 November 2020; pp. 4222–4235.
  23. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059.
  24. Zhong, Z.; Friedman, D.; Chen, D. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual, 6–11 June 2021; pp. 5017–5033.
  25. Qin, G.; Eisner, J. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual, 6–11 June 2021; pp. 5203–5212.
  26. Hambardzumyan, K.; Khachatrian, H.; May, J. WARP: Word-level adversarial reprogramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 4921–4933.
  27. Shu, M.; Nie, W.; Huang, D.-A.; Yu, Z.; Goldstein, T.; Anandkumar, A.; Xiao, C. Test-time prompt tuning for zero-shot generalization in vision-language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 14274–14289.
  28. Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual, 1–6 August 2021; pp. 3816–3830.
  29. Ju, C.; Han, T.; Zheng, K.; Zhang, Y.; Xie, W. Prompting visual-language models for efficient video understanding. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 105–124.
  30. Shen, S.; Yang, S.; Zhang, T.; Zhai, B.; Gonzalez, J.E.; Keutzer, K.; Darrell, T. Multitask vision-language prompt tuning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 5656–5667.
  31. Jung, D.; Han, D.; Bang, J.; Song, H. Generating instance-level prompts for rehearsal-free continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 11847–11857.
  32. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
  33. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 16816–16825.
  34. Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19113–19122.
  35. Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.-N. Visual prompt tuning. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 709–727.
  36. Khattak, M.U.; Wasim, S.T.; Naseer, M.; Khan, S.; Yang, M.-H.; Khan, F.S. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15190–15200.
Figure 1. Comparison of Parameter Scales in Typical Vision-Language Models (Data Source: References [1,2,3,4,5]).
Figure 2. Overall architecture of the proposed MFDIT image captioning framework.
Figure 3. Comparison of visual feature extraction by CLIP and SAM.
Figure 4. Structure of the Dynamic Prompt Adapter (DP-Adapter).
Figure 5. Comparison of Perception Performance Results in MME Benchmark.
Table 1. Experimental Environment Configuration for MFDIT.
Configuration    | Version
Operating System | Ubuntu 22.04
CPU              | AMD EPYC 9754 128-Core
GPU              | NVIDIA RTX 4090 (24 GB) × 8
CUDA             | 12.1
Python           | 3.10
Framework        | PyTorch 2.1.0
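The software stack in Table 1 can be verified at runtime. The snippet below is an illustrative sanity check, not part of the authors' released code; its output should match the table entries:

```python
import torch

# Print the versions and devices that should match Table 1.
print(f"PyTorch: {torch.__version__}")            # expected: 2.1.0
print(f"CUDA:    {torch.version.cuda}")           # expected: 12.1
print(f"GPUs:    {torch.cuda.device_count()}")    # expected: 8
if torch.cuda.is_available():
    # Each card should report as an RTX 4090 with 24 GB of memory.
    print(f"Device:  {torch.cuda.get_device_name(0)}")
```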
Table 2. Results of the Encoding Component Ablation Study.
Model        | B@1  | B@2  | B@3  | B@4  | M    | R    | C     | S
MFDIT (CLIP) | 80.6 | 64.9 | 49.5 | 38.9 | 30.9 | 59.6 | 123.7 | 22.5
MFDIT (SAM)  | 79.9 | 64.5 | 48.7 | 38.4 | 30.5 | 59.3 | 123.1 | 23.0
MFDIT (C+S)  | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9
B@n = BLEU-n; M = METEOR; R = ROUGE-L; C = CIDEr; S = SPICE.
Table 3. Ablation Study on DP-Adapter.
Model        | B@1  | B@2  | B@3  | B@4  | M    | R    | C     | S
MFDIT (base) | 80.1 | 65.0 | 49.6 | 39.1 | 30.9 | 59.2 | 125.4 | 22.0
MFDIT (DP)   | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9
Table 4. Ablation Study on MFDIT Parameter Fine-tuning Strategies.
Model      | B@1  | B@2  | B@3  | B@4  | M    | R    | C     | S
MFDIT (f)  | 78.0 | 62.2 | 47.9 | 38.1 | 29.3 | 58.0 | 124.9 | 21.8
MFDIT (Uf) | 81.1 | 65.2 | 50.0 | 39.5 | 31.1 | 60.4 | 126.7 | 23.9
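The headline ablation gains in Tables 2–4 all reduce to CIDEr differences against the full MFDIT configuration (126.7 in each table). An illustrative recomputation, with scores hard-coded from the tables above:

```python
# CIDEr scores copied from Tables 2-4; the full model scores 126.7 in each.
FULL_CIDER = 126.7
ablations = {
    "MFDIT(CLIP)": 123.7,  # Table 2: global CLIP features only
    "MFDIT(SAM)":  123.1,  # Table 2: local SAM features only
    "MFDIT(base)": 125.4,  # Table 3: without the DP-Adapter
    "MFDIT(f)":    124.9,  # Table 4: alternative fine-tuning strategy
}
for variant, score in ablations.items():
    print(f"{variant}: full model is +{FULL_CIDER - score:.1f} CIDEr")
```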
Table 5. Comparison of Perception Performance on the MME Benchmark.
Task      | MFDIT  | L-A V2 | BLIVA  | GIT2   | BLIP-2
existence | 190.0  | 185.0  | 180.0  | 190.0  | 160.0
count     | 138.3  | 133.3  | 138.3  | 118.3  | 135.0
position  | 63.3   | 56.7   | 81.7   | 96.7   | 73.3
color     | 118.3  | 118.3  | 180.0  | 158.3  | 148.3
posters   | 150.7  | 148.0  | 155.1  | 112.6  | 141.8
celebrity | 135.3  | 134.7  | 140.9  | 145.9  | 105.6
scene     | 155.5  | 156.3  | 151.5  | 158.5  | 145.3
landmark  | 166.0  | 167.8  | 89.5   | 140.5  | 138.0
artwork   | 123.5  | 123.5  | 133.3  | 146.3  | 136.5
OCR       | 110.0  | 102.5  | 87.5   | 65.0   | 110.0
total     | 1350.9 | 1326.1 | 1337.8 | 1332.1 | 1293.8
Table 6. Comparison of Cognition Performance on the MME Benchmark.
Task                  | MFDIT | L-A V2 | BLIVA | GIT2  | BLIP-2
commonsense_reasoning | 110.0 | 106.4  | 136.4 | 99.2  | 110.0
numerical_calculation | 50.0  | 47.5   | 57.5  | 50.0  | 40.0
text_translation      | 112.5 | 112.5  | 77.5  | 67.5  | 65.0
code_reasoning        | 87.5  | 90.0   | 60.0  | 45.0  | 75.0
total                 | 360.0 | 356.4  | 331.4 | 261.7 | 290.0
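As a consistency check, the "total" rows in Tables 5 and 6 are plain sums of the per-task scores, and their grand total matches the MME Eva entry for MFDIT in Table 8. An illustrative recomputation for the MFDIT column:

```python
# MFDIT per-task MME scores copied from Tables 5 and 6.
perception = {
    "existence": 190.0, "count": 138.3, "position": 63.3, "color": 118.3,
    "posters": 150.7, "celebrity": 135.3, "scene": 155.5,
    "landmark": 166.0, "artwork": 123.5, "OCR": 110.0,
}
cognition = {
    "commonsense_reasoning": 110.0, "numerical_calculation": 50.0,
    "text_translation": 112.5, "code_reasoning": 87.5,
}
p_total, c_total = sum(perception.values()), sum(cognition.values())
print(f"perception total: {p_total:.1f}")            # 1350.9 (Table 5)
print(f"cognition total:  {c_total:.1f}")            # 360.0  (Table 6)
print(f"MME overall:      {p_total + c_total:.1f}")  # 1710.9 (Table 8)
```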
Table 7. Comparison of MFDIT and Baseline Results.
Example 1 (image Applsci 16 01857 i001)
GT:
1. A table with wine glasses and bottles of wine
2. There are bottles of wine and wine glasses arranged on a table
Baseline: Wine Tasting at the Wine Cellar
Ours: Bottles of wine and wine glasses at the Wine Cellar

Example 2 (image Applsci 16 01857 i002)
GT:
1. Mushrooms are used in many variety of dishes
2. A chicken and rice dish being cooked.
Baseline: Collage of images showing how to make a chicken and vegetable stew
Ours: A chicken and rice dish is being cooked in a pan along with vegetables and mushrooms.

Example 3 (image Applsci 16 01857 i003)
GT:
1. A wooden table filled with wine glasses and a bowl of salad.
2. A table has three wine glasses, several bottles of wine, and a bowl of salad.
Baseline: Wine and food pairing
Ours: A table with wine glasses, bottles, and salad.

Example 4 (image Applsci 16 01857 i004)
GT:
1. A person is eating at a table with plates, a fork, knife, spoon, cup, and cell phone on it.
2. Overhead view of a table with a log and food on it.
Baseline: A woman eats a sandwich while sitting on a couch with a coffee and a book.
Ours: A woman enjoys a sandwich with coffee and a cell phone beside her.

Example 5 (image Applsci 16 01857 i005)
GT:
1. A plate of beans and eggs with a knife and fork on it.
2. A plate of beans, eggs, and toast is on the table.
Baseline: Fried Eggs on Toast
Ours: There are beans and eggs with a knife and fork on a plate.
Table 8. Dataset Composition and Trainable Parameter Counts of Compared Models.
Model     | LID  | ITPD  | VID   | FTP  | MME Eva
MiniGPT-4 | 70 K | 134 M | 5 K   | 13 B | 725.95
LLaVA     | 70 K | 595 K | 158 K | 13 B | 1826.67
L-A V2    | 52 K | 567 K | 0 K   | 14 M | 1682.5
MFDIT     | 52 K | 567 K | 0 K   | 20 M | 1710.9
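The FTP column counts trainable (fine-tuned) parameters only, which for MFDIT is roughly 20 M, a small fraction of the full model. A minimal sketch of how such counts are typically obtained in PyTorch; the frozen backbone and small trainable head below are toy stand-ins, not the actual MFDIT modules:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy stand-in: a frozen "backbone" plus a small trainable "head",
# mirroring how the FTP column isolates fine-tuned parameters from
# the frozen LLaMA backbone.
backbone = nn.Linear(4096, 4096)
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(4096, 8)
model = nn.Sequential(backbone, head)

trainable, total = count_parameters(model)
print(f"{trainable:,} trainable / {total:,} total "
      f"({100 * trainable / total:.2f}%)")
```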