Article

MFEAM: Multi-View Feature Enhanced Attention Model for Image Captioning

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8368; https://doi.org/10.3390/app15158368
Submission received: 25 June 2025 / Revised: 20 July 2025 / Accepted: 21 July 2025 / Published: 28 July 2025

Abstract

Image captioning plays a crucial role in aligning visual content with natural language, serving as a key step toward effective cross-modal understanding. The Transformer has become the dominant language model in image captioning. However, existing Transformer-based models seldom highlight important features from multiple views when applying self-attention. In this paper, we propose MFEAM, an innovative network that leverages multi-view feature enhanced attention. To accurately represent the entangled features of vision and text, the teacher model employs the multi-view feature enhanced attention and guides the student model training through knowledge distillation and model averaging from both visual and textual views. To mitigate the impact of excessive feature enhancement, the student model divides the decoding layers into two groups, which separately process instance features and the relationships between instances. Experimental results demonstrate that MFEAM attains competitive performance on the MSCOCO (Microsoft Common Objects in Context) dataset when trained without leveraging external data.

1. Introduction

Images and text serve as two important media for conveying information. Images encompass more extensive information, but they are more challenging to interpret. On the other hand, text conveys information in a more straightforward manner. Early research utilized Convolutional Neural Networks (CNNs) [1,2,3,4] and Recurrent Neural Networks (RNNs) [1,2,4,5,6]. RNNs facilitate the transmission of information over different time steps, efficiently capturing temporal dependencies within the data. Nevertheless, RNNs encounter difficulties such as vanishing and exploding gradients, which significantly limit their capacity to capture long-distance dependencies, especially when processing lengthy sequences. The Long Short-Term Memory (LSTM) network [7,8,9,10] represents a substantial advancement over RNNs. The advent of the Transformer has rectified the computational inefficiency of LSTMs in handling lengthy sequences, facilitating parallel processing of the input sequence and enhancing training speed. In addition, vision-language pre-trained models, such as Contrastive Language–Image Pretraining (CLIP) [11], have dramatically enhanced the capability of cross-modal applications by learning cross-modal alignment representations on large-scale image-text pairs. Nowadays, image captioning methods have significantly advanced with the adoption of Transformer-based models [12,13,14,15], while the visual feature extraction phase is swiftly advancing, transitioning to multimodal architectures [11,16,17,18] trained on large-scale datasets.
Despite advances in architectures and network structures, most contemporary image captioning methods utilize a single language model to generate captions. Recently, a structure with two language models [13,19] has been employed to learn visual representations through knowledge distillation and model averaging [20,21,22]. Meanwhile, many image captioning datasets [23,24,25,26,27] contain large amounts of images and text, resulting in highly complex and diverse input features. As a result, Transformer-based image captioning models are capable of effectively extracting visual and textual features, but the complexity of the input leads to suboptimal performance in accurately expressing the entangled features of vision and text. To address this challenge, we introduce the multi-view feature enhanced attention to strengthen the representation of entangled features of vision and text. Through the integration of the Mean Teacher Learning approach, the teacher model can attain a superior comprehension of important features, facilitating more effective parameter updates for the student model.
In this paper, we propose a knowledge distillation framework named MFEAM, which is built upon multi-view feature enhanced attention. MFEAM adopts an encoder–decoder architecture as its foundational framework. The CLIP model functions as the visual encoder, while the language decoder comprises the teacher model and the student model. The teacher model employs the multi-view feature enhanced attention to interpret important features from both visual and textual views. Through knowledge distillation and model averaging, the teacher model gradually follows the state of the student model over time. In addition, inspired by the Guided-Embedding Network [28], we divide the decoder of the student model into two groups to separately process instance and relationship features, which helps prevent performance degradation caused by overemphasizing features. We formulate and evaluate strategies that implement this interaction paradigm during the training phase. Experimental results demonstrate that MFEAM attains competitive performance on the MSCOCO dataset when trained without leveraging external data. Our main contributions are summarized as follows:
  • We present a novel architecture called MFEAM, which incorporates the multi-view feature enhanced attention. By overcoming the limitations of relying on a single language model as the decoder and integrating the advanced Mean Teacher architecture, the model significantly improves upon baseline performance.
  • To strengthen the model’s ability to handle the entangled features of vision and text, we propose the multi-view feature enhanced attention (MFE). MFE is made up of the self-enhanced module and the sliding window module, which collaboratively assist the teacher model in comprehending entangled features from both visual and textual views.
  • To mitigate performance degradation caused by overemphasis on certain features, a novel feature processing strategy is adopted in the student model. Specifically, the decoder layers are divided into two groups, with one focusing on processing instance features and the other responsible for capturing relational features between instances.

2. Related Work

2.1. Image Captioning

There has been a marked enhancement in the effectiveness of image captioning in recent years. Similar to machine translation, image captioning research [13,14,29,30,31] follows the encoder–decoder architecture. The visual encoder extracts spatial information and learns visual representations from images. Lu et al. [9] use an attention mechanism module to predict the next word, deciding whether to emphasize visual content or rely on language signals. Traditional language models have mainly relied on RNNs, but they struggle with information loss and training difficulties in long sequences. Yao et al. [32] address this by demonstrating a two-layer LSTM structure designed to investigate the connections between visual features and components of the generated caption. However, with the emergence and success of the Transformer architecture [33], the encoder–decoder structure used for image captions has increasingly converged towards the Transformer paradigm. Our approach is predicated on Transformer-based image captioning and improves the capacity of the model to process the entangled features of vision and text through the introduction of the multi-view feature enhanced attention.

2.2. Knowledge Distillation

Knowledge distillation was first introduced in [34], where it enhances the effectiveness of the student model by transferring the teacher model’s supervisory signal, and it has demonstrated broad applicability in multimodal tasks. In generation tasks, distillation techniques can improve the ability of the model to capture long-range semantic dependencies through feature-level or output-level constraints, while reducing the overfitting issue resulting from limited training data. Besides logits, other forms of knowledge, such as intermediate representations and attention maps [13,35,36,37], have been utilized to transfer the information embedded within models. Zhou et al. [38] present POS-SCAN. Utilizing a Part-of-Speech tagger, the model preserves noun words during the calculation of the matching score, and the subsequently re-trained POS-SCAN fulfills the criteria of the downstream task. Bajpai et al. [39] present a novel self-distillation structure, which improves model performance by introducing an online learning algorithm to dynamically select the optimal threshold. In our case, we present a knowledge distillation architecture built upon the multi-view feature enhanced attention, in which the teacher model leverages this attention to facilitate a deeper understanding of features within the student model.

2.3. Vision-Language Pretrained Model

In recent years, pre-training techniques based on large-scale cross-modal alignment have achieved notable advancements in multimodal comprehension. The CLIP model proposed by OpenAI is trained on extensive image-text pairs, enabling its visual encoder to extract semantically highly generalized visual features [11]. Models such as Google’s ALIGN extend the training dataset and model scale, and they verify the effectiveness and versatility of the cross-modal feature learning pre-training strategy [40,41]. These models lay the foundation for multimodal generation tasks through a unified embedding space.

2.4. Dynamic Training Strategy

The dynamic training strategy enhances the model’s robustness by adaptively modifying the optimization process. The Mean Teacher architecture [19] uses an exponential moving average of model parameters to generate a regularization target, which has proven effective in semi-supervised classification tasks. In generation tasks, reinforcement learning is used for multi-objective optimization, and the diversity and accuracy of the generated text are further improved by directly optimizing metrics such as CIDEr [42]. Methods such as SCST [5] provide technical support for training stability.

3. Method

As shown in Figure 1a, the main network follows the encoder–decoder paradigm, where the visual encoder utilizes the CLIP model [11] for feature extraction. The language decoder, which comprises the teacher and student model, is responsible for transforming extracted features into captions. The next section provides an in-depth review of the teacher model, elucidating the fundamental principles of the multi-view feature enhanced attention within the teacher model. In addition, we provide a detailed description of the architecture of the student model, which divides the six decoder layers into two structurally identical groups to effectively alleviate performance degradation resulting from an overemphasis on local features. Finally, we elucidate the dual-stage training strategy that is frequently used for image captioning.

3.1. Visual Encoder

As depicted in Figure 1a, we employ the CLIP network as the visual encoder to extract visual features $F = \{f_1, f_2, f_3, \ldots, f_n\}$, where $f_i \in \mathbb{R}^{1 \times d}$. CLIP converts image information into high-dimensional feature vectors, capturing both global and local features within the image. Specifically, for an image $I$, we derive the feature embedding $e_I \in \mathbb{R}^{d}$ by using the CLIP model. A two-layer MLP is utilized to produce the feature set. Formally,
$$ e_I = \mathrm{CLIP}(I) $$
$$ \{ f_1, f_2, \ldots, f_n \} = \mathrm{MLP}(e_I) $$
The visual features are fed into the language decoder, where the attention mechanism is employed to enhance information interaction among the features. Compared to conventional convolutional neural network approaches, CLIP provides more semantically rich and generalizable feature representations, enabling it to better accommodate image understanding tasks across diverse scenarios.
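To make this concrete, the following is a minimal PyTorch sketch of the two equations above, assuming a frozen CLIP image encoder that maps an image to a d-dimensional embedding; the MLP hidden width and the number n of projected feature vectors are illustrative placeholders rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class VisualFeatureProjector(nn.Module):
    """Sketch: a frozen CLIP image encoder produces e_I, and a two-layer MLP
    expands it into the feature set F = {f_1, ..., f_n}, each f_i of size d."""

    def __init__(self, clip_encoder: nn.Module, d: int = 512, n: int = 50):
        super().__init__()
        self.clip_encoder = clip_encoder          # any module mapping images -> (B, d)
        self.n, self.d = n, d
        self.mlp = nn.Sequential(                 # two-layer MLP
            nn.Linear(d, 4 * d),
            nn.ReLU(),
            nn.Linear(4 * d, n * d),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                     # CLIP is used as a fixed feature extractor here
            e_I = self.clip_encoder(images)       # e_I = CLIP(I), shape (B, d)
        f = self.mlp(e_I)                         # shape (B, n * d)
        return f.view(-1, self.n, self.d)         # F, shape (B, n, d)
```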

3.2. Teacher Language Model

As depicted in Figure 1b, the teacher model is a Transformer-based language model incorporating the multi-view feature enhanced attention. The features $F$ are sent to the encoder. After being processed by multiple encoding layers, the contextual features $F_c$ are obtained. These contextual features are passed to each decoding layer to facilitate the generation of image descriptions. During the decoding stage, the multi-view feature enhanced attention not only captures the dependencies within the partially generated sequence but also extracts important details from $F_c$.
The key idea of the MFE is to refine attention weights from two complementary views (the textual view and the visual view), improving the model’s cross-modal semantic representation capability. Specifically, the self-enhanced module strengthens attention weights by analyzing their distribution, encouraging the model to focus more on semantically informative tokens. In parallel, the sliding window module applies a convolutional kernel over the visual attention map, enhancing the model’s spatial awareness and its ability to capture local structure. The outputs from both views are then fused and used to reweight the value vectors, resulting in more expressive multi-modal representations. The overall architecture is illustrated in Figure 2. Firstly, we compute the attention weight $att_i$ from the input vectors $q_i$ and $k_i$, as defined by the following equation:
$$ att_i = \mathrm{Softmax}\!\left( \frac{q_i k_i^{T}}{\sqrt{d_k}} \right) $$
Compared to the standard self-attention mechanism, which directly performs the calculation between $att_i$ and $v_i$, the multi-view feature enhanced attention separately processes the attention weights from two views: the visual attention weights $att_i^{I}$ and the textual attention weights $att_i^{T}$.
For the attention weights of textual features $att_i^{T}$, we enhance the textual features by the self-enhanced module.
$$ \alpha = \frac{1}{\max\left(att_i^{T}\right) + \min\left(att_i^{T}\right)} $$
$$ EA\_att_i^{T} = \left(e^{\alpha} + 1\right) \cdot \gamma \cdot att_i^{T} $$
In Equation (4), $\alpha$ reflects the flatness of the attention distribution. When $\max(att_i^{T})$ and $\min(att_i^{T})$ are close, $\alpha$ becomes larger, indicating that the model lacks a strong semantic focus. A higher $\alpha$ increases the overall magnitude of the attention to encourage more discriminative attention weights. In Equation (5), we use $\gamma$ as a weight parameter to prevent excessive enhancement of $att_i^{T}$, since over-enhancing $att_i^{T}$ can lead to a biased understanding of features by the model and an unstable training process. This adaptive mechanism enables more stable and effective modeling of textual focus in ambiguous or diffuse semantic contexts.
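As a reference, here is a minimal PyTorch sketch of the self-enhanced module built from the attention weights above and Equations (4) and (5); the value of gamma is an assumed placeholder, and tensor shapes follow the usual (batch, heads, queries, keys) layout.

```python
import torch
import torch.nn.functional as F

def attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention weights att_i = Softmax(q k^T / sqrt(d_k))."""
    d_k = q.size(-1)
    return F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)

def self_enhanced(att_T: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Self-enhanced module, Eqs. (4)-(5): rescale the textual attention weights
    according to the flatness of their distribution over the keys."""
    att_max = att_T.amax(dim=-1, keepdim=True)
    att_min = att_T.amin(dim=-1, keepdim=True)
    alpha = 1.0 / (att_max + att_min)                  # Eq. (4): larger for flatter distributions
    return (torch.exp(alpha) + 1.0) * gamma * att_T    # Eq. (5): gamma tempers the enhancement
```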
For the attention weights of visual features $att_i^{I}$, we enhance the visual features by the sliding window module. In contrast to the self-enhanced module, the sliding window module employs a convolution-based approach, where the convolutional kernel slides over the attention weight matrix to capture localized features and enhance the spatial awareness of the model. The convolutional kernel is defined as a 3 × 3 sharpening kernel.
$$ H = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} $$
$$ EA\_att_i^{I} = \mathrm{Conv2D}\left(att_i^{I}\right) $$
where $\mathrm{Conv2D}$ denotes the two-dimensional convolution operation using the kernel $H$. The sliding window mechanism applies the convolutional kernel over the surrounding regions of each position, enabling the model to effectively capture local variations.
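A possible implementation of the sliding window module is sketched below, treating each head's attention map as a single-channel image and convolving it with the sharpening kernel H; the padding choice and per-head handling are assumptions not specified in the text.

```python
import torch
import torch.nn.functional as F

# 3x3 sharpening kernel H
H = torch.tensor([[0., -1., 0.],
                  [-1., 5., -1.],
                  [0., -1., 0.]])

def sliding_window(att_I: torch.Tensor) -> torch.Tensor:
    """Sliding window module: convolve the visual attention map with H.
    att_I has shape (batch, heads, L_q, L_k)."""
    b, h, lq, lk = att_I.shape
    x = att_I.reshape(b * h, 1, lq, lk)                            # fold heads into the batch dimension
    kernel = H.to(dtype=att_I.dtype, device=att_I.device).view(1, 1, 3, 3)
    out = F.conv2d(x, kernel, padding=1)                           # 'same' padding keeps the map size
    return out.reshape(b, h, lq, lk)
```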
Lastly, we fuse the attention weights of the visual and textual features and perform a dot product operation with $v_i$ to obtain the output $M_i$.
$$ EA\_att_i = \mathrm{Concat}\left(EA\_att_i^{I},\, EA\_att_i^{T}\right) $$
$$ M_i = EA\_att_i \cdot v_i $$
where $EA\_att_i$ is the attention weight obtained after the fusion of $EA\_att_i^{T}$ and $EA\_att_i^{I}$.
In this way, the teacher model optimizes attention weight allocation from the visual and textual feature views, enabling MFEAM to emphasize important information while avoiding redundant processing of unnecessary information. During the subsequent knowledge distillation process, the teacher model assists the student model in understanding features effectively, enhancing overall performance and accuracy.

3.3. Student Language Model

The student model adopts a Transformer-based architecture. Unlike the standard Transformer, which processes features layer by layer, and inspired by the Guided-Embedding Network [28], we divide the six decoder layers into two groups that share an identical architecture. As illustrated in Figure 3, one group functions as the Instance Decoder, responsible for detecting objects in the image and generating the corresponding features for each instance. The other group, termed the Relationship Decoder, focuses on identifying interaction relationships among objects.
As shown in Figure 3, the features $F$ are sent to the encoder. After processing through multiple encoder layers, the encoder generates the contextual features $F_c$. During the decoding stage, two different attention mechanisms are employed to extract important information from $F_c$. In the instance decoder, the features generated by each decoder layer are fed into the relationship decoder. The calculation process is as follows:
$$ R_0^{ID} = \mathrm{SelfAttn}(F_c) $$
$$ R_i^{ID} = \mathrm{CrossAttn}\left(R_{i-1}^{ID}, F_c\right) $$
where $i = 1, 2, 3$. In the relationship decoder, to guide the interaction features $R_i^{ID}$ to explore informative regions, we take $R_i^{ID}$ as input and send it to the $i$-th relationship decoder layer.
By leveraging the output of the instance decoder, each interaction feature in the relationship decoder can be directly linked to the corresponding object. In MFEAM, the decoder structure of the student model can better focus on the interaction relationships between targets after receiving feature knowledge from the teacher model. This mechanism helps mitigate performance degradation caused by excessive emphasis on features.
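The grouped decoding scheme can be sketched with standard PyTorch layers as stand-ins, as below; how the caption tokens enter each group, the masking, and the exact layer internals are simplifying assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GroupedStudentDecoder(nn.Module):
    """Sketch of the two-group student decoder: the Instance Decoder starts from
    R_0 = SelfAttn(F_c) and refines it by cross-attending to F_c, while the i-th
    Relationship Decoder layer is conditioned on the i-th instance output R_i."""

    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        self.init_self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

        def make_layer() -> nn.TransformerDecoderLayer:
            return nn.TransformerDecoderLayer(d_model, nhead,
                                              dim_feedforward=2048, batch_first=True)

        self.instance_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
        self.relation_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])

    def forward(self, tokens: torch.Tensor, F_c: torch.Tensor) -> torch.Tensor:
        # Instance decoder: R_0 = SelfAttn(F_c); R_i = CrossAttn(R_{i-1}, F_c)
        r_id, _ = self.init_self_attn(F_c, F_c, F_c)
        instance_states = []
        for layer in self.instance_layers:
            r_id = layer(tgt=r_id, memory=F_c)
            instance_states.append(r_id)
        # Relationship decoder: the i-th layer refines the caption-token
        # representations while attending over the i-th instance output.
        out = tokens
        for layer, r_i in zip(self.relation_layers, instance_states):
            out = layer(tgt=out, memory=r_i)
        return out
```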

3.4. Interaction Strategy

The interaction strategy consists of knowledge distillation and model averaging. In the training phase, we leverage the teacher model to provide regression targets. This process is facilitated by knowledge distillation, where the teacher model updates its parameters according to an exponentially weighted average of the student model’s parameters. Given the time step $\tau$, the loss function is as follows:
$$ L(\theta_s) = \min_{\theta_s^{\tau}} \left\| f(\theta_t) - f(\theta_s) \right\|^{2} $$
where $\theta_s$ and $\theta_t$ are the parameter sets of the student model and the teacher model, respectively, and $f(\theta_s)$ and $f(\theta_t)$ are the probabilities predicted by the student model and the teacher model.
In model averaging, at the time step $\tau$, the student model dynamically adjusts the teacher model based on the historical average of its model parameters, helping the teacher model provide a smoother and more stable learning target.
$$ \theta_t \leftarrow \sigma \theta_t + (1 - \sigma)\, \theta_s $$
where $\sigma$ is a target decay rate. During the training process, the value of $\sigma$ is fixed. This method allows the teacher model to sustain a weighted average of the student model’s sequential states, achieving a type of model ensembling.
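A compact sketch of the interaction strategy follows: the teacher parameters track an exponential moving average of the student parameters, and the student is trained against the teacher's predictions with a squared-error term. The decay rate value and the use of output probabilities (rather than logits) are illustrative assumptions.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   sigma: float = 0.999) -> None:
    """Model averaging: theta_t <- sigma * theta_t + (1 - sigma) * theta_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(sigma).add_(p_s, alpha=1.0 - sigma)

def distillation_loss(student_probs: torch.Tensor,
                      teacher_probs: torch.Tensor) -> torch.Tensor:
    """Distillation objective: drive the student's predicted probabilities
    toward the (detached) teacher predictions with a squared error."""
    return ((teacher_probs.detach() - student_probs) ** 2).mean()
```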

3.5. Training Strategy

The training procedure is divided into two main stages: an initial phase utilizing cross-entropy loss and a subsequent stage based on self-critical sequence optimization. The cross-entropy loss encourages the model to produce more precise and semantically relevant textual outputs by taking the negative logarithm of the probability the model assigns to the true word, thereby penalizing low-probability predictions more heavily. The learning rate is modified according to the Transformer’s learning rate scheduling methodology.
$$ L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left( y_t^{*} \mid y_{1:t-1}^{*}, I \right) $$
where $t$ indexes the time step, $y_t^{*}$ denotes the ground-truth word at step $t$, and $\theta$ denotes the set of all model parameters.
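In practice this is the standard token-level cross-entropy under teacher forcing; a brief sketch is given below, where the padding index is an assumed detail of the tokenization.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy over ground-truth words.
    logits: (B, T, V) decoder outputs; targets: (B, T) ground-truth token ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B * T, V)
        targets.reshape(-1),                   # (B * T,)
        ignore_index=pad_id,                   # ignore padded positions
    )
```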
In the self-critical sequence training phase (SCST) [5], the model first samples a sequence and computes its evaluation metric score. Simultaneously, a baseline sequence is generated using greedy decoding as a reference. The disparity in scores between the sampled and baseline sequences is used as a reward signal, encouraging the generation of higher-scoring sequences.
$$ \nabla_{\theta} L(\theta) = -\frac{1}{k} \sum_{i=1}^{k} \left( r\left( y_{1:T}^{i} \right) - b \right) \nabla_{\theta} \log p_{\theta}\left( y_{1:T}^{i} \right) $$
where $y_{1:T}^{i}$ is the $i$-th sampled sentence, $k$ is the beam size, $r(\cdot)$ is the reward function, and $b = \frac{1}{k} \sum_{i=1}^{k} r\left(y_{1:T}^{i}\right)$ is the baseline, calculated as the average reward obtained for the sampled sequences.
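The SCST update can be sketched as a policy-gradient step with a reward baseline. The sketch below follows the mean-of-samples baseline written in the equation above (a greedy-decoding baseline, as mentioned earlier, would be a drop-in replacement) and assumes the CIDEr rewards have already been computed externally.

```python
import torch

def scst_loss(sample_logprobs: torch.Tensor, sample_rewards: torch.Tensor) -> torch.Tensor:
    """SCST policy gradient with a mean-reward baseline.
    sample_logprobs: (B, k) summed log-probabilities of k sampled captions per image.
    sample_rewards:  (B, k) rewards (e.g., CIDEr scores) of those captions."""
    baseline = sample_rewards.mean(dim=1, keepdim=True)        # b = (1/k) * sum_i r(y^i)
    advantage = sample_rewards - baseline                      # r(y^i) - b
    return -(advantage.detach() * sample_logprobs).mean()      # minimize negative expected reward
```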

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset

This study employs the MSCOCO dataset [43], a prominent standard in image captioning. The dataset contains more than 120,000 images, each accompanied by five manually created captions. In our experiments, we adopt the Karpathy split to partition the dataset, ensuring a thorough and balanced setup for assessing the efficacy of our proposed model.

4.1.2. Evaluation Metrics

To quantitatively evaluate the quality of the generated captions, we adopt several widely used benchmark metrics, including BLEU-1/4 [44], METEOR [45], ROUGE [46], CIDEr [42], and SPICE [47]. These metrics are abbreviated as B-1, B-4, M, R, C, and S. In our experimental analysis, we employ the official evaluation scripts released by MSCOCO, ensuring a fair comparison with previous models.

4.1.3. Implementation Details

In our study, the presented model was developed with PyTorch 1.9.0, a widely used framework for deep learning. The visual encoder consisted of three layers, each with a dimensionality of 512. Similarly, the decoder also consisted of three layers, each incorporating a feed-forward network with a dimensional size of 2048. In the experimental setup, the Adam optimizer was employed with a beam size of 5. For word representation, we employed Byte Pair Encoding (BPE) [48]. Additionally, standard sinusoidal positional encoding [33] was employed to ensure the positional representation of every word.
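For completeness, the sinusoidal positional encoding mentioned above follows the standard formulation of Vaswani et al. [33]; a self-contained sketch is shown below.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding table of shape (max_len, d_model)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2)
                         * (-math.log(10000.0) / d_model))            # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                      # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                      # odd dimensions
    return pe
```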
During the cross-entropy training phase, we set the batch size to 50. During the SCST fine-tuning phase, we employed a reduced batch size of 30. Our experiments were conducted on a standard desktop machine operating on Ubuntu 22.04, equipped with three RTX 8000 GPUs (NVIDIA, Santa Clara, CA, USA).

4.2. Experimental Results

Table 1 presents a comparative analysis of the SCST training results between MFEAM and other image captioning models. In the comparative analysis, we incorporate models based on LSTM and methods that focus on regional attention mechanisms (i.e., Up-Down [4], HIP [49]), approaches utilizing the dual information flow fusion (such as DIFNet [50]), methods enhanced by graph network structures (i.e., GCN-LSTM [9]), models employing self-attention mechanisms (i.e., AoANet [51]), and models with decoders primarily structured around Transformers (i.e., X-Transformer [52], DLCT [53], RSTNet [54], BENet [55], VAT [29] and EURAIC [56]).
As shown in Table 1, MFEAM achieves consistently superior performance across standard evaluation metrics such as BLEU-1, BLEU-4, METEOR, ROUGE, CIDEr, and SPICE. For example, on the MSCOCO Karpathy test split, MFEAM achieves the highest CIDEr score of 140.1, outperforming prior Transformer-based models. This improvement stems from the combination of the multi-view feature enhanced attention and the Mean Teacher Learning approach, which strengthens the model’s ability to encode cross-modal semantics.
We compare MFEAM with several strong baselines in terms of parameter size and inference speed. As shown in Figure 4, MFEAM achieves the highest CIDEr score, while also having the smallest parameter size and second-fastest inference time. These results highlight the efficiency of our architecture and its suitability for real-time or resource-constrained scenarios.

4.3. Ablation Study

As presented in Table 2, different combinations of the enhancement modules are evaluated to comprehensively demonstrate the impact of the multi-view feature enhanced attention. As illustrated in Table 3, experiments are performed to validate the efficacy of model averaging. Furthermore, as shown in Table 4, comprehensive ablation experiments are conducted to assess the impact of the Mean Teacher architecture.
Role of the multi-view feature enhanced attention. Table 2 presents the experimental results of the enhancement modules, where the models employing individual enhancement modules, namely the self-enhanced module (SEM) and the sliding window module (SWM), are compared with the proposed MFEAM model. As can be seen, MFEAM achieves competitive performance compared to models that process entangled features from either the visual or textual view alone. The evaluation metrics of models using only the self-enhanced module or the sliding window module indicate that relying on a single enhancement module leads the teacher model to focus solely on either the visual or textual modality, which limits its ability to capture rich and comprehensive features. In contrast, MFEAM offers a more diverse mechanism for processing image features, enabling a more precise representation of entangled features of vision and text from both views.
Role of the model averaging. Table 3 presents the results obtained without applying the model averaging operation. In this setting, knowledge is transferred from the teacher model to the student model by optimizing a distillation-based objective that encourages the student to mimic the output behavior of the teacher. Meanwhile, instead of relying solely on static supervision, the teacher model dynamically incorporates guidance from the student model by tracking and utilizing the exponential moving average of the student’s parameters, thereby promoting a mutually reinforcing optimization process. This collaborative interaction enables the teacher model to provide a smoother and more stable learning target for the student model, thereby effectively reducing uncertainty during training. The performance of MFEAM, which uses CLIP-RN50×16 and model averaging, significantly surpasses that of the approach using CLIP-RN50×16 without model averaging. Model averaging enables the teacher model to retain a temporally weighted accumulation of the student model’s evolving parameters. This mechanism effectively serves as a form of model ensembling, enhancing the generalization capacity of the teacher model.
Role of the Mean Teacher architecture. We evaluate the performance of MFEAM using several image feature extraction modules, as indicated in Table 4. For models without the MFEAM architecture, a single language model is designated as the language decoder. The integration of all previously discussed components within the MFEAM framework leads to a substantial performance improvement, highlighting the effectiveness of the architecture. These results empirically validate the suitability of adopting the mean teacher learning paradigm. The results in Table 4 demonstrate that MFEAM attains the highest performance when employing the CLIP-RN50×16 in conjunction with the Mean Teacher architecture, where the distillation weight is set to 0.1. This configuration facilitates an effective balance between the supervisory signal provided by the teacher model and the representational learning capacity, leading to the enhanced alignment of visual and textual modalities.

4.4. Qualitative Studies

Figure 5 showcases image examples with ground truth and captions produced by CaMEL [13] and MFEAM, respectively. The qualitative examples show that captions produced by MFEAM adhere more closely to the surrounding context and exhibit markedly smaller semantic deviations. As illustrated in Figure 5, MFEAM accurately recognizes the spatial dependencies between objects present in the visual scene. The model also excels at capturing fine-grained information, particularly evident in its correct identification of both the cat’s color and its behavior in the third panel of Figure 5. These results suggest that the proposed MFEAM architecture conveys scene content and details with higher fidelity than CaMEL, offering improved alignment between visual semantics and textual descriptions. This advantage is attributed to the multi-view feature enhanced attention, which enables better cross-modal interaction. Taken together with the quantitative gains, these qualitative observations provide strong evidence for the superiority of MFEAM in generating contextually rich and semantically accurate captions.

4.5. Discussion

The results presented in Table 1 demonstrate that MFEAM achieves state-of-the-art performance across all major evaluation metrics, including BLEU, METEOR, ROUGE, CIDEr, and SPICE. In particular, MFEAM attains a CIDEr score of 140.1, outperforming strong baselines such as BENet (137.6), RSTNet (135.6), and DLCT (133.8). This improvement can be attributed to the dual-channel attention refinement strategy introduced by the multi-view feature enhanced attention, as well as the decoupled decoding structure in the student model.
In addition to the quantitative gains, qualitative results in Figure 5 further illustrate the effectiveness of MFEAM in generating accurate, context-aware, and semantically rich captions. Compared to CaMEL, MFEAM produces descriptions with smaller semantic deviations and better alignment with spatial and relational cues in the image.
Furthermore, compared with recently proposed methods, MFEAM offers a competitive advantage by achieving high caption quality without relying on external data or excessive model complexity. These results confirm the effectiveness of our framework in bridging visual and textual modalities while maintaining computational efficiency.

5. Conclusions

In this paper, we present a novel architecture named MFEAM, which incorporates the multi-view feature enhanced attention and a knowledge distillation framework for image captioning. The model employs the CLIP architecture to encode image features, followed by a combination of knowledge distillation and model averaging to transfer the learned features. This mechanism enables the teacher model to guide the training process effectively. Specifically, the teacher model incorporates the multi-view feature enhanced attention mechanism, which integrates the self-enhanced module and the sliding window module to process the entangled features of vision and text from diverse views. This design enriches the generation of descriptions by improving the understanding of complex visual and textual interactions. In addition, a novel feature processing strategy is introduced by dividing the decoder layers of the student model into two groups: one dedicated to capturing instance-specific features and the other to modeling inter-instance relational features. This approach mitigates potential performance degradation caused by excessive feature emphasis. Experimental results on the MSCOCO dataset demonstrate that MFEAM outperforms existing methods in terms of accuracy and caption quality. In future work, we aim to further reduce the model’s parameter count and improve inference speed by exploring lightweight attention modules and compression strategies, enabling broader applicability in real-time and resource-constrained scenarios.

Author Contributions

Conceptualization, Y.C. and J.Z.; Methodology, Y.C.; Software, Y.C.; Validation, Y.C.; Formal analysis, Y.C.; Investigation, Y.C.; Resources, Y.C.; Data curation, Y.C.; Writing—original draft, Y.C.; Writing—review and editing, J.Z.; Visualization, Y.C.; Supervision, J.Z.; Project administration, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  2. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  3. Aneja, J.; Deshpande, A.; Schwing, A.G. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5561–5570. [Google Scholar]
  4. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  5. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
  6. Wang, H.; Wang, H.; Xu, K. Evolutionary recurrent neural network for image captioning. Neurocomputing 2020, 401, 249–256. [Google Scholar] [CrossRef]
  7. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  8. Wang, C.; Yang, H.; Bartz, C.; Meinel, C. Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 988–997. [Google Scholar]
  9. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
  10. Zhu, X.; Li, L.; Liu, J.; Li, Z.; Peng, H.; Niu, X. Image captioning with triple-attention and stack parallel LSTM. Neurocomputing 2018, 319, 55–65. [Google Scholar] [CrossRef]
  11. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PmLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  12. Wang, C.; Shen, Y.; Ji, L. Geometry Attention Transformer with position-aware LSTMs for image captioning. Expert Syst. Appl. 2022, 201, 117174. [Google Scholar] [CrossRef]
  13. Barraco, M.; Stefanini, M.; Cornia, M.; Cascianelli, S.; Baraldi, L.; Cucchiara, R. CaMEL: Mean teacher learning for image captioning. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 4087–4094. [Google Scholar]
  14. Zhang, X.; Fan, M.; Hou, M. Mobilenet V3-transformer, a lightweight model for image caption. Int. J. Comput. Appl. 2024, 46, 1–9. [Google Scholar] [CrossRef]
  15. Chen, J.; Ge, C.; Xie, E.; Wu, Y.; Yao, L.; Ren, X.; Wang, Z.; Luo, P.; Lu, H.; Li, Z. PIXART-sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 74–91. [Google Scholar]
  16. Moratelli, N.; Caffagni, D.; Cornia, M.; Baraldi, L.; Cucchiara, R. Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization. arXiv 2024, arXiv:2408.14547. [Google Scholar] [CrossRef]
  17. Wang, F.; Mei, J.; Yuille, A. Sclip: Rethinking self-attention for dense vision-language inference. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 315–332. [Google Scholar]
  18. Moratelli, N.; Cornia, M.; Baraldi, L.; Cucchiara, R. Fluent and Accurate Image Captioning with a Self-Trained Reward Model. In Proceedings of the International Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2025; pp. 209–225. [Google Scholar]
  19. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing System, Long Beach, CA, USA, 4–9 December 2017; pp. 1195–1204. [Google Scholar]
  20. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge distillation of large language models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna Austria, 7–11 May 2024. [Google Scholar]
  21. Kang, M.; Lee, S.; Baek, J.; Kawaguchi, K.; Hwang, S.J. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. Adv. Neural Inf. Process. Syst. 2024, 36, 48573–48602. [Google Scholar]
  22. Li, Z.; Li, X.; Fu, X.; Zhang, X.; Wang, W.; Chen, S.; Yang, J. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26617–26626. [Google Scholar]
  23. Nguyen, T.; Gadre, S.Y.; Ilharco, G.; Oh, S.; Schmidt, L. Improving multimodal datasets with image captioning. Adv. Neural Inf. Process. Syst. 2024, 36, 22047–22069. [Google Scholar]
  24. Mahmoud, A.; Elhoushi, M.; Abbas, A.; Yang, Y.; Ardalani, N.; Leather, H.; Morcos, A.S. Sieve: Multimodal dataset pruning using image captioning models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 22423–22432. [Google Scholar]
  25. Awadalla, A.; Xue, L.; Shu, M.; Yan, A.; Wang, J.; Purushwalkam, S.; Shen, S.; Lee, H.; Lo, O.; Park, J.S.; et al. BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions. arXiv 2024, arXiv:2411.07461. [Google Scholar]
  26. Yu, Q.; Sun, Q.; Zhang, X.; Cui, Y.; Zhang, F.; Cao, Y.; Wang, X.; Liu, J. Capsfusion: Rethinking image-text data at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–20 June 2024; pp. 14022–14032. [Google Scholar]
  27. Chen, L.; Li, J.; Dong, X.; Zhang, P.; He, C.; Wang, J.; Zhao, F.; Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 370–387. [Google Scholar]
  28. Liao, Y.; Zhang, A.; Lu, M.; Wang, Y.; Li, X.; Liu, S. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20123–20132. [Google Scholar]
  29. Li, J.; Wang, Y.; Zhao, D. Layer-wise enhanced transformer with multi-modal fusion for image caption. Multimed. Syst. 2023, 29, 1043–1056. [Google Scholar] [CrossRef]
  30. Yang, C.; Li, Z.; Zhang, L. Bootstrapping interactive image-text alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  31. Ansari, K.; Srivastava, P. An efficient automated image caption generation by the encoder decoder model. Multimed. Tools Appl. 2024, 83, 66175–66200. [Google Scholar] [CrossRef]
  32. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 17, 6000–6010. [Google Scholar]
  34. Hinton, G. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  35. Zhou, Y.; Zhang, Y.; Hu, Z.; Wang, M. Semi-autoregressive transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3139–3143. [Google Scholar]
  36. Sameni, S.; Kafle, K.; Tan, H.; Jenni, S. Building Vision-Language Models on Solid Foundations with Masked Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14216–14226. [Google Scholar]
  37. Ren, K.; Hu, C.; Xi, H.; Li, Y.; Fan, J.; Liu, L. EDIR: An expert method for describing image regions based on knowledge distillation and triple fusion. Appl. Intell. 2025, 55, 62. [Google Scholar] [CrossRef]
  38. Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; Zhang, H. More grounded image captioning by distilling image-text matching model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4777–4786. [Google Scholar]
  39. Bajpai, D.J.; Hanawal, M.K. CAPEEN: Image Captioning with Early Exits and Knowledge Distillation. arXiv 2024, arXiv:2410.04433. [Google Scholar] [CrossRef]
  40. Cohen, G.H. ALIGN: A program to superimpose protein coordinates, accounting for insertions and deletions. Appl. Crystallogr. 1997, 30, 1160–1161. [Google Scholar] [CrossRef]
  41. Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829. [Google Scholar]
  42. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  43. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  44. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  45. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  46. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 22 July 2004; pp. 74–81. [Google Scholar]
  47. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 382–398. [Google Scholar]
  48. Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
  49. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Hierarchy parsing for image captioning. In Proceedings of the IEEE/CVF INTERNATIONAL Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2621–2629. [Google Scholar]
  50. Wu, M.; Zhang, X.; Sun, X.; Zhou, Y.; Chen, C.; Gu, J.; Sun, X.; Ji, R. Difnet: Boosting visual information flow for image captioning. In Proceedings of the IEEE/CVF Conference on Computer vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2022; pp. 18020–18029. [Google Scholar]
  51. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
  52. Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10971–10980. [Google Scholar]
  53. Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.W.; Ji, R. Dual-level collaborative transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2286–2293. [Google Scholar]
  54. Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. Rstnet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474. [Google Scholar]
  55. Yan, P.; Li, Z.; Hu, R.; Cao, X. BENet: Bi-directional enhanced network for image captioning. Multimed. Syst. 2024, 30, 48. [Google Scholar] [CrossRef]
  56. Wei, J.; Li, Z.; Zhu, J.; Ma, H. Enhance understanding and reasoning ability for image captioning. Appl. Intell. 2023, 53, 2706–2722. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed MFEAM framework. (a) The full architecture includes a visual encoder, a student model, and a teacher model. The teacher model guides the student model via knowledge distillation and model averaging from both visual and textual views. (b) Detailed internal structure of the teacher model, where the Multi-View Feature Enhanced Attention (MFE) module is embedded in both encoder and decoder.
Figure 2. Architecture of the Multi-View Feature Enhanced Attention (MFE). MFE consists of the self-enhanced module and the sliding window module, which enhance attention weights from the textual and visual views. Visual and textual features are processed through separate channels for processing, enhancing the relationship between visual and textual features.
Figure 3. The student decoder is split into two functional groups: an Instance Decoder, which processes the visual features, and a Relationship Decoder, which captures the relational dependencies among instances. Each group contains three Transformer decoder layers, forming two parallel decoding pathways for feature modeling.
Figure 4. Comparison of the computational efficiency of different image captioning methods on MSCOCO dataset.
Figure 5. Four image examples from the MSCOCO dataset with sentence generation results.
Table 1. Performance of MFEAM on the MSCOCO dataset.
Method               B-1    B-4    M      R      C       S
Up-Down [4]          79.8   36.3   27.7   56.9   120.1   21.4
GCN-LSTM [9]         80.5   38.3   28.5   58.3   127.6   22.0
AoANet [51]          80.2   38.9   29.2   58.8   129.8   22.4
EURAIC [56]          80.9   39.5   29.4   59.4   130.3   -
HIP [49]             -      39.1   28.9   59.2   130.6   22.3
VAT [29]             81.2   39.0   29.3   59.4   131.8   22.8
X-Transformer [52]   80.9   39.7   29.5   59.1   132.8   23.4
DLCT [53]            81.4   39.8   29.5   59.1   133.8   23.0
RSTNet [54]          81.8   40.1   29.8   59.5   135.6   23.3
DIFNet [50]          81.7   40.0   29.7   59.4   136.2   23.2
BENet [55]           82.1   40.3   30.0   59.7   137.6   23.6
MFEAM                82.8   40.8   30.5   60.2   140.1   24.3
Table 2. Results of enhancement modules when training with cross-entropy loss.
SEM   SWM   B-1    B-4    M      R      C       S
-     -     77.8   37.7   28.6   57.5   122.8   21.5
✓     -     78.2   38.1   28.9   57.8   123.5   21.8
-     ✓     78.6   38.9   29.1   58.6   125.3   22.1
✓     ✓     79.2   39.7   29.3   59.1   127.3   22.4
Note: “✓” indicates that the corresponding enhancement module (SEM or SWM) is enabled, while “-” indicates that the module is not used.
Table 3. Ablation study on the use of the model averaging when training with cross-entropy loss.
Model           Model Averaging   B-1    B-4    M      R      C       S
CLIP-VIT-B/16   -                 78.0   38.1   28.9   58.2   123.1   21.8
CLIP-VIT-B/32   -                 76.3   36.6   28.1   56.8   118.3   21.1
CLIP-RN50       -                 75.9   36.3   27.8   56.6   115.9   20.6
CLIP-RN101      -                 76.6   37.1   28.1   57.1   118.7   21.0
CLIP-RN50×4     -                 77.3   37.8   28.5   57.7   122.0   21.5
CLIP-RN50×16    -                 78.7   39.1   29.1   58.8   126.7   22.3
MFEAM           ✓                 79.2   39.7   29.3   59.1   127.3   22.4
Note: “✓” indicates that model averaging is enabled in the corresponding configuration, while “-” indicates that it is not used.
Table 4. Results of different image feature extraction modules when training with cross-entropy loss.
Model           MFEAM   B-1    B-4    M      R      C       S
CLIP-VIT-B/16   -       77.8   38.0   29.1   58.0   122.5   21.8
CLIP-VIT-B/16   ✓       78.8   38.9   29.2   58.6   125.7   22.0
CLIP-VIT-B/32   -       76.5   36.7   28.1   56.9   118.1   21.0
CLIP-VIT-B/32   ✓       76.5   36.9   27.9   56.9   119.7   21.1
CLIP-RN50       -       75.7   36.2   28.1   56.7   115.7   20.8
CLIP-RN50       ✓       76.3   36.6   28.2   56.8   116.2   20.7
CLIP-RN101      -       76.5   37.2   28.4   57.2   118.6   21.2
CLIP-RN101      ✓       77.1   37.4   28.5   57.5   119.0   21.1
CLIP-RN50×4     -       77.2   37.6   28.7   57.6   121.9   21.6
CLIP-RN50×4     ✓       77.9   38.3   28.6   58.0   122.4   21.7
CLIP-RN50×16    -       78.0   38.8   29.4   58.6   125.0   22.2
CLIP-RN50×16    ✓       79.2   39.7   29.3   59.1   127.3   22.4
Note: “✓” indicates that the proposed MFEAM architecture is used, while “-” indicates that a single language model serves as the language decoder.
