Article

Unpaired Image Captioning via Cross-Modal Semantic Alignment

1 School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China
2 School of Software, Xinjiang University, Urumqi 830091, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(21), 11588; https://doi.org/10.3390/app152111588
Submission received: 9 October 2025 / Revised: 26 October 2025 / Accepted: 27 October 2025 / Published: 30 October 2025

Abstract

Image captioning, as a representative cross-modal task, faces significant challenges, including high annotation costs and modality alignment difficulties. To address these issues, this paper proposes CMSA, an image captioning framework that does not require paired image-text data. The framework integrates a generator, a discriminator, and a reward module, employing a collaborative multi-module optimization strategy to enhance caption quality. The generator builds multi-level joint feature representations based on a contrastive language-image pretraining model, effectively mitigating the modality alignment problem and guiding the language model to generate text highly consistent with image semantics. The discriminator learns linguistic styles from external corpora and evaluates textual naturalness, providing critical reward signals to the generator. The reward module combines image-text relevance and textual quality metrics, optimizing the generator parameters through reinforcement learning to further improve semantic accuracy and language expressiveness. CMSA adopts a progressive multi-stage training strategy that, combined with joint feature modeling and reinforcement learning mechanisms, significantly reduces reliance on costly annotated data. Experimental results demonstrate that CMSA significantly outperforms existing methods across multiple evaluation metrics on the MSCOCO and Flickr30k datasets, exhibiting superior performance and strong cross-dataset generalization ability.

1. Introduction

Image captioning, as a core task at the intersection of computer vision and natural language processing, aims to accurately describe image content using natural language. Currently, mainstream approaches to image captioning can be broadly categorized into two types: supervised learning based on paired data and unpaired cross-modal generation. The first category relies on large-scale annotated datasets (e.g., MSCOCO) and trains models in an end-to-end manner to achieve precise visual-language alignment [1]. However, their performance is constrained by the diversity and scale of the available data. The second category attempts to alleviate reliance on extensive annotated data by integrating unimodal pretrained models (such as BERT and GPT) with cross-modal mapping mechanisms [2,3,4]. While these methods partially mitigate the problem of data scarcity, many of them lack explicit visual-semantic constraints, leading to insufficient modality alignment and generated captions that may deviate from entities actually present in the image. This misalignment is particularly pronounced in models that rely solely on text-pretrained decoders, which often overfit to linguistic priors and struggle to capture spatial relationships or fine-grained visual features—ultimately impairing the accuracy and relevance of generated descriptions [5,6,7,8,9]. Therefore, effectively incorporating visual-semantic constraints under unpaired frameworks remains a critical challenge in current research.
To address the aforementioned issues, this paper proposes a zero-shot learning framework based on the vision-language pre-trained model CLIP [10]. The framework innovatively integrates multi-level joint feature representations with a reinforcement learning-based reward mechanism, enhancing the model’s capability to capture image semantics and effectively mitigating semantic shift caused by insufficient modality alignment. Consequently, this leads to improved fluency and accuracy in generated captions. Specifically, the framework first leverages CLIP’s open-domain recognition ability in zero-shot scenarios to extract fine-grained visual semantic information from images and convert it into textual prompts. Subsequently, image features and prompt features are fused to construct multi-level joint feature representations. Next, a composite reward function is designed by combining CLIP-based image-text semantic alignment metrics with language fluency evaluation metrics, optimizing the decoding process through reinforcement learning. This method does not rely on paired image-text data and achieves deep integration of visual understanding and language generation, effectively enhancing the semantic accuracy and naturalness of generated captions. Experimental results on benchmark datasets such as MSCOCO demonstrate that, compared to existing unpaired methods, the proposed model achieves significant improvements on metrics including BLEU-4 and CIDEr, and produces captions with superior semantic consistency and linguistic fluency. The main contributions of this work are as follows:
(1)
A multi-level joint feature representation module based on CLIP’s cross-modal alignment capability is proposed, which can automatically extract semantic entities from images and deeply fuse them with image features to construct rich cross-modal joint features, effectively guiding the language model to generate descriptions that are highly consistent with the image content and semantically precise.
(2)
A joint reward function is designed that integrates image-text semantic alignment and language fluency; combined with a progressive multi-stage optimization strategy, it significantly enhances the semantic accuracy and naturalness of the generated text.
(3)
A novel progressive multi-stage optimization framework is developed, which seamlessly combines initialization and reinforcement learning to ensure training stability and enhance cross-modal adaptability in unpaired conditions.

2. Related Work

2.1. Unpaired Image Captioning

Image captioning aims to automatically generate natural language descriptions for input images. Traditional methods typically adopt an encoder-decoder architecture, where visual features are extracted and then decoded into text. However, such approaches heavily rely on large-scale paired image-text datasets, the acquisition of which is costly and often impractical for real-world applications. To reduce the dependence on paired data, researchers have proposed cross-modal generation methods based on unpaired datasets, seeking to enable content translation and generation without strict image-text correspondences. Among these, adversarial training-based methods align latent representations of images and texts to generate captions. However, due to insufficient alignment, these methods often suffer from semantic ambiguity or inconsistency, making it difficult to accurately capture image details and complex semantic relationships [11,12,13,14]. In recent years, cross-modal pre-trained models such as CLIP have demonstrated remarkable semantic alignment capabilities by mapping images and texts into a shared feature space through contrastive learning, opening new directions for image captioning. Building on this, some studies have attempted to construct language generation systems relying solely on textual data without image input, leveraging CLIP’s alignment ability to further reduce reliance on paired datasets [5,6,7,8,9,15]. Despite these advancements, several challenges remain: adversarial training often suffers from unstable modality alignment, leading to semantic deviations; and text-only training, lacking visual information, tends to produce captions that are irrelevant or only partially consistent with the image content, falling short of the semantic accuracy and content relevance required for image captioning.
To address the issues of semantic inconsistency and content deviation under unpaired conditions, this paper proposes a Cross-modal Semantic Alignment (CMSA) framework. By integrating visual features with semantic prompts, the framework constructs multi-level joint representations to guide language generation. Without relying on paired training data, CMSA effectively enhances the semantic relevance and descriptive accuracy of the generated captions, thereby improving the adaptability and generalization of cross-modal generation methods in complex application scenarios.

2.2. Reinforcement Learning

Reinforcement learning plays a pivotal role in complex text generation tasks such as image captioning, effectively guiding generation models toward improved generation policies. By designing appropriate reward mechanisms, reinforcement learning not only enhances the semantic relevance of generated texts but also improves their linguistic quality. Traditional methods, such as Self-Critical Sequence Training (SCST) [16], typically use n-gram based metrics like CIDEr [17] as the sole reward signal to optimize generation results by measuring similarity with reference captions. However, metrics like CIDEr have limitations in capturing semantic depth and linguistic diversity, often leading to template-like and less varied generated outputs.
To overcome these issues, this paper introduces a multidimensional composite reward strategy within the reinforcement learning framework. Specifically, it combines cross-modal semantic alignment scores (CLIPScore) with language naturalness evaluations (scored by a RoBERTa MLP-based module) to construct a richer reward function, enabling the model to optimize both image-text semantic consistency and text readability simultaneously. This strategy effectively balances content accuracy and linguistic fluency, significantly enhancing the quality and diversity of generated texts, making them more consistent with human language expression patterns.

3. Method

The Cross-Modal Semantic Alignment-driven Unpaired Image Captioning Model (CMSA) is an image captioning method that does not require paired image-text data. It is based on hierarchical joint feature representation and reinforcement learning. As illustrated in Figure 1, CMSA adopts a progressive multi-stage optimization strategy, consisting of an Initialization phase and a Reinforce phase. In the Initialization phase, a language model is trained solely on textual data to establish basic language generation capabilities. The Reinforce phase involves the collaborative operation of three core components: the Generator, Discriminator, and Reward Model, which iteratively refine the quality of the generated captions. Specifically, the Generator produces descriptive text under the guidance of joint feature representations; the Discriminator, leveraging adversarial training and reinforcement learning, evaluates the authenticity and fluency of the text and transforms the evaluation into reward signals for the Generator. The Reward Mechanism integrates both image-text matching scores and linguistic naturalness to construct a composite reward function, thereby enhancing the accuracy and fluency of the generated descriptions. The following sections detail the design and implementation of the Initialize, Generator, Discriminator, and Reward modules.

3.1. Initialize

The initialization stage is designed to equip the generator with language modeling capabilities while implicitly establishing a foundation for cross-modal representation learning. A large-scale pure-text training corpus is first constructed using all annotated captions from the MSCOCO dataset [18]. Nouns are extracted from these captions using the Natural Language Toolkit (NLTK) [19], and training pairs are constructed by incorporating them into predefined prompt templates (e.g., “Describe the picture in terms of the noun1, noun2, and noun3”). The annotated captions are encoded using CLIP and then transformed via a learnable projector. These projected representations are fused with the corresponding prompts and fed into a large language model (LLM) for autoregressive generation. The objective is to minimize the cross-entropy loss. Although no image features are involved in this phase, the prompt-driven generation task provides the model with a preliminary capability for semantic alignment, laying the groundwork for cross-modal generation.
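As an illustration of this text-only initialization step, the sketch below builds the textual part of one training pair from a COCO caption using NLTK part-of-speech tagging and the prompt template quoted above. The function name, the padding behavior when fewer than three nouns are found, and the choice of noun POS tags are our assumptions; the paper states only that nouns are extracted with NLTK and inserted into the predefined template.

```python
import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
PROMPT_TEMPLATE = "Describe the picture in terms of the {n1}, {n2}, and {n3}"

def build_init_pair(caption: str):
    """Build one text-only training pair: (prompt, target caption)."""
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    nouns = (nouns + ["object"] * 3)[:3]  # pad if fewer than three nouns (our assumption)
    prompt = PROMPT_TEMPLATE.format(n1=nouns[0], n2=nouns[1], n3=nouns[2])
    return prompt, caption

# Example:
# build_init_pair("A man riding a wave on a surfboard.")
# -> ("Describe the picture in terms of the man, wave, and surfboard",
#     "A man riding a wave on a surfboard.")
```

The target caption is what the language model learns to reproduce autoregressively under the cross-entropy objective described above.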

3.2. Generator

The generator in the proposed model adopts a dual-stream hybrid architecture that integrates visual-semantic features with textual prompt features to construct multi-level joint representations, thereby enabling dual-guided generation. This mechanism leverages the cross-modal alignment capability of the CLIP model [10] to guide the autoregressive decoder in generating text that is highly semantically aligned with the image, even under unpaired training conditions. Specifically, during the visual feature extraction phase, a frozen CLIP visual encoder is employed to encode the input image I, yielding a high-level semantic representation in the form of an image vector.
$v_{\text{clip}} = \text{CLIP\_ImageEncoder}(I)$
This design strategy effectively retains the visual-language associative knowledge acquired by CLIP through large-scale contrastive learning, providing a visually grounded representation with language awareness for the subsequent generation process. To further enhance the spatial adaptability between visual features and text generation, CMSA introduces a trainable mapping network (Projector), which projects the original image features into a sequence of visual prompt features with matching dimensions.
$P_v = \text{Projector}(v_{\text{clip}}) = W_2\,\text{Tanh}(W_1 v_{\text{clip}} + b_1) + b_2$
This feature sequence not only encodes the core semantic information of the image but also preserves the structural compatibility with the latent space of the text decoder, providing fine-grained cross-modal guidance signals for the generation process.
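A minimal sketch of such a projector is given below, following the two-layer form $W_2\,\text{Tanh}(W_1 v_{\text{clip}} + b_1) + b_2$ and the dimensions reported in Section 4.3 (1024 hidden units, output matching the GPT-2 embedding size). The prefix length and the reshaping of the output into a token sequence are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """P_v = W2 * tanh(W1 * v_clip + b1) + b2, reshaped into a visual prompt sequence."""
    def __init__(self, clip_dim=768, hidden_dim=1024, gpt_dim=768, prefix_len=1):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.fc1 = nn.Linear(clip_dim, hidden_dim)              # W1, b1
        self.fc2 = nn.Linear(hidden_dim, gpt_dim * prefix_len)  # W2, b2
        self.act = nn.Tanh()

    def forward(self, v_clip: torch.Tensor) -> torch.Tensor:
        # v_clip: (batch, clip_dim) image embedding from the frozen CLIP encoder
        h = self.act(self.fc1(v_clip))
        return self.fc2(h).view(-1, self.prefix_len, self.gpt_dim)
```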
Visual features support the construction of textual prompt features. Specifically, visual features serve as cross-modal query vectors to drive knowledge base retrieval based on semantic similarity. First, CMSA constructs an open-domain entity concept knowledge base $C = \{c_i\}_{i=1}^{N}$, where each concept $c_i$ corresponds to a textual description (e.g., “dog,” “mountain,” “person,” etc.). These concepts are encoded using a frozen CLIP text encoder to obtain a normalized textual embedding matrix. After computing semantic similarities, the top 3 most similar nouns are selected and incorporated into a prompt template to form $t_n$, which serves as the textual prompt feature. After encoding, the visual prompt and textual prompt are fused and fed into a large language model (LLM) to instantiate the caption generator.
$E_{\text{in}} = \text{Concat}(P_v, E(t_n))$
where $E(\cdot)$ represents the embedding layer of the LLM and $t_n$ is the textual prompt. The LLM uses the fused prompt with an autoregressive decoding strategy to gradually generate the descriptive text based on the conditional probability distribution. For each time step $t$, the selection of the current word $c_t$ is determined by maximizing the conditional probability:
$c_t = \arg\max_{w_i \in V} P(w_i \mid E_1, E_2, \ldots, E_k, c_1, c_2, \ldots, c_{t-1})$
where $\{E_i\}_{i=1}^{k}$ denotes the joint representation formed by the mapped image prefix $P_v$ and the prompt embedding $E(t_n)$, $\{c_j\}_{j=1}^{t-1}$ is the already generated word sequence, and $V$ is the vocabulary. Decoding stops when the sequence is sufficiently long or the end token ("END") is reached. The maximum length is set to 20, and the "END" token is set to ".".
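The retrieval-and-fusion step described above can be sketched as follows. The helper names (retrieve_concepts, build_generator_input) and the use of a Hugging Face GPT2LMHeadModel's token-embedding layer are our own choices; the paper specifies only that frozen CLIP encoders score image-concept similarity, the top-3 nouns fill the prompt template, and the embedded prompt is concatenated with the visual prefix $P_v$.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_concepts(image_feat, concept_feats, concepts, top_k=3):
    """image_feat: (d,) CLIP image embedding; concept_feats: (N, d) L2-normalized
    CLIP text embeddings of the concept bank; concepts: list of N noun strings."""
    sims = concept_feats @ F.normalize(image_feat, dim=-1)   # cosine similarities
    return [concepts[i] for i in sims.topk(top_k).indices.tolist()]

def build_generator_input(p_v, nouns, gpt2, tokenizer):
    """E_in = Concat(P_v, E(t_n)): prepend the visual prefix to the embedded prompt."""
    t_n = f"Describe the picture in terms of the {nouns[0]}, {nouns[1]}, and {nouns[2]}"
    ids = tokenizer(t_n, return_tensors="pt").input_ids.to(p_v.device)
    e_tn = gpt2.transformer.wte(ids)          # embedding layer E(.) of the LLM
    return torch.cat([p_v, e_tn], dim=1)      # (batch, prefix_len + prompt_len, dim)
```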

3.3. Discriminator

The discriminator is composed of RoBERTa [20] and an MLP scoring network. Drawing on the principles of Generative Adversarial Networks (GAN) [21], it is integrated into a reinforcement learning framework to enhance the quality of caption generation. Specifically, RoBERTa serves as a Transformer-based semantic encoder that extracts deep semantic features from the generated text, while the MLP network evaluates these features and outputs a naturalness score. This score is then fed back to the generator as a reward signal, encouraging it to produce captions that more closely resemble human language, thereby improving both semantic plausibility and linguistic fluency.
During the training process, the discriminator simultaneously receives both pseudo-real captions and generated captions, and computes their respective naturalness scores. These scores are not only used by the discriminator to learn its discriminative ability but also serve as reward signals that are fed back to the generator, encouraging it to continuously improve the quality of generated captions in terms of grammatical correctness, semantic coherence, and overall readability. Specifically, the model leverages CLIP’s cross-modal similarity retrieval capability [10] to select the most semantically relevant pseudo-real text from an external database and inputs both this text and the text generated by the generator into the RoBERTa encoder for processing. The naturalness score is then output through the MLP network. The scoring process is as follows:
$p_{\text{real}} = \sigma(\text{MLP}(\text{RoBERTa}(\text{Tokenize}(T_{\text{real}}))))$
$p_{\text{gen}} = \sigma(\text{MLP}(\text{RoBERTa}(\text{Tokenize}(T_{\text{gen}}))))$
where $T_{\text{real}}$ and $T_{\text{gen}}$ denote the pseudo-real and generated captions, respectively. Both types of text are first encoded by RoBERTa to obtain semantic representations, which are then passed through the MLP scoring network. The pseudo-real captions are retrieved human-written sentences from an external corpus, while the generated captions are produced by the generator. The MLP is trained jointly with the RoBERTa encoder to distinguish these two kinds of inputs and to output a naturalness score through the sigmoid function $\sigma(x)$. To optimize the discriminator, a binary cross-entropy (BCE) [22] loss is used, allowing the model to accurately differentiate between pseudo-real and generated captions. The loss function is defined as follows:
$L_D = -\frac{1}{N}\sum_{i=1}^{N}\left[\log p_{\text{real}}^{(i)} + \log\left(1 - p_{\text{gen}}^{(i)}\right)\right]$
where $N$ is the batch size, representing the amount of data input during training in one iteration, $p_{\text{real}}^{(i)}$ is the naturalness score of the i-th pseudo-real text, and $p_{\text{gen}}^{(i)}$ is the naturalness score of the i-th generated text. By minimizing this loss function, the discriminator can effectively improve its classification ability, providing more stable training signals for the generator.
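A hedged sketch of this naturalness scorer and its BCE objective is shown below. The pooling strategy (first-token representation) and the ReLU activation inside the MLP are assumptions of ours; the hidden width of 384 follows Section 4.3.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class Discriminator(nn.Module):
    def __init__(self, hidden_dim=384):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.scorer = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]                   # sentence representation
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)  # naturalness score in (0, 1)

def discriminator_loss(p_real, p_gen, eps=1e-8):
    # L_D = -(1/N) * sum_i [ log p_real^(i) + log(1 - p_gen^(i)) ]
    return -(torch.log(p_real + eps) + torch.log(1.0 - p_gen + eps)).mean()
```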

3.4. Reward

The Reward module is responsible for providing guidance signals that steer the generator toward producing descriptions that are both semantically aligned with the image and consistent with human language habits. Since text generation is inherently a discrete decision-making process, it cannot be directly optimized through gradient backpropagation [23]. To address this, CMSA employs a reinforcement learning strategy [24], leveraging the Policy Gradient method combined with Self-Critical Sequence Training (SCST) to enhance generation performance. During training, two types of reward signals are introduced to guide the parameter updates of the generator:
(1) Cross-modal semantic alignment reward: Given the input image features $I$ and the generated text $\hat{x}$, CMSA employs CLIPScore [25] to compute the matching score between the generated text and the image, denoted as $r_I(\hat{x}, I)$. This metric effectively captures the deep semantic associations between visual content and linguistic descriptions.
(2) Textual semantic naturalness reward: To more comprehensively evaluate the linguistic quality of generated text, CMSA incorporates a scoring module composed of RoBERTa and an MLP network within the discriminator to compute the naturalness score $p_{\text{gen}}$ of the generated sentence. This score serves as a key reward signal in the reinforcement learning process, effectively guiding the generator toward improved semantic coherence, language fluency, and overall readability. To integrate the two reward signals, CMSA defines a composite reward function:
$L_{\text{reward}} = \alpha \cdot r_I(\hat{x}, I) + (1 - \alpha) \cdot p_{\text{gen}}$
where $\alpha \in [0, 1]$ is a learnable balancing coefficient that dynamically weights cross-modal alignment and textual semantic fidelity. Under the framework of Self-Critical Sequence Training (SCST), CMSA adopts greedily decoded captions as the baseline $b$ to reduce the variance of gradient estimation. The policy gradient loss for the generator is defined as:
$L_{\text{PG}} = -\,\mathbb{E}_{\hat{x} \sim P_G}\left[\left(L_{\text{reward}} - b\right) \cdot \log P_G(\hat{x} \mid I)\right]$
Here, $\log P_G(\hat{x} \mid I)$ represents the log-probability assigned by the generator $G$ to the generated caption $\hat{x}$, conditioned on the input image $I$. By computing the difference between the reward and the baseline $b$, the model captures the relative improvement signal over the greedy decoding output, thereby encouraging the generator to produce higher-quality captions. To fully leverage the direct feedback from reward signals as well as the exploratory capabilities of policy gradients, CMSA combines the above components into the final generator loss function:
$L_G = \lambda \cdot L_{\text{reward}} + (1 - \lambda) \cdot L_{\text{PG}}$
where $\lambda \in [0, 1]$ is a dynamic weighting factor that adjusts the contribution of the reward term via a warm-up strategy, enabling a balanced optimization between the reward loss and the policy gradient loss. Through this joint optimization strategy, the generator can not only leverage the image-text matching signals provided by CLIP to ensure semantic consistency between the generated text and the input image, but also use the $p_{\text{gen}}$ score as a textual fluency reward. This effectively guides the generation process towards improving semantic coherence and linguistic naturalness, thereby significantly enhancing the overall semantic quality and readability of the captions.
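The sketch below illustrates one way to combine the composite reward with the SCST baseline. It assumes that CLIPScore and the discriminator return per-sample scores, that log-probabilities are summed over tokens, and that the reward term is negated so that minimizing $L_G$ maximizes the reward; the paper does not spell out these implementation details, so the function names and tensor conventions here are ours.

```python
import torch

def composite_reward(clip_score, p_gen, alpha):
    # L_reward = alpha * r_I(x_hat, I) + (1 - alpha) * p_gen
    return alpha * clip_score + (1.0 - alpha) * p_gen

def generator_loss(reward_sampled, reward_greedy, log_probs, lam):
    """reward_sampled: composite rewards of sampled captions, shape (batch,)
    reward_greedy:  composite rewards of greedily decoded captions (baseline b)
    log_probs:      summed token log-probabilities of the sampled captions
    lam:            warm-up weighting factor lambda in [0, 1]"""
    advantage = (reward_sampled - reward_greedy).detach()
    l_pg = -(advantage * log_probs).mean()      # SCST policy-gradient term L_PG
    l_reward = -reward_sampled.mean()           # reward treated as a (negated) loss term
    return lam * l_reward + (1.0 - lam) * l_pg  # L_G = lambda*L_reward + (1-lambda)*L_PG
```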

4. Experimental

4.1. Dataset

The CMSA model is trained in an unsupervised manner using unpaired data, where no explicit correspondence is established between images and texts. This training strategy enables the model to autonomously learn cross-modal alignment from a large collection of unpaired images and captions. To evaluate the model’s generalization performance under varying data resource conditions (high-resource, low-resource, and cross-dataset scenarios), we utilize two widely adopted image captioning datasets: the MSCOCO dataset [26] and the Flickr30k dataset [27]. Specifically, the MSCOCO dataset contains 123,287 images, each annotated with five human-written natural language descriptions; the Flickr30k dataset comprises 31,783 images collected from the Flickr website, with each image also accompanied by five descriptive sentences. In our experiments, we follow the standard data split protocol proposed by Karpathy et al. [18], where the MSCOCO dataset is divided into 113,287 training images, 5000 validation images, and 5000 test images, and the Flickr30k dataset is split into 29,783 training images, 1000 validation images, and 1000 test images. This ensures the comparability and generality of the evaluation results.

4.2. Evaluation Metrics

For performance evaluation, we employ five standard metrics widely used in image captioning tasks: BLEU-4 (B4) [28], METEOR (M) [19], ROUGE-L (R) [29], CIDEr (C) [17], and SPICE (S) [30]. These metrics are computed based on the human-annotated ground-truth captions in the test set and are used to assess the consistency and similarity between the generated captions and the reference captions.
BLEU-4 (Bilingual Evaluation Understudy), proposed by Papineni et al. in 2002 [28], evaluates the lexical overlap between generated and reference texts by calculating n-gram precision, and is widely adopted for measuring word-level similarity in image captioning tasks.
METEOR (Metric for Evaluation of Translation with Explicit ORdering), introduced by Banerjee and Lavie in 2005 [19], incorporates stemming, synonym matching, and paraphrase alignment to provide a more linguistically sensitive evaluation, making it suitable for multiple natural language generation tasks.
ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) was originally designed for automatic text summarization, focusing on the longest common subsequence (LCS) between generated and reference sentences to capture sequence-level coherence.
CIDEr (Consensus-based Image Description Evaluation), proposed in 2015, is specifically designed for image captioning tasks. It measures the consensus between the generated caption and multiple human annotations through TF-IDF weighted n-gram matching, reflecting semantic relevance and descriptive fidelity.
SPICE (Semantic Propositional Image Caption Evaluation), introduced in 2016, evaluates captions based on semantic scene graphs composed of objects, attributes, and relationships, thereby aligning more closely with human judgment of caption quality.
Among these metrics, higher scores indicate greater consistency between generated captions and reference descriptions, signifying better overall model performance.

4.3. Implementation Details

In the proposed CMSA model, the generator employs the ViT-L/14 variant of CLIP [10] as the image encoder, while the text generation module is based on GPT-2 [31] provided by Hugging Face. The projector consists of a two-layer MLP, where the hidden layer contains 1024 neurons, and the output dimension matches the input size of GPT-2. A Tanh activation function is applied after each hidden layer to introduce non-linearity and enhance feature representation. The discriminator is built upon a pre-trained RoBERTa model with a pooling layer, followed by a two-layer MLP (hidden dimension: 384, output dimension: 1). The optimizer used is AdamW, with hyperparameters configured as follows: learning rate of $1 \times 10^{-5}$, weight decay of 0.05, $\beta = (0.9, 0.999)$, and $\epsilon = 1 \times 10^{-8}$. The generator undergoes a warm-up for 3000 steps, while no warm-up is applied to the discriminator. The batch size is set to 16. Policy gradients are estimated by sampling five times from the generator and taking the average. During inference, beam search with a beam size of 1 is applied. In the initialization phase, the model is trained with a batch size of 80 and a learning rate of $2 \times 10^{-5}$, still using AdamW as the optimizer, with a warm-up of 5000 steps to stabilize training. All experiments are conducted on a single NVIDIA RTX 4090 GPU. The model is trained for 1 epoch in the Initialization stage and 50 epochs in the Reinforce stage. Final evaluation metrics are computed using the official COCO evaluation toolkit to ensure standardization and comparability.
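For concreteness, the optimizer setup described above can be written as follows; the linear warm-up scheduler from the transformers library is an assumed choice, since the paper specifies only the warm-up step counts and the AdamW hyperparameters.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimizer(model, total_steps, warmup_steps=3000, lr=1e-5):
    # AdamW with lr 1e-5, weight decay 0.05, betas (0.9, 0.999), eps 1e-8 (Section 4.3)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05,
                                  betas=(0.9, 0.999), eps=1e-8)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=warmup_steps,
                                                num_training_steps=total_steps)
    return optimizer, scheduler
```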

4.4. Comparison on Benchmarks

In the task of unpaired image caption generation, this study systematically evaluates a variety of representative existing methods. The baseline models can be categorized into three groups: the first group (ZeroCap [5], MAGIC [15], DeCap [7], CapDec [6], ViECap [9], and MacCap [8]) relies on text-driven strategies to achieve unpaired generation through prompt design or language model optimization; the second group (PL-UIC [32] and ESPER-Style [33]) leverages CLIP to build vision-language alignment; and the third group (UIC-GAN [11], WS-UIC [12], SCS [13], and PCH [14]) models cross-modal relationships without using CLIP by adopting weakly-supervised or unsupervised learning techniques. Experimental results show that CMSA consistently achieves the best performance across multiple benchmark datasets, demonstrating its strong generalization ability and effective module collaboration. These results validate the effectiveness and superiority of CMSA in generating high-quality image captions under unpaired settings.
For the qualitative analysis of image captioning, we selected several representative samples from the test set to compare the outputs of human annotations (Ground Truth), CapDec, and the proposed CMSA, as shown in Figure 2. CMSA exhibits greater precision in capturing key visual elements and semantic relationships, and the generated captions show superior detail and overall semantic consistency compared to competing methods, further highlighting its comprehensive capabilities in visual understanding and language generation.

4.5. Overall Performance Evaluation of the Model

Experimental results on the MSCOCO and Flickr30K datasets (Table 1 and Table 2) demonstrate that CMSA outperforms existing methods across various mainstream evaluation metrics, with particularly significant improvements in CIDEr and SPICE. The increase in CIDEr indicates that the captions generated by CMSA more closely match human annotations in terms of lexical similarity, while the improvement in SPICE reflects the model’s stronger capability in capturing visual semantics, enhancing semantic plausibility. In addition, CMSA also performs well on ROUGE, suggesting better coherence in long-text generation and greater conformity to natural language patterns. Overall, the results validate that CMSA maintains strong image-text alignment and natural language generation capabilities even in the absence of paired data.

4.6. Cross-Domain Generalization Performance Analysis

To evaluate the model’s generalization ability in high-low resource cross-domain scenarios, the model was trained on Flickr30K and tested on COCO (Table 3). The results show that CMSA outperforms other models across all evaluation metrics, with CIDEr and SPICE reaching 60.8 and 14.5, respectively, demonstrating a significant lead. This indicates that the model retains effective learning and transferability even under low-resource conditions. Similarly, when the model is trained on COCO and tested on Flickr30K (Table 4), it achieves excellent performance, especially with a notable advantage in CIDEr and SPICE. These results validate CMSA’s stable generalization ability across different data distributions, suggesting that it can maintain strong semantic understanding and text generation performance even without paired data. This makes it particularly suitable for low-resource image captioning tasks, with strong transferability and application potential.

4.7. Ablation Studies

To evaluate the impact of each component of CMSA on overall performance, we conducted an ablation experiment (Table 5), where I-F denotes image features, T-F denotes prompt text features, C-R represents CLIP-based rewards, and B-R represents RoBERTa-based rewards.
The results show that using image features, CLIP rewards, or RoBERTa rewards alone cannot achieve the performance of the complete model. For instance, in the image feature-only mode, CIDEr and SPICE scores drop to 82.5 and 14.2, indicating that the absence of textual modality weakens generation quality. Similarly, CLIP rewards and RoBERTa rewards, when applied individually, yield inferior results compared to their joint effect, demonstrating their complementarity in visual–semantic modeling.
The complete model achieves the best performance across all metrics, with CIDEr reaching 100.2 and SPICE reaching 18.9, highlighting the effectiveness of integrating multimodal features with dual reward signals. These results confirm that CMSA effectively leverages multimodal information through collaborative optimization, thereby improving both the coherence and semantic consistency of generated captions.
As shown in Figure 3, the number of retrieved entities (top_k) significantly influences the captioning performance of CMSA. In the generator, top_k determines how many visual concepts are retrieved from the CLIP-based knowledge base and injected into the textual prompt. When top_k = 1, the limited entity set restricts the descriptive scope, resulting in incomplete semantic coverage and lower CIDEr and SPICE scores. As top_k increases, richer contextual information helps produce captions with finer details and more accurate semantics. The model achieves its best overall performance when top_k = 3, reaching a CIDEr score of 100.2 and a BLEU-4 score of 29.0, while SPICE, METEOR, and ROUGE_L also remain at relatively high levels. Compared with top_k = 1, this configuration significantly improves the accuracy and semantic completeness of the generated captions. However, when top_k ≥ 5, redundant or weakly related entities are introduced, slightly reducing semantic consistency and increasing computational cost. Therefore, top_k = 3 is adopted as the default configuration, as it achieves the optimal trade-off between semantic richness and efficiency.

5. Conclusions

This paper proposes CMSA, a cross-modal image captioning framework that eliminates the need for paired image-text data. Leveraging multi-level joint feature representations and the powerful cross-modal alignment capabilities of CLIP, CMSA effectively integrates visual features and textual prompts to significantly enhance the semantic consistency of generated captions. By introducing a composite reward function that balances image-text semantic alignment and language fluency, combined with a staged progressive optimization strategy, CMSA achieves notable improvements in both semantic accuracy and naturalness of expression. Moreover, it demonstrates strong robustness under challenging low-resource conditions. Ablation studies further validate the synergistic benefits of visual alignment and semantic modeling, highlighting the key design factors driving model performance. This study offers an innovative and effective solution for unpaired image captioning tasks, with future work focusing on developing more generalizable prompting mechanisms and efficient model architectures to enhance adaptability and practical value.

Author Contributions

Conceptualization, Y.Y. and K.Z.; methodology, K.Z.; software, K.Z.; validation, K.Z.; formal analysis, K.Z.; investigation, K.Z.; resources, K.Z.; data curation, K.Z.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z.; visualization, K.Z.; supervision, Y.Y. and G.R.; project administration, Y.Y. and G.R.; funding acquisition, Y.Y. and G.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62167008); Smart Education Engineering Technology Research Center of Xinjiang Normal University (Grant No. XJNU-ZHJY202405); and Smart Education Engineering Technology Research Center of Xinjiang Normal University (Grant No. XJNU-ZHJY202406).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be accessed at https://cocodataset.org/#home (accessed on 10 October 2025) and https://shannon.cs.illinois.edu/DenotationGraph/ (accessed on 10 October 2025). Additional materials and implementation details are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable feedback.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
  2. Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
  3. Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. Simvlm: Simple visual language model pretraining with weak supervision. arXiv 2021, arXiv:2108.10904. [Google Scholar]
  4. Gu, S.; Clark, C.; Kembhavi, A. I can’t believe there’s no images! learning visual tasks using only language supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 2672–2683. [Google Scholar]
  5. Tewel, Y.; Shalev, Y.; Schwartz, I.; Wolf, L. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17918–17928. [Google Scholar]
  6. Nukrai, D.; Mokady, R.; Globerson, A. Text-only training for image captioning using noise-injected clip. arXiv 2022, arXiv:2211.00575. [Google Scholar]
  7. Li, W.; Zhu, L.; Wen, L.; Yang, Y. Decap: Decoding clip latents for zero-shot captioning via text-only training. arXiv 2023, arXiv:2303.03032. [Google Scholar]
  8. Qiu, L.; Ning, S.; He, X. Mining fine-grained image-text alignment for zero-shot captioning via text-only training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4605–4613. [Google Scholar]
  9. Fei, J.; Wang, T.; Zhang, J.; He, Z.; Wang, C.; Zheng, F. Transferable decoding with visual entities for zero-shot image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 3136–3146. [Google Scholar]
  10. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  11. Feng, Y.; Ma, L.; Liu, W.; Luo, J. Unsupervised Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4125–4134. [Google Scholar]
  12. Zhu, P.; Wang, X.; Luo, Y.; Sun, Z.; Zheng, W.S.; Wang, Y.; Chen, C. Unpaired image captioning by image-level weakly-supervised visual concept recognition. IEEE Trans. Multimed. 2022, 25, 6702–6716. [Google Scholar] [CrossRef]
  13. Ben, H.; Pan, Y.; Li, Y.; Yao, T.; Hong, R.; Wang, M.; Mei, T. Unpaired image captioning with semantic-constrained self-learning. IEEE Trans. Multimed. 2021, 24, 904–916. [Google Scholar] [CrossRef]
  14. Ben, H.; Wang, S.; Wang, M.; Hong, R. Pseudo Content Hallucination for Unpaired Image Captioning. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR), Phuket, Thailand, 10–14 June 2024; pp. 320–329. [Google Scholar]
  15. Su, Y.; Lan, T.; Liu, Y.; Liu, F.; Yogatama, D.; Wang, Y.; Kong, L.; Collier, N. Language models can see: Plugging visual controls in text generation. arXiv 2022, arXiv:2205.02655. [Google Scholar] [CrossRef]
  16. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
  17. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  18. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  19. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  20. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  21. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  22. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 23803–23828. [Google Scholar]
  23. Yu, J.; Li, H.; Hao, Y.; Zhu, B.; Xu, T.; He, X. CgT-GAN: Clip-guided text GAN for image captioning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2252–2263. [Google Scholar]
  24. Zhang, L.; Sung, F.; Liu, F.; Xiang, T.; Gong, S.; Yang, Y.; Hospedales, T.M. Actor-critic sequence training for image captioning. arXiv 2017, arXiv:1706.09601. [Google Scholar] [CrossRef]
  25. Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar]
  26. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  27. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2641–2649. [Google Scholar]
  28. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  29. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  30. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 382–398. [Google Scholar]
  31. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  32. Zhu, P.; Wang, X.; Zhu, L.; Sun, Z.; Zheng, W.S.; Wang, Y.; Chen, C. Prompt-based learning for unpaired image captioning. IEEE Trans. Multimed. 2023, 26, 379–393. [Google Scholar] [CrossRef]
  33. Yu, Y.; Chung, J.; Yun, H.; Hessel, J.; Park, J.; Lu, X.; Ammanabrolu, P.; Zellers, R.; Le Bras, R.; Kim, G.; et al. Multimodal knowledge alignment with reinforcement learning. arXiv 2022, arXiv:2205.12630. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed CMSA framework for unpaired image captioning.
Figure 2. Qualitative results of image captioning: GT vs. CapDec vs. ours.
Figure 3. Ablation study on different top_k values. The line charts show the variation trends of CIDEr (C), SPICE (S), METEOR (M), BLEU (B), and ROUGE_L (R), while the bottom-right table summarizes the numerical results.
Table 1. Comparison of different methods on the MSCOCO dataset.

Method              B-4     M       R       C       S
ZeroCap [5]         7.0     15.4    -       34.5    9.2
MAGIC [15]          12.9    17.4    -       49.3    -
UIC-GAN [11]        18.6    17.9    43.1    54.9    11.1
WS-UIC [12]         22.6    20.9    46.6    69.2    14.1
MacCap [8]          17.4    22.3    45.9    69.7    15.7
SCS [13]            22.8    21.4    47.7    74.7    15.1
PCH [14]            24.2    22.5    48.8    77.2    16.3
PL-UIC [32]         25.0    22.6    49.4    77.9    15.1
ESPER-Style [33]    21.9    21.9    -       78.2    -
DeCap [7]           24.7    25.0    -       91.2    18.7
CapDec [6]          26.4    25.1    -       91.8    18.2
ViECap [9]          27.2    24.8    -       92.9    18.2
CMSA (Ours)         29.0    25.5    53.0    100.2   18.9

The bold numbers indicate the best performance among all compared methods.
Table 2. Comparison of different methods on the Flickr30k dataset.

Method              B-4     M       R       C       S
ZeroCap [5]         5.4     11.8    -       16.8    6.2
MAGIC [15]          -       -       -       -       -
UIC-GAN [11]        10.8    14.2    33.4    15.4    -
WS-UIC [12]         -       -       -       -       -
MacCap [8]          -       -       -       -       -
SCS [13]            14.3    15.6    38.5    20.5    -
PCH [14]            -       -       -       -       -
PL-UIC [32]         21.4    20.1    -       8.8     13.6
ESPER-Style [33]    -       -       -       -       -
DeCap [7]           21.2    21.8    -       56.7    15.2
CapDec [6]          17.7    20.0    43.9    39.1    -
ViECap [9]          21.4    20.1    -       47.9    13.6
CMSA (Ours)         22.7    21.3    48.3    58.2    14.5

The bold numbers indicate the best performance among all compared methods.
Table 3. Cross-domain results on Flickr30k → MSCOCO.

Method          B-4     M       R       C       S
MAGIC [15]      5.2     12.5    -       18.3    5.7
DeCap [7]       12.1    18.0    -       44.4    10.9
CapDec [6]      9.2     16.3    -       27.3    17.2
ViECap [9]      12.6    19.3    -       54.2    12.5
CMSA (Ours)     21.8    20.9    47.8    60.8    14.5

The bold numbers indicate the best performance among all compared methods.
Table 4. Cross-domain results on MSCOCO → Flickr30k.

Method          B-4     M       R       C       S
MAGIC [15]      6.2     12.2    -       17.5    5.9
DeCap [7]       16.3    17.9    -       35.7    11.1
CapDec [6]      17.3    18.6    -       35.7    -
ViECap [9]      17.4    18.0    -       38.4    11.2
CMSA (Ours)     17.4    17.7    42.3    39.2    13.8

The bold numbers indicate the best performance among all compared methods.
Table 5. Ablation study on different model components. I-F: Image Feature, T-F: Text Feature, C-R: CLIP Reward, B-R: BERT Reward.

I-F   T-F   C-R   B-R   B-4     M       R       C       S
                        23.1    20.8    47.3    82.5    14.2
                        26.7    23.1    49.5    85.4    15.8
                        25.9    24.3    51.2    78.6    17.2
                        29.0    25.5    53.0    100.2   18.9

The bold numbers indicate the best performance among all compared methods. The checkmark symbol indicates that the corresponding module is enabled in each ablation setting.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
