In this section, we present the experimental evaluation of our method and discuss the results. We first introduce the publicly available datasets used in our experiments, as well as the evaluation metrics commonly employed in the field. We then provide details of the implementation setup, including the model configuration and training procedure. Finally, we perform a series of ablation experiments to evaluate the individual contribution of each module in our method and compare its performance with state-of-the-art approaches.
4.2. Evaluation Metrics
We utilized a comprehensive set of nine evaluation metrics to assess the performance of our proposed method and the quality of the generated sentences. These metrics include BLEU-n (n = 1, 2, 3, 4), Recall-Oriented Understudy for Gisting Evaluation (ROUGE_L), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Consensus-Based Image Description Evaluation (CIDEr), Semantic Propositional Image Caption Evaluation (SPICE), and Sm.
BLEU-n: The Bilingual Evaluation Understudy [42] is a widely used metric originally developed for evaluating machine translation systems. It measures the co-occurrence of n-grams (contiguous sequences of n words) between the generated sentences and the reference (ground truth) sentences. The value of n can be chosen as 1, 2, 3, or 4, representing unigrams, bigrams, trigrams, and four-grams, respectively.
ROUGE_L: The ROUGE_L metric [43] is a widely used evaluation metric in the fields of automatic summarization and machine translation. It measures the F-measure of the longest common subsequence (LCS) between the generated sentences and the reference (ground truth) sentences. The LCS represents the longest sequence of words that appears in both the generated and reference sentences. By computing the F-measure based on the LCS, it provides an indication of how well the generated sentences capture the key information and content of the reference sentences.
METEOR: This method is a widely used metric for evaluating the quality of machine translation output [44]. It measures the similarity between a generated sentence and a reference sentence by considering word-to-word matches and aligning the words in both sentences. The final METEOR score is calculated as the harmonic mean of precision and recall, providing a balanced evaluation of the generated sentence quality.
CIDEr: This metric is specifically designed for evaluating image captioning tasks [45]. It takes into account the term frequency-inverse document frequency (TF-IDF) weights of n-grams in both the generated and ground truth sentences. By applying TF-IDF weights, CIDEr captures the importance of specific n-grams in the context of the entire corpus of captions. It considers not only the presence of relevant n-grams but also their rarity across the dataset. This allows CIDEr to provide a more comprehensive evaluation of the generated captions, taking into account both the accuracy of the generated descriptions and their distinctiveness compared to other captions.
SPICE: The SPICE metric [46] constructs tuples from both the candidate (generated) captions and the reference captions and then calculates the F-score based on the matching tuples. Unlike traditional n-gram-based metrics, SPICE focuses on capturing the semantic meaning of captions rather than relying on specific word sequences. It represents objects, attributes, and relationships in a graph-based representation, which makes it less sensitive to the specific choice of n-grams.
Sm: Sm is a metric proposed in the 2017 AI Challenger competition to evaluate the quality of generated sentences. It is the arithmetic mean of four popular evaluation metrics, i.e., Sm = (BLEU-4 + METEOR + ROUGE_L + CIDEr) / 4. We also take the SPICE metric into consideration.
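To make these definitions concrete, the minimal Python sketch below illustrates two of the building blocks described above: the clipped n-gram overlap at the core of BLEU-n and the assembly of the Sm score from the individual metrics. The helper names and the example scores are illustrative and are not taken from our result tables.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity behind BLEU-n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def compute_sm(bleu4, meteor, rouge_l, cider):
    """Sm is the arithmetic mean of BLEU-4, METEOR, ROUGE_L, and CIDEr."""
    return (bleu4 + meteor + rouge_l + cider) / 4.0

generated = "a plane is parked at the airport".split()
reference = "an airplane is parked at the airport".split()
print(ngram_precision(generated, reference, n=2))   # bigram precision

# Placeholder per-metric values, e.g., produced by a standard captioning evaluation toolkit.
print(compute_sm(bleu4=0.65, meteor=0.40, rouge_l=0.70, cider=3.00))
```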
4.3. Experimental Details
In our CRSR model, the visual encoder, semantic refinement module, and sentence decoder are built from stacked Transformer blocks with a hidden state size of 512. The projection length of the visual mapper network is set to 10 for all three datasets. In the semantic retrieval and query prediction task, we extract high-frequency nouns and adjectives from the ground truth sequences of the datasets to serve as labels for filtering and prediction. To ensure meaningful and representative labels, we account for the variations in dataset sizes and the occurrence of words within the captions. Specifically, we select words that appear more than 15, 50, and 50 times in the Sydney, UCM, and RSICD datasets, respectively. The frequency of word appearances is calculated under the condition that a word must appear at least three times within each image’s five captions. The query length l for predicting omitted words is set to the average number of overlooked words in the current dataset; for the Sydney, UCM, and RSICD datasets, l is set to 4, 2, and 7, respectively.
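As a rough illustration of this label construction, the sketch below selects high-frequency nouns and adjectives under one plausible reading of the per-image frequency condition. The function names and the part-of-speech filter are hypothetical placeholders (a real filter could be built with NLTK or spaCy), and the toy example at the end is not drawn from any of the three datasets.

```python
from collections import Counter

def extract_label_vocabulary(captions_per_image, is_noun_or_adj, dataset_threshold):
    """captions_per_image maps an image id to its five tokenized ground-truth captions."""
    dataset_counts = Counter()
    for captions in captions_per_image.values():
        # Count word occurrences within this image's five captions.
        per_image = Counter(word for caption in captions for word in caption)
        for word, count in per_image.items():
            # A word only contributes if it appears at least three times within the
            # image's captions and is a noun or adjective.
            if count >= 3 and is_noun_or_adj(word):
                dataset_counts[word] += count
    # Keep words whose accumulated frequency exceeds the dataset-specific threshold
    # (more than 15 for Sydney, 50 for UCM and RSICD in our setting).
    return {word for word, count in dataset_counts.items() if count > dataset_threshold}

# Toy usage with a trivial stand-in POS filter.
toy = {"img_0": [["many", "buildings", "near", "buildings"],
                 ["dense", "buildings", "and", "roads"],
                 ["buildings", "beside", "a", "road"],
                 ["some", "buildings", "and", "trees"],
                 ["buildings", "in", "a", "city"]]}
print(extract_label_vocabulary(toy, is_noun_or_adj=lambda w: w.isalpha(), dataset_threshold=4))
```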
During training, we employ beam search decoding with a beam size of three. The modulating factor for the semantic refinement loss is set to . The entire architecture is optimized for 20 epochs with a batch size of 16. The Adam optimizer [47] is employed with a learning rate of 4 × (warmup: 20,000 iterations). The experiments are conducted on a Tesla V100 GPU using PyTorch version 1.13.1.
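For reference, a minimal PyTorch sketch of this optimization setup is given below. The model, the dummy data, and the peak learning rate (4e-5) are placeholders rather than the exact CRSR configuration; only the linear 20,000-iteration warmup and the 20-epoch, batch-size-16 schedule follow the settings above.

```python
import torch

model = torch.nn.Linear(512, 512)        # stand-in for the full CRSR architecture
optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)   # placeholder peak learning rate
warmup_iters = 20_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_iters),  # linear warmup, then constant
)

for epoch in range(20):
    for _ in range(1):                   # placeholder data loop (one dummy batch per epoch)
        batch = torch.randn(16, 512)     # batch size 16
        optimizer.zero_grad()
        loss = model(batch).pow(2).mean()   # dummy loss
        loss.backward()
        optimizer.step()
        scheduler.step()
```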
4.4. Experiments on Image Encoder
In this section, we investigate the effectiveness of different image features extracted from pre-trained image extractors and explore the impact of varying projection lengths in the Transformer Mapper network. We conducted these experiments on all three datasets, RSICD, UCM-Captions, and Sydney-Captions, to evaluate the impact of the feature extractor and the Transformer Mapper projection length on overall caption generation performance.
4.4.1. Different Image Feature Extractors
We performed a comprehensive comparison of different image feature extractors in the CLIP model. Specifically, we evaluated the performance of the following feature extractors: RN50, RN50x4, RN101, ViT-B/16, and ViT-B/32 [20]. The results of the comparison experiments under different image feature extractors are shown in Table 2. The best results are highlighted in bold.
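The backbones compared in Table 2 can be loaded interchangeably with the openai/CLIP package, as in the minimal sketch below; the image path is a placeholder for a remote sensing image, and the snippet only illustrates how frozen visual features are extracted.

```python
import clip                      # openai/CLIP: https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any of the backbones compared in Table 2 can be selected here.
backbone = "ViT-B/32"            # alternatives: "RN50", "RN50x4", "RN101", "ViT-B/16"
model, preprocess = clip.load(backbone, device=device)

# "scene.jpg" is a placeholder path for a remote sensing image.
image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)   # frozen CLIP visual features
print(features.shape)                      # e.g., torch.Size([1, 512]) for ViT-B/32
```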
For the RSICD and Sydney-Captions datasets, the choice of image feature extractor significantly affects the experimental results. On the RSICD dataset, the best CIDEr score of 3.0687 and the best SPICE score of 0.5276 are obtained when ViT-B/32 is used as the feature extractor. Similarly, for the Sydney-Captions dataset, our model achieves the best BLEU1-4, METEOR, ROUGE-L, and SPICE scores with ViT-B/32. Remarkably, even when employing other feature extractors, our model maintains state-of-the-art performance, showcasing its robustness and effectiveness across different image feature representations. This demonstrates the adaptability and generalization capability of our model, which delivers competitive results regardless of the feature extractor used.
Thus, based on these comprehensive results, ViT-B/32 stands out as the preferred choice for image feature extraction in our model, as it consistently delivers outstanding performance across all three datasets, ensuring optimal caption generation results.
4.4.2. Transformer Mapper Projection Lengths
By varying the projection length in the Transformer Mapper network, we can effectively control the intricacies of the visual information that the model can capture. To investigate the impact of different projection lengths on caption generation performance, we conducted experiments with projection lengths of 5, 10, 15, and 20.
The results, as shown in Table 3, clearly demonstrate that the choice of projection length in the Transformer Mapper network has a significant impact on the model’s caption generation performance. The best results of the different length settings are highlighted in bold. Among the tested lengths, setting the projection length to 10 consistently yields the best results on all three datasets. This indicates that a projection length of 10 strikes an optimal balance between capturing relevant visual features and managing the complexity of the visual information. When the projection length is too short (e.g., 5), the model’s visual representation lacks sufficient context and detail, leading to a degradation in caption quality. Conversely, when the projection length is too long (e.g., 15 or 20), the model may become overwhelmed with excessive visual information, leading to a scattered and less-focused representation, which also negatively impacts caption generation performance.
By setting the projection length to 10, the model can effectively capture contextually meaningful visual representations without being overwhelmed by excessive information, enabling it to generate more accurate and coherent captions. This emphasizes the importance of selecting an appropriate projection length for optimal performance in the caption generation task and highlights the effectiveness of the Transformer Mapper network in handling visual information at a moderate level of granularity.
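To illustrate what the projection length controls in practice, the sketch below shows one simplified way to expand a global CLIP feature into a fixed number of visual tokens refined by a small Transformer encoder. The class name and architecture are illustrative assumptions for exposition, not our reference Transformer Mapper implementation (which additionally uses learnable queries).

```python
import torch
import torch.nn as nn

class TransformerMapperSketch(nn.Module):
    """Illustrative mapper: a global CLIP feature is expanded into `proj_len` visual
    tokens, which a small Transformer encoder then refines (hidden size 512)."""

    def __init__(self, clip_dim=512, hidden=512, proj_len=10, num_layers=2, num_heads=8):
        super().__init__()
        self.proj_len = proj_len
        self.proj = nn.Linear(clip_dim, proj_len * hidden)   # expand to a token sequence
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_feat):
        # clip_feat: (batch, clip_dim) -> visual tokens: (batch, proj_len, hidden)
        tokens = self.proj(clip_feat).view(clip_feat.size(0), self.proj_len, -1)
        return self.encoder(tokens)

mapper = TransformerMapperSketch(proj_len=10)   # a projection length of 10 worked best here
visual_tokens = mapper(torch.randn(4, 512))
print(visual_tokens.shape)                       # torch.Size([4, 10, 512])
```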
4.5. Ablation Studies
In this section, we conduct an ablation study to investigate how each design in our CRSR model influences the overall performances on the RSICD, UCM-Captions, and Sydney-Captions datasets.
baseline model: In the baseline model (denoted as “bs”), we use a Transformer-based encoder–decoder structure, which utilizes only CLIP features as visual inputs and does not incorporate any supplemented semantic information. This serves as the foundation for our CRSR model, which incorporates additional components and modifications to enhance its performance.
bs+m: This denotes the baseline model with the Transformer Mapper network added to the visual encoder. The modification aims to enhance the visual encoding process, resulting in a more comprehensive and informative visual representation.
bs+mq: “mq” signifies the Transformer Mapper network with the inclusion of query prediction for additional semantic information. This guides the generated visual tokens to focus more on the critical regions of the image.
bs+sr: “sr” denotes the semantic refinement module, which introduces the retrieved words and the filtering of semantic tokens to our model.
bs+mq+sr: This denotes that both the Transformer Mapper network with learnable queries and the semantic refinement module are introduced into the baseline model. With the addition of both modules, our model can generate captions that have a more accurate and comprehensive structure.
We analyzed the effect of each submodule in light of the experimental results, which demonstrate the impact of each submodule on the overall performance of our model. The best results on the three datasets are highlighted in bold.
As shown in Table 4, the comparison between the baseline model and the “bs+m” model underscores the significance of incorporating the projection and attention mechanism of the Transformer Mapper network. This inclusion leads to increased BLEU1-4 and Sm scores across all three datasets, indicating that the model effectively captures the relationships and dependencies within the visual features. The BLEU-4 scores increased by 0.67%, 2.07%, and 2.23% on the Sydney-Captions, UCM-Captions, and RSICD datasets, respectively. We observed more significant improvements on the larger dataset, which indicates that the Transformer Mapper network is particularly effective in capturing complex relationships and dependencies within the visual features when dealing with larger and more diverse datasets. The larger dataset provides a richer and more diverse set of visual information, benefiting from the self-attention mechanism of the Transformer Mapper network. As a result, the model can better understand the spatial and contextual information present in the images, leading to more accurate and informative captions. This highlights the scalability and generalization capabilities of the Transformer Mapper network, making it a valuable addition to the caption generation model for larger and more challenging datasets.
Comparing the results of “bs+m” and “bs+mq”, the additional inclusion of learnable queries significantly enhances the model’s performance. By fusing the projected image features with learnable queries that predict critical semantic information, the attention mechanism further improves the extraction of semantically relevant features in the image. The overall metric scores exhibit great improvements on the Sydney-Captions and RSICD datasets, with scores increasing by 5.28% and 1.56%, respectively, while there are relatively smaller improvements on UCM-Captions. This discrepancy can be attributed to the shorter query length of UCM-Captions compared to the other two datasets, which could limit the potential for additional improvements. Given that the retrieval results on UCM-Captions already include enough semantic information, setting a longer query length for repeated semantic information is not necessary.
With the added semantic refinement module, the experimental results of “bs” and “bs+sr” in Table 4 demonstrate notable improvements in caption generation. Incorporating the retrieval and filtering of semantic tokens results in improved metric scores across all three datasets. For instance, on the largest dataset, RSICD, the BLEU-4, SPICE, and Sm scores increased by 1.98%, 1.13%, and 2.89%, respectively. In the UCM-Captions and Sydney-Captions datasets, the improvement from the semantic refinement module is even greater. Specifically, in UCM-Captions, the SPICE and Sm scores are 1.49% and 7.03% higher, respectively; in Sydney-Captions, the SPICE and Sm scores improved by 2.59% and 7.61%, respectively, compared to the baseline model. These results demonstrate the effectiveness of the semantic retrieval and refinement module in refining generated captions and enhancing overall performance.
Integrating both the Transformer Mapper network and the semantic refinement module into the model, the results of “bs+mq+sr” in comparison to “bs+mq” and “bs+sr” further affirm the cumulative benefits of both submodules. Notably, on the RSICD and Sydney-Captions datasets, there are significant improvements in the BLEU1-4, CIDEr, and Sm metrics. Moreover, the CIDEr, SPICE, and Sm scores on the UCM-Captions dataset also exhibit compelling enhancements, showcasing the effectiveness of these combined submodules across diverse datasets.
4.6. Comparison with Other Methods
In this section, we conduct extensive comparative experiments with seventeen state-of-the-art methods to demonstrate the effectiveness of our proposed CRSR method. The overall experimental results on the three datasets are shown in Table 5, Table 6 and Table 7. The best results on the three datasets are highlighted in bold.
Regarding the Sydney-Captions dataset, our proposed CRSR model exhibits competitive performance when compared with state-of-the-art methods. Notably, it achieved the highest SPICE score, although it lags slightly behind the DTFB method in overall performance. This achievement is significant given that the Sydney-Captions dataset is relatively small, consisting of only 613 RSIs, in comparison to the other datasets. This limited data size might have contributed to the relatively lower scores obtained by our model. As shown in Table 5, despite this limitation, our model still attained a competitive score of 1.0397 on the comprehensive Sm metric compared with the other methods.
In the case of the UCM-Captions dataset, our CRSR model outperforms existing methods, achieving the highest scores in most of the metrics, with the exception of SPICE and CIDEr, where it remains competitive. A substantial improvement is observed in our model’s BLEU1-4 scores compared to the previous state-of-the-art methods. Furthermore, the Sm metric score remains highly competitive when compared to the CASK method, demonstrating our method’s capability to generate more descriptive and contextually relevant captions.
Furthermore, on the RSICD dataset, which is the largest among the three datasets, our CRSR model continues to deliver remarkable performance. It achieves the highest scores in all evaluated metrics, outperforming the previous state-of-the-art methods. Notably, our model exhibits a substantial improvement of 13.44% in the CIDEr metric over the CASK method, which obtained the second-highest score. Moreover, there is a significant increase in all other evaluated metrics, reaffirming the CRSR model’s superiority in generating high-quality captions for RSIs.
4.7. Analysis of Training and Testing Time
In the context of practical applications, algorithmic efficiency holds paramount importance. For a comprehensive evaluation of the efficiency of our approach, we measured key parameters, including training time, testing time, and the total number of parameters, together with the Sm metric. The comparison was conducted between the baseline model and our proposed method on the RSICD dataset, and the results are presented in Table 8.
Analyzing the results from the comparative experiments, it is evident that our method, incorporating a Transformer Mapper and semantic refinement module, leads to an increase in the number of parameters compared to the baseline model. Despite this increment in parameters, the performance gains in caption generation are substantial. Therefore, when weighing the time cost against performance factors, our method exhibits a favorable trade-off, incurring a relatively small increase in time cost for a significant improvement in performance.