TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning
Abstract
1. Introduction
- We propose the TSFE model for the RSIC task, which extracts fine-grained features of remote sensing images and introduces global features into the decoder, thereby improving the accuracy of the generated sentences;
- To obtain fine-grained features of remote sensing images, we propose an adaptive multi-scale feature fusion (AMFF) module and a local feature squeeze and enhancement (LFSE) module, which adaptively assign weights to the features of different channels while also establishing connections between different regions of remote sensing images (a minimal channel-weighting sketch follows this list);
- To integrate global features in the decoder, we propose a feature interaction decoder (FID), which fuses global image features and text features synchronously in the decoder to improve the accuracy of the generated sentences (see the second sketch after this list).
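The precise AMFF and LFSE formulations are given in Sections 3.1 and 3.2. As a rough, non-authoritative illustration of the channel-weighting idea (the encoder builds on squeeze-and-excitation, cf. SENet [42]), the PyTorch-style sketch below reweights multi-scale features that have been concatenated along the channel dimension; the module name, shapes, and reduction ratio are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ChannelReweight(nn.Module):
    """SE-style channel reweighting (cf. SENet [42]); illustrative sketch only."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), e.g. multi-scale features concatenated along the channel dimension
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))            # global average pooling -> (B, C)
        weights = self.fc(squeezed).view(b, c, 1, 1)
        return x * weights                       # adaptively reweight each channel


# Toy usage: fuse two feature maps (assumed already resized to a common resolution), then reweight.
f1 = torch.randn(2, 256, 7, 7)
f2 = torch.randn(2, 256, 7, 7)
fused = torch.cat([f1, f2], dim=1)               # channel-wise concatenation -> (2, 512, 7, 7)
refined = ChannelReweight(channels=512)(fused)   # same shape, channel-reweighted
```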
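The FID itself is specified in Section 3.3. The second sketch below shows only the generic idea of injecting a global image feature into an LSTM language decoder at every time step by concatenating it with the word embedding; the fusion strategy, dimensions, and names are assumptions and do not reproduce the FID's interaction design.

```python
import torch
import torch.nn as nn


class GlobalFeatureLSTMDecoder(nn.Module):
    """Toy decoder that feeds a pooled global image feature into each LSTM step (illustrative only)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 feat_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, feat_dim) pooled image feature (e.g. a GAP output)
        # tokens:      (B, T) ground-truth word indices (teacher forcing)
        b, t = tokens.shape
        h = global_feat.new_zeros(b, self.lstm.hidden_size)
        c = global_feat.new_zeros(b, self.lstm.hidden_size)
        logits = []
        for step in range(t):
            # concatenate the word embedding with the global feature at every step
            x = torch.cat([self.embed(tokens[:, step]), global_feat], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)        # (B, T, vocab_size)


decoder = GlobalFeatureLSTMDecoder(vocab_size=1000)
scores = decoder(torch.randn(4, 512), torch.randint(0, 1000, (4, 12)))
```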
2. Related Work
2.1. Remote Sensing Image Captioning
2.2. Feature Enhancement Strategies
2.3. Decoding Strategies
3. Method
3.1. Adaptive Multi-Scale Feature Fusion Module
3.2. Local Feature Squeeze and Enhancement Module
3.3. Feature Interaction Decoder
3.4. Objective Functions
4. Experiments
4.1. Datasets
4.1.1. RSICD
4.1.2. UCM-Captions
4.1.3. Sydney-Captions
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Compared Models
- (1) Up–Down [53]: This approach introduces a mechanism that combines bottom-up and top-down visual attention, allowing the model to dynamically select and weight various regions in the image.
- (2) SAT [54]: This method extracts visual features using a pre-trained CNN, applies spatial attention to obtain context vectors, and converts them into natural language using an LSTM (a generic sketch of this attention step follows the list).
- (3) MLA [30]: The multi-level attention model (MLA) designs a multi-level attention mechanism that chooses whether the image or the generated sequence serves as the main source of information for generating the next word.
- (4) Structured-Attention [33]: This method employs a fine-grained, structured attention mechanism to extract nuanced segmentation features from specific regions within an image.
- (5) RASG [38]: This method introduces a recurrent attention mechanism and semantic gates to fuse visual features and attentive features with higher precision, helping the decoder understand the semantic content more comprehensively.
- (6) MSISA [40]: This method uses a two-layer LSTM for decoding: the first layer integrates multi-source information, while the second layer uses the output of the first layer together with the image features to generate the sentences.
- (7) VRTMM [34]: This method uses a variational autoencoder model to extract image features and adopts a Transformer instead of an LSTM as the decoder for higher performance.
- (8) CASK [35]: This method extracts the corresponding semantic concepts from images and uses a Consensus Exploitation (CE) block to integrate image features and semantic concepts.
- (9) MGTN [16]: This approach proposes a mask-guided Transformer network with a topic token, which is added to the encoder as prior knowledge to focus on global semantic information.
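For readers who want a concrete picture of the attention step that several of the baselines above share (SAT, Up–Down, and others), the sketch below implements a generic additive spatial attention that turns region features and the decoder state into a context vector. It is a simplified, assumed formulation, not the specific attention of any compared model; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Generic additive attention over image regions -> context vector (illustrative sketch)."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512, attn_dim: int = 256):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, attn_dim)    # projects region features
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects the decoder hidden state
        self.w_a = nn.Linear(attn_dim, 1)           # scores each region

    def forward(self, regions: torch.Tensor, hidden: torch.Tensor):
        # regions: (B, N, feat_dim) region features; hidden: (B, hidden_dim) decoder state
        scores = self.w_a(torch.tanh(self.w_v(regions) + self.w_h(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)         # (B, N, 1) attention weights over regions
        context = (alpha * regions).sum(dim=1)       # (B, feat_dim) context vector
        return context, alpha.squeeze(-1)


context, weights = SpatialAttention()(torch.randn(2, 49, 512), torch.randn(2, 512))
```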
4.5. Comparisons with State of the Art
4.6. Ablation Study
- (1) "Baseline": The image features are extracted using a Swin Transformer pre-trained on ImageNet, and the decoder uses an LSTM to generate captions.
- (2) "Baseline+Fine-tuning": The AMFF module is used to extract multi-scale features, with a fine-tuning task adapting the encoder to the remote sensing dataset.
- (3) "Baseline+Fine-tuning+LFSE": Fine-tuning and the LFSE module are used simultaneously for the remote sensing image captioning task.
- (4) "Baseline+Fine-tuning+LFSE+FID (ours)": Our proposed TSFE model.
4.7. Qualitative Analysis
4.8. Visualization
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
RSIC | Remote sensing image captioning
CNN | Convolutional neural network
LSTM | Long short-term memory
GAP | Global average pooling
ViT | Vision Transformer
TSFE | Two-stage feature enhancement
AMFF | Adaptive multi-scale feature fusion
LFSE | Local feature squeeze and enhancement
FID | Feature interaction decoder
NIC | Natural image captioning
RNN | Recurrent neural network
MLA | Multi-level attention
LAM | Label-attention mechanism
GVFGA | Global visual feature-guided attention
GLCM | Global–local captioning model
MLAT | Multi-layer aggregated Transformer
BLEU | Bilingual evaluation understudy
ROUGE-L | Recall-oriented understudy for gisting evaluation (longest common subsequence)
METEOR | Metric for evaluation of translation with explicit ordering
CIDEr | Consensus-based image description evaluation
UCM | UC Merced
MLP | Multi-layer perceptron
Up–Down | Bottom-up and top-down attention
RASG | Recurrent attention and semantic gate
VRTMM | Variational autoencoder and reinforcement learning-based two-stage multitask learning model
MGTN | Mask-guided Transformer network
SENet | Squeeze-and-excitation network
Symbol | Description
---|---
I | The input remote sensing image
 | The output from each block of the Swin Transformer
c | The operation of concatenation in the channel dimension
F | The concatenated features
 | The output of SENet
H | The height of the feature map
W | The width of the feature map
C | The channel dimension of the feature map
 | The local feature
 | The global feature
 | The intermediate weighting maps
 | The scaled weighting maps
 | The refined local feature
 | The visual features input to the LSTM
 | The word generated at time t
 | The features input to the LSTM
 | The input gate of the LSTM
 | The forget gate of the LSTM
 | The cell state of the LSTM
 | The output gate of the LSTM
 | The output vector of the LSTM at time t
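For completeness, the LSTM gate notation listed above corresponds to the standard formulation of Hochreiter and Schmidhuber [14]; the gating actually used inside the FID is defined in Section 3.3. Written with generic weight matrices W, U and biases b (symbols assumed here for reference, not the paper's own notation), the standard update at time t is:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) &&\text{(input gate)}\\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) &&\text{(forget gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) &&\text{(cell state)}\\
h_t &= o_t \odot \tanh\!\left(c_t\right) &&\text{(output vector at time } t\text{)}
\end{aligned}
```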
References
- Liu, Q.; Ruan, C.; Zhong, S.; Li, J.; Yin, Z.; Lian, X. Risk assessment of storm surge disaster based on numerical models and remote sensing. Int. J. Appl. Earth Obs. 2018, 68, 20–30.
- Liu, Y.; Wu, L. Geological disaster recognition on optical remote sensing images using deep learning. Procedia Comput. Sci. 2016, 91, 566–575.
- Huang, W.; Wang, Q.; Li, X. Denoising-based multiscale feature fusion for remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2020, 18, 436–440.
- Lu, X.; Zheng, X.; Li, X. Latent semantic minimal hashing for image retrieval. IEEE Trans. Image Process. 2016, 26, 355–368.
- Recchiuto, C.T.; Sgorbissa, A. Post-disaster assessment with unmanned aerial vehicles: A survey on practical implementations and research approaches. J. Field Robot. 2018, 35, 459–490.
- Liu, L.; Gao, Z.; Luo, P.; Duan, W.; Hu, M.; Mohd Arif Zainol, M.R.R.; Zawawi, M.H. The influence of visual landscapes on road traffic safety: An assessment using remote sensing and deep learning. Remote Sens. 2023, 15, 4437.
- Li, S.; Kulkarni, G.; Berg, T.; Berg, A.; Choi, Y. Composing simple image descriptions using web-scale n-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA, 23–24 June 2011; pp. 220–228.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 2, 3104–3112.
- Wang, Q.; Huang, W.; Zhang, X.; Li, X. Word–Sentence framework for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10532–10543.
- Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global visual feature and linguistic state guided attention for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615216.
- Liu, C.; Zhao, R.; Shi, Z. Remote-sensing image captioning based on multilayer aggregated transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506605.
- Cheng, Q.; Huang, H.; Xu, Y.; Zhou, Y.; Li, H.; Wang, Z. NWPU-captions dataset and MLCA-net for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5629419.
- Fu, K.; Li, Y.; Zhang, W.; Yu, H.; Sun, X. Boosting memory with a persistent memory mechanism for remote sensing image captioning. Remote Sens. 2020, 12, 1874.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A mask-guided transformer network with topic token for remote sensing image captioning. Remote Sens. 2022, 14, 2939.
- Jia, J.; Pan, M.; Li, Y.; Yin, Y.; Chen, S.; Qu, H.; Chen, X.; Jiang, B. GLTF-Net: Deep-Learning Network for Thick Cloud Removal of Remote Sensing Images via Global–Local Temporality and Features. Remote Sens. 2023, 15, 5145.
- He, J.; Zhao, L.; Hu, W.; Zhang, G.; Wu, J.; Li, X. TCM-Net: Mixed Global–Local Learning for Salient Object Detection in Optical Remote Sensing Images. Remote Sens. 2023, 15, 4977.
- Ye, F.; Wu, K.; Zhang, R.; Wang, M.; Meng, X.; Li, D. Multi-Scale Feature Fusion Based on PVTv2 for Deep Hash Remote Sensing Image Retrieval. Remote Sens. 2023, 15, 4729.
- Liu, S.; Zou, H.; Huang, Y.; Cao, X.; He, S.; Li, M.; Zhang, Y. ERF-RTMDet: An Improved Small Object Detection Method in Remote Sensing Images. Remote Sens. 2023, 15, 5575.
- Li, Z.; Xiong, F.; Zhou, J.; Lu, J.; Zhao, Z.; Qian, Y. Material-Guided Multiview Fusion Network for Hyperspectral Object Tracking. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5509415.
- Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821.
- Jia, S.; Wang, Y.; Jiang, S.; He, R. A Center-masked Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5510416.
- Li, Y.; Chen, W.; Huang, X.; Gao, Z.; Li, S.; He, T.; Zhang, Y. MFVNet: A deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation. Sci. China Inform. Sci. 2023, 66, 140305.
- Ghamisi, P.; Couceiro, M.S.; Martins, F.M.; Benediktsson, J.A. Multilevel image segmentation based on fractional-order Darwinian particle swarm optimization. IEEE Trans. Geosci. Remote Sens. 2013, 52, 2382–2394.
- Jiang, W.; Ma, L.; Chen, X.; Zhang, H.; Liu, W. Learning to guide decoding for image captioning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting image captioning with attributes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4904–4912.
- Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems, Istanbul, Turkey, 16–18 December 2016; pp. 1–5.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
- Li, Y.; Fang, S.; Jiao, L.; Liu, R.; Shang, R. A multi-level attention model for remote sensing image captions. Remote Sens. 2020, 12, 939.
- Zhang, X.; Wang, X.; Tang, X.; Zhou, H.; Li, C. Description generation for remote sensing images using attribute attention mechanism. Remote Sens. 2019, 11, 612.
- Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote sensing image captioning with label-attention mechanism. Remote Sens. 2019, 11, 2349.
- Zhao, R.; Shi, Z.; Zou, Z. High-resolution remote sensing image captioning based on structured attention. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603814.
- Shen, X.; Liu, B.; Zhou, Y.; Zhao, J.; Liu, M. Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning. Knowl.-Based Syst. 2020, 203, 105920.
- Li, Y.; Zhang, X.; Cheng, X.; Tang, X.; Jiao, L. Learning consensus-aware semantic knowledge for remote sensing image captioning. Pattern Recognit. 2024, 145, 109893.
- Zhang, X.; Wang, Q.; Chen, S.; Li, X. Multi-scale cropping mechanism for remote sensing image captioning. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan, 28 July–2 August 2019; pp. 10039–10042.
- Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4404119.
- Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent attention and semantic gate for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608816.
- Wang, Q.; Huang, W.; Zhang, X.; Li, X. GLCM: Global–local captioning model for remote sensing image captioning. IEEE Trans. Cybern. 2022, 53, 6910–6922.
- Zhang, X.; Li, Y.; Wang, X.; Liu, F.; Wu, Z.; Cheng, X.; Jiao, L. Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens. 2023, 15, 579.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
- Ramos, R.; Martins, B. Using neural encoder-decoder models with continuous outputs for remote sensing image captioning. IEEE Access 2022, 10, 24852–24863.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
- Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2175–2184.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
- Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 794–803.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
Comparison with state-of-the-art methods on the RSICD dataset.

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|
SAT | 73.36 | 61.29 | 51.90 | 44.02 | 35.49 | 64.19 | 224.86 |
Up–Down | 76.79 | 65.79 | 56.99 | 49.62 | 35.34 | 65.90 | 260.22 |
MLA | 77.25 | 62.90 | 53.28 | 46.08 | 34.71 | 69.10 | 236.37 |
Structured-Attention | 70.16 | 56.14 | 46.48 | 39.34 | 32.91 | 57.06 | 170.31 |
RASG | 77.29 | 66.51 | 57.82 | 50.62 | 36.23 | 66.91 | 275.49 |
MSISA | 78.36 | 66.79 | 57.74 | 50.42 | 36.72 | 67.30 | 284.36 |
VRTMM | 78.13 | 67.21 | 56.45 | 51.23 | 37.37 | 67.13 | 271.50 |
CASK | 79.65 | 68.56 | 59.64 | 52.24 | 37.45 | 68.33 | 293.43 |
MGTN | 80.42 | 69.96 | 61.36 | 54.14 | 39.37 | 70.58 | 298.39 |
Ours | 80.54 | 70.64 | 62.08 | 54.86 | 37.93 | 70.20 | 305.70 |
Comparison with state-of-the-art methods on the UCM-Captions dataset.

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|
SAT | 79.93 | 73.55 | 67.90 | 62.44 | 41.74 | 74.41 | 300.38 |
Up–Down | 83.56 | 77.48 | 72.64 | 68.33 | 44.47 | 79.67 | 336.26 |
MLA | 84.06 | 78.03 | 73.33 | 69.16 | 53.30 | 81.96 | 311.93 |
Structured-Attention | 85.38 | 80.35 | 75.72 | 71.49 | 46.32 | 81.41 | 334.89 |
RASG | 85.18 | 79.25 | 74.32 | 69.76 | 45.71 | 80.72 | 338.87 |
MSISA | 87.27 | 80.96 | 75.51 | 70.39 | 46.52 | 82.58 | 371.29 |
VRTMM | 83.94 | 77.85 | 72.83 | 68.28 | 45.27 | 80.26 | 349.48 |
CASK | 89.00 | 84.16 | 79.87 | 75.75 | 49.31 | 85.78 | 383.14 |
MGTN | 89.36 | 84.82 | 80.57 | 76.50 | 50.81 | 85.86 | 389.92 |
Ours | 89.85 | 85.46 | 81.26 | 76.45 | 48.93 | 88.00 | 367.01 |
Comparison with state-of-the-art methods on the Sydney-Captions dataset.

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|
SAT | 79.05 | 70.20 | 62.32 | 54.77 | 39.25 | 72.06 | 220.13 |
Up–Down | 81.80 | 74.84 | 68.79 | 63.05 | 39.72 | 72.70 | 267.66 |
MLA | 81.52 | 74.44 | 67.55 | 61.39 | 45.60 | 70.62 | 199.24 |
Structured-Attention | 77.95 | 70.19 | 63.92 | 58.61 | 39.54 | 72.99 | 237.91 |
RASG | 80.00 | 72.17 | 65.31 | 59.09 | 39.08 | 72.18 | 263.11 |
MSISA | 76.43 | 69.19 | 62.83 | 57.25 | 39.46 | 71.72 | 281.22 |
VRTMM | 74.43 | 67.23 | 61.72 | 56.99 | 37.48 | 66.98 | 252.85 |
CASK | 79.08 | 72.00 | 66.05 | 60.88 | 40.31 | 73.54 | 267.88 |
MGTN | 83.38 | 75.72 | 67.72 | 59.80 | 43.46 | 76.60 | 269.82 |
Ours | 82.79 | 76.06 | 69.51 | 62.90 | 42.28 | 77.13 | 265.61 |
Ablation study of the proposed modules on the RSICD, UCM-Captions, and Sydney-Captions datasets.

Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|---|
RSICD | Baseline | 75.57 | 64.47 | 55.19 | 47.40 | 34.23 | 65.97 | 264.40 |
RSICD | Baseline+Fine-tuning | 77.68 | 67.53 | 59.10 | 52.10 | 35.74 | 69.00 | 278.61 |
RSICD | Baseline+Fine-tuning+LFSE | 78.54 | 68.71 | 60.57 | 53.80 | 36.78 | 70.24 | 294.60 |
RSICD | Ours | 80.54 | 70.64 | 62.08 | 54.86 | 37.93 | 70.20 | 305.70 |
UCM-Captions | Baseline | 78.46 | 68.71 | 65.32 | 62.83 | 38.29 | 72.36 | 302.57 |
UCM-Captions | Baseline+Fine-tuning | 81.77 | 76.92 | 72.48 | 67.47 | 42.42 | 78.58 | 321.17 |
UCM-Captions | Baseline+Fine-tuning+LFSE | 84.87 | 79.80 | 74.99 | 70.02 | 43.38 | 78.94 | 355.28 |
UCM-Captions | Ours | 89.85 | 85.46 | 81.26 | 76.45 | 48.93 | 88.00 | 367.01 |
Sydney-Captions | Baseline | 73.87 | 65.72 | 58.40 | 51.80 | 36.12 | 66.59 | 210.93 |
Sydney-Captions | Baseline+Fine-tuning | 79.62 | 73.32 | 66.68 | 59.65 | 38.49 | 69.18 | 240.85 |
Sydney-Captions | Baseline+Fine-tuning+LFSE | 80.15 | 73.87 | 67.11 | 60.00 | 39.17 | 71.32 | 250.22 |
Sydney-Captions | Ours | 82.79 | 76.06 | 69.51 | 62.90 | 42.28 | 77.13 | 265.61 |