Remote Sensing Image Captioning via Self-Supervised DINOv3 and Transformer Fusion

Mehmood, Maryam; Shahzad, Ahsan; Hussain, Farhan; Caceres-Najarro, Lismer Andres; Usman, Muhammad

doi:10.3390/rs18060846


by Maryam Mehmood ¹,², Ahsan Shahzad ¹, Farhan Hussain ¹, Lismer Andres Caceres-Najarro ³ and Muhammad Usman ⁴,*

¹ Department of Computer and Software Engineering, National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
² Department of Software Engineering, National University of Modern Languages (NUML), Islamabad 44000, Pakistan
³ Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea
⁴ Department of Mathematics and Computer Science, Karlstad University, 65188 Karlstad, Sweden
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(6), 846; https://doi.org/10.3390/rs18060846

Submission received: 10 January 2026 / Revised: 1 March 2026 / Accepted: 6 March 2026 / Published: 10 March 2026


Highlights

What are the main findings?

A novel encoder–decoder framework integrating self-supervised DINOv3 with a hybrid Transformer–LSTM decoder achieves 9–12% improvement over CNN and Vision Transformer baselines across BLEU, CIDEr, METEOR, and ROUGE-L metrics on RSICD and UCM-Captions datasets.
DINOv3’s self-supervised visual representations eliminate the need for domain-specific supervised pretraining while producing semantically rich features that outperform traditional supervised encoders (VGG16, ResNet50) for remote sensing image description tasks.

What are the implications of the main findings?

Self-supervised vision transformers represent a robust alternative to supervised CNN-based encoders for multi-modal remote sensing applications, particularly valuable when annotated training data is scarce or expensive to obtain.
The proposed LSTM aggregation module between encoder and decoder effectively captures spatial continuity in structured patterns (roads, rivers, boundaries), demonstrating that lightweight sequential processing enhances caption coherence for aerial imagery analysis.

Abstract

Effective interpretation of coherent and usable information from aerial images (e.g., satellite imagery or high-altitude drone photography) can greatly reduce human effort in many situations, both natural (e.g., earthquakes, forest fires, tsunamis) and man-made (e.g., highway pile-ups, traffic congestion), particularly in disaster management. This research proposes a novel encoder–decoder framework for captioning of remote sensing images that integrates self-supervised DINOv3 visual features with a hybrid Transformer–LSTM decoder. Unlike existing approaches that rely on supervised CNN-based encoders (e.g., ResNet, VGG), the proposed method leverages DINOv3’s self-supervised learning capabilities to extract dense, semantically rich features from aerial images without requiring domain-specific labeled pretraining. The proposed hybrid decoder combines Transformer layers for global context modeling with LSTM layers for sequential caption generation, producing coherent and context-aware descriptions. Feature extraction is performed using the DINOv3 model, which employs the gram-anchoring technique to stabilize dense feature maps. Captions are generated through a hybrid of Transformer with Long Short-Term Memory (LSTM) layers, which adds contextual meaning to captions through sequential hidden layer modeling with gated memory. The model is first evaluated on two traditional remote sensing image captioning datasets: RSICD and UCM-Captions. Multiple evaluation metrics, such as Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), and Metric for Evaluation of Translation with Explicit Ordering (METEOR), are used to quantify the performance and robustness of the proposed DINOv3 hybrid model. The proposed model outperforms conventional Convolutional Neural Network (CNN) and Vision Transformer (ViT)-based models by approximately 9–12% across most evaluation metrics. Attention heatmaps are also employed to qualitatively validate the proposed model when identifying and describing key spatial elements. In addition, the proposed model is evaluated on advanced remote sensing datasets, including RSITMD, DisasterM3, and GeoChat. The results demonstrate that self-supervised vision transformers are robust encoders for multi-modal understanding in remote sensing image analysis and captioning.

Keywords:

remote sensing image analysis; image captioning; DINOv3; transformers; LSTM

1. Introduction

Extracting meaningful information from high-dimensional images, especially remote sensing data, is a fundamental challenge due to scale variability, spatial heterogeneity, and illumination variations. Images captured by satellites or high-altitude cameras are categorized as remote sensing images and contain complex spatial, spectral, and contextual patterns that must be effectively represented for interpretation tasks such as classification, object detection, or image description. Feature extraction plays a critical role in this process by converting raw pixel data into compact, discriminative representations that highlight relevant spatial and semantic characteristics [1].

Earlier research relied heavily on handcrafted features, including spectral indices, texture descriptors, and structural features derived from techniques such as the Scale-Invariant Feature Transform or Histogram of Oriented Gradients. Although these methods were effective for specific applications, they are sensitive to variations in scale, illumination, and imaging conditions, which are common challenges in remote sensing datasets [2].

With advancements in Convolutional Neural Networks (CNNs), a major shift has been observed, enabling models to automatically learn hierarchical features directly from data. CNN-based architectures, such as VGGNet, ResNet, DenseNet, and EfficientNets, are commonly used in remote sensing to extract spatial and spectral representations for scene classification and object recognition. However, CNNs inherently focus on local receptive fields, and while they are effective when sufficient labeled data is available, their performance may be limited when adapting to new remote sensing domains or when labeled data is scarce [3].

Ref. [4] proposed an enhanced transformer that fuses positional and channel-wise semantic information to better model spatial relationships in remote sensing images. Their positional channel semantic fusion significantly improves caption accuracy by strengthening global–local feature alignment. Ref. [5] introduced a cross-modal retrieval and semantic refinement framework that effectively aligns visual and textual representations for remote sensing image captioning. By refining semantic consistency across modalities, the method improves descriptive precision and retrieval performance. A two-stage feature enhancement approach is introduced in [6] that first strengthens visual representations and then refines semantic features for caption generation. This staged enhancement strategy improves the model’s ability to capture complex scene structures in remote sensing imagery. A patch-level, region-aware module, combined with a multi-label learning framework, is proposed in [7] to better capture fine-grained object semantics. This approach enhances caption diversity and accuracy by explicitly modeling multiple co-occurring semantic categories. The role of object-level semantics in Transformer-based image captioning models is explored in [8]. The study demonstrates that using richer semantic representations yields more coherent and contextually accurate captions.

Vision Transformers (ViTs) have recently emerged as a promising alternative for visual feature extraction. Unlike CNNs, ViTs treat images as sequences of patches and model the global relationships using self-attention mechanisms [9]. They are capable of capturing both local and long-range dependencies, which are often crucial for understanding complex spatial structures, particularly in remote sensing imagery. The integration of transformer-based architectures with self-supervised learning (SSL) has further advanced this field by enabling models to learn rich, generalizable representations from unlabeled data, which is especially advantageous given the limited availability of annotated remote sensing datasets. Models like DINOv3 exemplify this trend by producing robust, transferable, and semantically meaningful feature embeddings for downstream tasks such as image captioning and description generation [10].

Existing remote sensing image captioning methods largely rely on supervised CNN backbones or region-based features, which are limited in their ability to generalize across complex spatial layouts and scale variations commonly present in aerial imagery. Moreover, supervised pretraining introduces domain bias due to the scarcity of labeled remote sensing data. To address these limitations, this work adopts DINOv3 as a self-supervised visual backbone, designed to learn discriminative, scale-aware representations without manual annotations. We hypothesize that such representations, when fused through a Transformer-based multi-modal architecture, can better preserve fine-grained spatial semantics and global contextual relationships required for accurate caption generation. In summary, the main contributions of this research are:

Novel Architecture Design: To propose a novel encoder–decoder framework that integrates pretrained DINOv3 with a hybrid Transformer–LSTM decoder for remote sensing image captioning. To the best of our knowledge, this is the first work to leverage DINOv3’s self-supervised visual representations for remote sensing image description tasks, eliminating the need for domain-specific supervised pretraining.
LSTM-Enhanced Feature Aggregation: To introduce a lightweight LSTM aggregation module between the DINOv3 encoder and the Transformer decoder that processes patch tokens sequentially to capture spatial continuity and reduce feature redundancy. This module enhances the model’s ability to handle spatially structured patterns in remote sensing imagery (e.g., roads, rivers, and agricultural boundaries).
Comprehensive Benchmark Evaluation: To conduct extensive comparative experiments across three architectures (two baselines, VGG16 + LSTM and ResNet50 + Transformer + LSTM, and the proposed DINOv3 + Transformer + LSTM) on two benchmark datasets (RSICD and UCM-Captions), demonstrating a 9–12% improvement across multiple evaluation metrics (BLEU, CIDEr, METEOR, and ROUGE-L), along with both quantitative and qualitative analysis using attention visualization.

The rest of this paper is structured as follows. Section 2 provides background details and compares the proposed work with closely related studies in the literature. Section 3 presents the proposed methodology and experimentation environment, followed by evaluation results in Section 4. Section 5 discusses qualitative results. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Evolution of Feature Extraction in Remote Sensing

Feature extraction is one of the most essential tasks in remote sensing image analysis, as it enables models to transform complex spectral and spatial data into compact and informative representations suitable for various downstream tasks. Early research in this area focused mainly on handcrafted features, such as spectral indices (e.g., NDVI) and texture descriptors (e.g., GLCM), as well as keypoint-based methods such as the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG). Although these techniques provided interpretable and low-cost features, they failed to generalize across varying imaging conditions, spatial resolutions, and scene complexities [11].

With advances in deep learning, particularly CNNs, feature extraction in remote sensing has improved significantly. CNN-based models like AlexNet, VGGNet, ResNet, DenseNet, and EfficientNet learn hierarchical representations from raw pixel data, capturing both low-level features (e.g., edges and textures) and high-level semantics (e.g., objects and land-cover types) [3]. These models have become standard backbones for remote sensing applications, including scene classification, object detection, and land-use mapping. Ref. [12] reported that CNN models pretrained on ImageNet can be effectively fine-tuned for remote sensing classification tasks, outperforming traditional handcrafted approaches. However, the reliance of CNN models on large labeled datasets, along with their limited receptive fields, restricts their adaptability to data-scarce or highly variable remote sensing domains.

To address the limitations of CNN models, ViTs were introduced [13]. ViTs treat images as sequences of non-overlapping patches and employ self-attention mechanisms to capture long-range dependencies. These models have demonstrated improved scalability and generalization, particularly when pretrained on large-scale datasets. Their ability to model global context makes them highly effective for understanding the complex spatial relationships inherent in remote sensing imagery. Subsequent research has adapted ViTs for remote sensing tasks such as multispectral scene classification and cloud segmentation, demonstrating improved interpretability and feature richness compared to CNN-based models.

2.2. Self-Supervised Learning and the DINO Family

Self-supervised learning (SSL) is a technique in which a model learns from unlabeled data by generating its own supervision signals from the data itself, rather than relying on human-provided labels. The growing demand for large-scale labeled datasets in deep learning has highlighted the need for SSL, which allows models to learn meaningful representations without human annotation. SSL methods train models using pretext tasks that exploit the inherent structure of data, enabling the extraction of general features that are transferable to downstream tasks. This paradigm is particularly important for remote sensing, where labeled data are expensive and time-consuming to obtain [14].

Early SSL approaches, such as SimCLR [15] and MoCo [16], used contrastive learning frameworks that maximize agreement between representations of augmented views of the same image while contrasting them against other samples. Subsequently, BYOL [17] and SimSiam [18] proposed non-contrastive methods that avoid negative pairs, instead relying on asymmetric architectures and stop-gradient mechanisms to prevent representational collapse. These studies demonstrated that self-supervised pretraining can surpass supervised training in visual recognition performance.

Building on these approaches, Ref. [19] introduced DINO (Distillation with No Labels), a self-supervised framework based on a student-teacher distillation setup. The model learns by aligning the student’s output distribution with that of the teacher across multiple augmented views, encouraging invariance and stability without explicit labels. DINO also revealed an emergent property: semantic attention maps that naturally localize objects without supervision. Enhancements introduced in DINOv2, such as leveraging larger, more diverse datasets and adopting multi-crop training, led to performance gains across tasks including image classification, retrieval, and segmentation, reinforcing the effectiveness of self-distillation as a reliable pretraining approach [20].

The latest version, DINOv3, further improves dense feature stability through the use of Gram Anchoring and register tokens. Gram Anchoring regularizes the feature space by encouraging consistent pairwise feature correlations, while register tokens mitigate attention-related artifacts by providing dedicated, non-image tokens that stabilize attention maps and reduce spurious focus patterns across layers [21]. Compared to earlier DINO-based models, DINOv3 demonstrates improved generalization, robustness, and cross-domain transferability, which is particularly important for remote sensing image analysis and other domains where domain shifts and varying spatial resolutions are common. As a feature extractor, DINOv3 can provide rich, semantically structured representations that support downstream sequential models, such as LSTMs or Transformers, for high-level tasks like remote sensing image description [22].

2.3. Sequential Modeling in Image Description

Translating visual representations into textual sequences using natural language processing is known as sequential modeling. In early deep learning-based image captioning research, models combined CNN feature extractors with Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) decoders. LSTMs, introduced by [23], address the vanishing gradient problem by introducing gating mechanisms that control the flow of information, allowing them to model long-term dependencies in sequential data.

In remote sensing, CNN–LSTM architectures have become a popular choice for image description and scene captioning. Works such as [24,25] applied these architectures to generate semantic descriptions of satellite images, effectively bridging the gap between visual and linguistic modalities. These models demonstrated that integrating spatial feature representations with temporal sequence models can yield meaningful textual summaries of remote sensing scenes. However, despite their interpretability, LSTM-based models are limited by their sequential computation, which hinders parallelization and reduces efficiency when processing long sequences [26].

The rise of Transformer architectures [27] brought a paradigm shift in sequence modeling. Unlike RNNs, Transformers rely solely on self-attention, enabling them to model global dependencies and contextual relationships among tokens in a highly parallelized manner [28]. Transformer-based captioning models like ViT-GPT2, BLIP, and Flamingo have shown promising results in multi-modal tasks by aligning visual embeddings with language tokens through cross-attention mechanisms [29]. In remote sensing, recent works such as RSICap and RemoteCLIP have begun adopting Transformers for vision-language modeling, leading to improved fluency, diversity, and semantic accuracy in image descriptions compared with traditional CNN–LSTM models [30].

2.4. Hybrid Architectures: Combining DINO Features with Sequential Models

Recent studies have explored hybrid architectures that integrate self-supervised visual feature extractors with sequence generation models for multi-modal understanding. The rationale behind this design is to leverage the generalization capabilities of pretrained encoders alongside the language modeling strength of decoders. In such architectures, pretrained models like CLIP, ViT, or DINO are often used as frozen feature extractors, feeding high-level embeddings to LSTM- or Transformer-based decoders for text generation [31].

Ref. [32] employed ViT-based features with Transformer decoders for scene captioning, demonstrating improved contextual preservation. Similarly, in remote sensing, approaches such as RSICap+ integrated pretrained CNNs with attention-based decoders to generate descriptive captions of satellite scenes [33]. However, most existing research relies on supervised visual encoders, such as ResNet or CLIP, which can limit performance on domain-specific imagery [34,35]. To date, no studies have explored DINOv3 as a visual backbone for remote sensing description tasks, despite its ability to generate semantically rich, domain-agnostic representations without requiring labeled data. This gap presents a promising research direction: combining DINOv3 feature embeddings with Transformer- or LSTM-based sequence decoders to generate accurate, human-like textual descriptions of remote sensing imagery.

2.5. Comparative Insights and Research Gaps

While various pretrained vision backbones have been adopted for remote sensing captioning and related vision-language tasks, including supervised CNNs and contrastively trained models (e.g., CLIP-style encoders) [36], none to our knowledge have employed DINOv3 as the primary visual backbone. DINOv3 represents a recent advancement in self-supervised vision models, trained on an extremely large corpus of unlabeled data (~1.7 B images) and designed to produce high-quality dense, patch-level features without requiring labels or text annotations [22]. In contrast to other self-supervised backbones, such as DINOv2 and masked autoencoders (MAE), DINOv3 demonstrates superior dense representation quality and generalization across diverse downstream tasks without fine-tuning [37]. For instance, DINOv3’s feature maps preserve detailed spatial semantics crucial for dense prediction tasks such as segmentation and depth estimation, outperforming prior self-supervised learning models on benchmarks that are directly relevant to generating accurate, spatially grounded image captions. Meanwhile, CLIP-style pretrained models optimize joint image–text alignment using paired data, producing strong global semantic representations but often failing to capture finer spatial and structural details when used as frozen backbones for dense captioning tasks [38]. Compared with these alternatives, DINOv3’s dense patch embeddings provide a uniquely advantageous foundation for remote sensing image captioning.

The literature also points to a developmental trend in visual understanding, from handcrafted feature engineering to deep supervised learning, and most recently to self-supervised, attention-based architectures. Although CNN-LSTM frameworks have established a solid foundation for remote sensing image captioning, they remain constrained by their reliance on labeled data and limited ability to model global context, thereby limiting scalability. Transformer-based decoders and self-supervised ViTs, such as DINOv3, offer a pathway to overcome these limitations by integrating effective representation learning with scalable sequence modeling. Despite these developments, several research gaps remain:

A lack of studies investigating self-supervised vision transformers (particularly DINOv3) in remote sensing image captioning.
Limited comparative benchmarks between DINOv3-based and CNN-based feature extractors for caption generation tasks.
Underexplored performance of hybrid architectures (DINO + LSTM/Transformer) on smaller or domain-specific remote sensing datasets.

This study addresses these gaps by investigating the integration of DINOv3 feature extraction with Transformer- and LSTM-based decoders for remote sensing image captioning, and by comparing their effectiveness, interpretability, and robustness across diverse remote sensing datasets.

3. Methods and Materials

3.1. Feature Extraction

The DINOv3 ViT-B/16 distilled model is used as an image encoder for feature extraction. DINOv3, a self-supervised vision transformer, is pretrained on large-scale datasets to learn discriminative representations without explicit labels, making it particularly suitable for remote sensing images where annotated data may be limited. The graphical representation of this model is illustrated in Figure 1.

DINOv3 follows a student-teacher design. Two distinct cropped views of a single image are generated and passed to the two encoder models, the student and the teacher. Each encoder splits its input into non-overlapping patches of 16 × 16 pixels and prepends a special learnable class (CLS) token, resulting in a sequence of 196 patch tokens plus one class token, for a total sequence length of 197 (which corresponds to a 224 × 224 input). Each token is embedded in a 768-dimensional vector space. The ViT-B configuration consists of 12 transformer encoder layers, with 12 attention heads per layer, an embedding dimension of 768, and a feed-forward network (FFN) inner dimension of 3072. This setup enables the extraction of hierarchical features that capture both local textures (e.g., building structures) and global contexts (e.g., land use patterns) in remote sensing scenes using pretrained DINOv3. The encoder outputs are then fed into a projection head to obtain logits, and the SoftMax function is applied to produce probability distributions. The student model is trained to match the prediction of the teacher model by minimizing the cross-entropy between the two distributions, with the gradient stopped from flowing into the teacher model during training. Only the parameters of the student model are updated by gradient descent, while the teacher model provides a stable target. This approach is known as knowledge distillation. It is typically used to transfer knowledge from a powerful source (the teacher) to a smaller model (the student), making inference more efficient. In DINO, however, the student and the teacher share the same architecture and parameter count. A remaining challenge is that the model must be trained without labels.
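The token and head dimensions quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the class name `ViTConfig` is hypothetical, the 224 × 224 input resolution is an assumption inferred from the 196-patch count, and any additional register tokens are ignored so that the count matches the sequence length of 197 given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """ViT-B/16 hyperparameters as listed in the text (hypothetical container)."""
    image_size: int = 224   # assumption: not stated explicitly in the text
    patch_size: int = 16
    embed_dim: int = 768
    num_layers: int = 12
    num_heads: int = 12
    ffn_dim: int = 3072

    @property
    def num_patches(self) -> int:
        # Non-overlapping square patches tile the image on a regular grid.
        per_side = self.image_size // self.patch_size
        return per_side * per_side

    @property
    def sequence_length(self) -> int:
        # One learnable class (CLS) token is prepended to the patch tokens.
        return self.num_patches + 1

    @property
    def head_dim(self) -> int:
        # The embedding is split evenly across the attention heads.
        return self.embed_dim // self.num_heads


cfg = ViTConfig()
print(cfg.num_patches, cfg.sequence_length, cfg.head_dim)  # 196 197 64
```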

To overcome this issue, the output dimension of the projection head is set to a large number, i.e., 65,000. This enables the model to discover and represent a wide range of visual concepts without requiring them to be defined in advance. In conventional knowledge distillation, the teacher model is already trained; here, the teacher is instead derived from the student, with its weights updated as an exponential moving average (EMA) of the student’s weights over time. As a result, the teacher model changes gradually, providing a stable and consistent target for the student model to learn from. After each training step, the parameters of the teacher model are updated using the formula:

\theta_{t} \leftarrow \lambda \theta_{t} + (1 - \lambda) \theta_{s} \qquad (1)

Here, θ_t and θ_s denote the parameters of the teacher and the student, respectively. The value λ controls the update rate of the teacher model; a larger value means slower updates. Initially, λ = 0.996, and it is gradually increased as training proceeds. This approach alone is insufficient for problems like remote sensing, where images exhibit diverse categories of features. With only this method, a single output dimension could come to dominate the distribution, leading to uninformative, collapsed representations. To prevent a single dimension from dominating, the concept of centering is introduced. It helps prevent collapse by ensuring the average output of the teacher model remains balanced across all output dimensions. Instead of computing the center vector of each batch independently, a moving average of the center is maintained over time, using the equation:

c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_{t}}(x_{i}) \qquad (2)
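The two moving-average updates in Equations (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration on flat lists of numbers rather than real network weights; the function names are hypothetical, and the momentum value m = 0.9 used in the example is an assumed illustrative value, not one reported in the text.

```python
def ema_update(teacher, student, lam):
    """Eq. (1): theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise."""
    return [lam * t + (1.0 - lam) * s for t, s in zip(teacher, student)]


def update_center(center, teacher_outputs, m):
    """Eq. (2): c <- m * c + (1 - m) * (mean teacher output over the batch)."""
    batch_size = len(teacher_outputs)
    # zip(*rows) iterates over output dimensions, i.e. columns of the batch.
    batch_mean = [sum(column) / batch_size for column in zip(*teacher_outputs)]
    return [m * c + (1.0 - m) * b for c, b in zip(center, batch_mean)]


# Toy parameters: with lam = 0.996 the teacher moves only slightly
# toward the student after one step.
teacher_params = ema_update([1.0, 0.0], [0.0, 1.0], lam=0.996)

# Toy batch of two teacher outputs; the batch mean is [3.0, 1.0].
center = update_center([0.0, 0.0], [[2.0, 0.0], [4.0, 2.0]], m=0.9)

print(teacher_params, center)
```

A larger λ (or m) gives more weight to the previous value, so the teacher and the center change more slowly from step to step.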

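Here, c is the center vector, m is the momentum of the moving average, B is the batch size, and g_{θ_t}(x_i) is the teacher output for the i-th image in the batch. With the centering trick alone, however, the outputs can collapse to a uniform distribution. To address this, sharpening of the distribution is enforced by lowering the temperature parameter in the SoftMax function.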

P(y = i) = \frac{\exp(z_{i} / T)}{\sum_{j=1}^{K} \exp(z_{j} / T)} \qquad (3)

Here, P is the probability distribution, z_i is the i-th logit, K is the number of output dimensions, and T is the temperature. As the temperature T is lowered, the probability distributions produced by the SoftMax function become more peaked, concentrating more mass on the largest logit. Conversely, as the temperature is increased, the output probabilities become less sharp and are more evenly distributed across all classes. The temperature of the student model is fixed at 0.1, whereas the teacher model starts at a lower temperature that is gradually increased using a linear schedule during training. Because the teacher operates at a lower temperature than the student, the probability distribution of the teacher model is sharper than that of the student model. This makes the prediction of the teacher model more confident, with a higher probability mass on the most likely class. This gives a strong signal to the student model, encouraging it to match the confident prediction of the teacher model.
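The effect of the temperature in Equation (3), and the way it enters the student-teacher objective, can be illustrated with a short standard-library sketch. The function names are hypothetical, the teacher temperature of 0.04 is an assumed illustrative value (the text only states that it is lower than the student's 0.1), and subtracting the center from the teacher logits before the SoftMax is an assumption about how centering is applied.

```python
import math


def softmax(logits, temperature):
    """Eq. (3): temperature-scaled softmax (max-subtracted for stability)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Logarithm of Eq. (3), computed without taking log of tiny numbers."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [z - c for z, c in zip(teacher_logits, center)]
    p_teacher = softmax(centered, teacher_temp)  # constant target (no gradient)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))


logits = [2.0, 1.0, 0.0]
smooth = softmax(logits, temperature=1.0)   # spread over all three outputs
sharp = softmax(logits, temperature=0.1)    # almost all mass on the first output

zero_center = [0.0, 0.0, 0.0]
loss_agree = distillation_loss([2.0, 1.0, 0.0], logits, zero_center)
loss_disagree = distillation_loss([0.0, 1.0, 2.0], logits, zero_center)
print(smooth[0], sharp[0], loss_agree, loss_disagree)
```

In this toy example the loss is small when the student ranks the outputs like the teacher and large when it ranks them in the opposite order, which is the signal that drives the student toward the teacher's confident prediction.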

Instead of using only two views, multiple cropped views of the image are created. Both global and local views are generated; the teacher model observes only global views, while the student model receives both local and global views as input. By using this strategy, the model learns to associate detailed information from small local patches with the broader context provided by the global views. As a result, the model becomes better at recognizing objects and patterns even when only a small portion of the image is visible. This method is known as self-distillation with no labels (DINO).
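The multi-crop strategy described above determines which teacher and student views are compared in the loss. The following sketch enumerates those pairs under two assumptions that are not stated in the text: two global and eight local crops are used, and a pair is skipped when the teacher and the student would see the identical view. The function name is hypothetical.

```python
def multicrop_pairs(num_global, num_local):
    """Return (teacher_view, student_view) index pairs compared in the loss.

    Views 0 .. num_global - 1 are global crops; the remaining ones are local.
    The teacher sees only global crops, while the student sees every crop.
    """
    num_views = num_global + num_local
    return [
        (t, s)
        for t in range(num_global)   # teacher: global views only
        for s in range(num_views)    # student: global and local views
        if s != t                    # skip identical views
    ]


pairs = multicrop_pairs(num_global=2, num_local=8)
print(len(pairs))  # 2 * (10 - 1) = 18
```

The total loss would then be the average of the per-pair cross-entropy terms over this list.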

DINOv3 introduces a special technique, called gram anchoring, to improve the quality of dense visual features. During training, as features become less focused and the similarity maps become noisier, the model tends to make worse predictions for dense features. To address this issue, DINOv3 uses a previous version of the teacher model to regularize correlations across all patch pairs by computing cosine similarities between all patch token embeddings, forming what is called a Gram matrix. An additional loss function is introduced to encourage the Gram matrix of the student model to closely match that of the teacher model. This mechanism is referred to as “Gram Anchoring”. This regularization preserves the spatial relationships between local patches while still allowing the student features to be refined during training.
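The Gram matrix and the associated loss can be sketched as follows. This is a minimal pure-Python illustration, assuming the loss is the squared Frobenius distance between the two Gram matrices; the function names are hypothetical, and the toy tokens are two-dimensional rather than 768-dimensional.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def gram_matrix(patch_tokens):
    """Pairwise cosine similarities between all patch embeddings (P x P)."""
    unit = [l2_normalize(token) for token in patch_tokens]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def gram_anchoring_loss(student_tokens, anchor_tokens):
    """Squared Frobenius distance between student and anchor Gram matrices."""
    g_student = gram_matrix(student_tokens)
    g_anchor = gram_matrix(anchor_tokens)
    return sum(
        (s - a) ** 2
        for row_s, row_a in zip(g_student, g_anchor)
        for s, a in zip(row_s, row_a)
    )


anchor = [[1.0, 0.0], [0.0, 1.0]]        # two dissimilar patches
same_structure = [[3.0, 0.0], [0.0, 5.0]]  # rescaled, same pairwise similarities
collapsed = [[1.0, 0.0], [1.0, 0.0]]     # both patches became identical

print(gram_anchoring_loss(same_structure, anchor))  # 0.0
print(gram_anchoring_loss(collapsed, anchor))       # 2.0
```

Because the loss depends only on pairwise similarities, the student features remain free to change as long as the relationships between patches are preserved, which matches the behaviour described above.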

The main parameters used in DINOv3 for feature extraction are presented in Table 1, whereas the architecture of DINOv3 is illustrated in Figure 2. During training, the DINOv3 backbone served as a fixed feature extractor, with its parameters frozen. This strategy preserves the robustness of the self-supervised visual representations while ensuring stable optimization given the limited size of remote sensing captioning datasets.

3.2. Feature Aggregation and Caption Generation

Although DINOv3 captures global relationships through attention, its patch embeddings remain highly redundant and do not explicitly model local spatial continuity across neighboring patches. Remote sensing images often contain structured, quasi-linear patterns (such as road segments, river flows, coastlines, and agricultural boundaries) that benefit from a lightweight inductive bias favoring smooth transitions across neighboring patches. To introduce such structural guidance, a small LSTM aggregation module is incorporated.

The LSTM processes the patch tokens in a fixed spatial order (a flattened raster sequence), enabling it to aggregate locally coherent features while remaining compatible with transformer-based processing. Rather than treating patches as a true temporal sequence, the LSTM acts as a contextual smoother and a redundancy reducer, producing refined features that are better aligned with the decoder’s cross-attention. This improves the coherence of generated captions—particularly for spatially continuous objects—without significantly increasing the model complexity.

The aggregation input consists of the batch of patch tokens from the DINOv3 encoder, shaped as [Batch Size, 197, 768]. An LSTM cell with input and hidden dimensions of 768 processes each token sequentially. At each step t, the hidden state h_t is updated based on the current token x_t and previous states, forming the aggregated feature for that position. The output is a refined feature map of shape [Batch Size, 197, 768], which serves as the key and value inputs to the decoder’s cross-attention layers. This LSTM-based fusion enhances the model’s ability to handle the sequential nature of spatial features in remote sensing images, such as linear arrangements in roads or rivers, thereby improving the coherence of generated captions.
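The token-by-token aggregation described above can be sketched with a small, untrained LSTM written in plain Python. This is only a toy illustration under stated assumptions: the class name `LSTMAggregator` is hypothetical, the weights are random, a single sequence is processed rather than a batch, and the dimensions are reduced to 5 tokens of size 8 instead of 197 tokens of size 768.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class LSTMAggregator:
    """Single-layer LSTM run over patch tokens in raster order (toy sketch)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)

        def weights():
            return [[rng.uniform(-scale, scale) for _ in range(dim)]
                    for _ in range(dim)]

        # One (W_x, W_h, bias) triple per gate:
        # input (i), forget (f), cell candidate (g), output (o).
        self.params = {gate: (weights(), weights(), [0.0] * dim)
                       for gate in ("i", "f", "g", "o")}
        self.dim = dim

    def _pre_activation(self, gate, x, h):
        w_x, w_h, bias = self.params[gate]
        return [a + b + c for a, b, c in zip(matvec(w_x, x), matvec(w_h, h), bias)]

    def forward(self, tokens):
        h = [0.0] * self.dim  # hidden state h_0
        c = [0.0] * self.dim  # cell state c_0
        outputs = []
        for x in tokens:  # step t consumes token x_t
            i = [sigmoid(v) for v in self._pre_activation("i", x, h)]
            f = [sigmoid(v) for v in self._pre_activation("f", x, h)]
            g = [math.tanh(v) for v in self._pre_activation("g", x, h)]
            o = [sigmoid(v) for v in self._pre_activation("o", x, h)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            outputs.append(h)  # h_t is the aggregated feature at position t
        return outputs


random_source = random.Random(1)
tokens = [[random_source.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
aggregator = LSTMAggregator(dim=8)
refined = aggregator.forward(tokens)
print(len(refined), len(refined[0]))  # 5 8
```

The output keeps one vector per input position, so the refined sequence has the same shape as the input and can be passed on as keys and values for cross-attention.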

The aggregated features are fed into a transformer decoder to generate natural-language captions in an auto-regressive manner. The decoder follows a standard architecture with three layers, each comprising masked multi-head self-attention, multi-head cross-attention, and a feed-forward network. The word embedding and decoder dimensions are both set to 512, with eight attention heads and a query/key/value dimension of 64 (computed as 512/8). The FFN inner dimension is set to 768 to align with the encoder’s output.
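As a quick consistency check on the decoder hyperparameters listed above, the sketch below collects them in a small configuration object and builds the causal mask used by masked self-attention. The names `DecoderConfig` and `causal_mask` are hypothetical and serve only to illustrate the stated dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Decoder hyperparameters as stated in the text (hypothetical container)."""
    num_layers: int = 3
    d_model: int = 512
    num_heads: int = 8
    ffn_dim: int = 768

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.num_heads


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


cfg = DecoderConfig()
mask = causal_mask(4)
print(cfg.head_dim)  # 64
print(mask[0])       # [True, False, False, False]
```

The lower-triangular mask is what prevents a position from attending to later tokens during auto-regressive generation.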

Input caption tokens (ground-truth tokens during training, previously generated tokens during inference) are embedded into 512-dimensional vectors and augmented with positional encoding. Masked self-attention ensures causality by allowing each position to attend only to preceding tokens. Cross-attention uses queries from the self-attention output (512-d) and keys/values from the aggregated visual features (768-d), enabling effective alignment between textual descriptions and visual elements. The final decoder output is projected through a linear layer with a weight matrix of shape (512, V), which yields a V-dimensional score vector at each position, where V denotes the caption vocabulary size derived from the datasets. During training, cross-entropy loss with teacher forcing is employed, while during inference, beam search is used to approximate the most probable caption. The overall process, from feature extraction to caption generation, is illustrated in Figure 3.

The integration of DINOv3 with the Transformer fusion module is expected to provide three main improvements:

Enhanced spatial discriminability through patch-level self-supervised features.
Improved semantic consistency due to domain-agnostic representation learning.
Stronger global–local context modeling via Transformer-based attention.
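The inference-time beam search mentioned in the decoding description can be sketched as follows. This is a generic textbook version, not the authors' decoder: the next-token model is a hand-made lookup table (`TABLE`, `toy_model`), the beam width is arbitrary, and no length normalization is applied. It shows why keeping several hypotheses can beat greedy decoding.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_width=3, max_len=10):
    """next_log_probs(prefix) returns a dict mapping token -> log-probability."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, log_p in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + log_p))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda cand: cand[1])


# Toy next-token distribution (made up for illustration).
TABLE = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"road": 0.5, "river": 0.5},
    ("<s>", "the"): {"road": 0.9, "river": 0.1},
}


def toy_model(prefix):
    probs = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_seq, greedy_score = beam_search(toy_model, "<s>", "</s>", beam_width=1)
beam_seq, beam_score = beam_search(toy_model, "<s>", "</s>", beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

In this toy table, greedy decoding commits to the locally best first word and ends with a sequence probability of 0.30, while a beam of width 2 recovers the sequence with probability 0.36. Beam search remains an approximation and does not guarantee the globally most probable caption.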

Although pure Transformer decoders have demonstrated strong performance in large-scale vision-language tasks, they often require extensive labeled data to avoid overfitting. In remote sensing captioning, where datasets are relatively small, we adopt a hybrid Transformer–LSTM decoder. The Transformer component captures rich global visual context, while the LSTM enforces sequential linguistic consistency during caption generation. This combination promotes stable training and semantically coherent captions by leveraging the strengths of both architectures.

Although the proposed framework follows a standard encoder–decoder paradigm, it is distinguished by its use of DINOv3, a label-free, self-distilled Vision Transformer that learns semantically rich and domain-agnostic representations well suited to remote sensing imagery. Unlike CLIP or supervised ViT backbones, DINOv3 does not rely on image-to-text alignment or large-scale labeled natural image datasets, mitigating domain shift and annotation scarcity. Furthermore, architectural advances in DINOv3, such as Gram anchoring and register tokens, enhance spatial coherence and the stability of dense features, which are critical for describing fine-grained remote sensing structures. Finally, the hybrid Transformer–LSTM decoder is designed to leverage the structured patch embeddings produced by DINOv3, promoting spatial continuity and improving semantic alignment in the generated captions.

3.3. Experimentation

3.3.1. Dataset Details

To evaluate the proposed remote sensing image captioning framework across both standard scene captioning benchmarks and complex multi-modal geospatial datasets, experiments were conducted on five public datasets. These datasets are organized into two categories based on their annotation style and application focus.

Category I Standard Remote Sensing Datasets: Two widely used remote sensing datasets, namely UCM-Captions and RSICD, were employed in this study. The UCM-Captions dataset is relatively small, whereas RSICD is the largest in this domain. RSICD contains 10,921 remote sensing images covering 30 common scene categories, such as airports, mountains, and farmland. Each image has a resolution of 224 × 224 pixels and is annotated with five descriptive sentences. In total, the dataset includes 54,605 sentences, of which 24,333 are unique. Figure 4 presents sample images from both datasets along with their corresponding descriptions. Following the standard data split protocol used in prior work, 8000/1000/1921 images were used for training, validation, and testing on RSICD, respectively, and 1680/210/210 for UCM-Captions, ensuring direct comparability with published results.
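As a quick consistency check of the figures quoted above, the split sizes can be summed and compared with the dataset totals. The numbers come from the text (the UCM-Captions total of 2100 images is the one stated later in the comparative analysis); the dictionary layout is only illustrative.

```python
SPLITS = {
    "RSICD": {"train": 8000, "val": 1000, "test": 1921},
    "UCM-Captions": {"train": 1680, "val": 210, "test": 210},
}
TOTAL_IMAGES = {"RSICD": 10921, "UCM-Captions": 2100}
CAPTIONS_PER_IMAGE = 5

for name, split in SPLITS.items():
    total = sum(split.values())
    if total != TOTAL_IMAGES[name]:
        raise ValueError(f"{name}: splits sum to {total}, expected {TOTAL_IMAGES[name]}")
    fractions = {part: round(count / total, 4) for part, count in split.items()}
    print(name, total, fractions)

rsicd_sentences = TOTAL_IMAGES["RSICD"] * CAPTIONS_PER_IMAGE
print(rsicd_sentences)
```

The RSICD split is roughly 73/9/18 percent, while UCM-Captions follows an 80/10/10 split, and five captions per RSICD image reproduce the stated sentence total.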

Category II Advanced Domain-Specific Datasets: This category includes RSITMD, DisasterM3, and GeoChat, which introduce richer semantic content, multi-modal reasoning requirements, and domain-specific captioning challenges beyond conventional scene descriptions. Sample images from these three datasets are shown in Figure 5.

RSITMD (Remote Sensing Image–Text Multi-Modal Dataset) contains 4743 high-resolution remote sensing images, each paired with five textual descriptions.

DisasterM3 is designed for disaster monitoring and emergency response and features satellite images depicting events such as floods, earthquakes, wildfires, and infrastructure damage.

GeoChat is a vision–language dataset aimed at geospatial reasoning and conversational understanding. It pairs satellite images with descriptive and dialogue-style captions that encode geographic context, spatial relationships, and location-aware semantics.

3.3.2. Evaluation Metrics

Model performance is evaluated with automatic metrics that score the similarity between candidate and reference sentences from different perspectives: BLEU [39] measures n-gram precision, METEOR [40] measures semantic similarity, ROUGE-L measures recall-based matching, and CIDEr [41] measures consensus with the reference captions. Higher metric values indicate better caption generation performance.
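To make the n-gram precision idea concrete, the following is a minimal sentence-level BLEU sketch with clipped n-gram counts and a brevity penalty. It is a simplification for illustration: it applies no smoothing and scores a single sentence, whereas standard evaluation scripts aggregate statistics over the whole test corpus. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "many cars are parked in the parking lot".split()
print(bleu(reference, [reference]))                       # identical caption
print(bleu(["a", "road"], [["a", "wide", "road"]], 1))    # short candidate, BLEU-1
```

The second call shows the brevity penalty at work: every candidate word appears in the reference, yet the score is below 1 because the candidate is shorter than the reference.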

3.3.3. Experimentation and Parameter Setting

Training was conducted on a high-performance computing system equipped with two NVIDIA Tesla V100 GPUs, each with 32 GB of memory. The use of dual GPUs enabled parallel processing and significantly reduced training time. The model was implemented using PyTorch (version 2.1.0), leveraging GPU acceleration and automatic differentiation. Training was performed for a maximum of 100 epochs, with early stopping applied if no improvement in validation performance was observed for 10 consecutive epochs. The batch size was set to 32 to ensure efficient utilization of the GPU while maintaining stable gradient updates.
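The stopping rule described above (at most 100 epochs, patience of 10) can be sketched as below. The text does not say which validation quantity is monitored, so this sketch assumes a score where higher is better; the class name, function name, and the validation scores are made up for illustration.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(validation_scores, max_epochs=100, patience=10):
    """Replay a list of per-epoch validation scores; return (last epoch, best score)."""
    stopper = EarlyStopping(patience)
    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        if stopper.step(score):
            return epoch, stopper.best
    return min(len(validation_scores), max_epochs), stopper.best


# Hypothetical scores: improvement for three epochs, then a plateau.
scores = [0.30, 0.45, 0.52] + [0.50] * 30
print(run_training(scores))
```

With these scores the best value is reached at epoch 3, and training halts ten non-improving epochs later, at epoch 13.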

Gradient clipping with a maximum norm of 5.0 was applied to prevent exploding gradients, which can occur in deep networks with recurrent components such as LSTMs. Both the encoder and decoder were optimized using the Adam optimizer with a learning rate of 0.0001. Doubly stochastic attention regularization was applied with a parameter of 1.0 to encourage uniform attention distribution across all image regions, thereby improving caption quality. Training and validation statistics, including loss values and evaluation metrics, were recorded every 25 batches to monitor convergence and detect potential issues such as overfitting or underfitting. The word embeddings were fine-tuned to capture the domain-specific vocabulary of the RSICD dataset.
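Two of the settings above lend themselves to a short numerical sketch. The first function assumes the usual form of the doubly stochastic attention penalty from attention-based captioning, weight × Σ_i (1 − Σ_t α_{t,i})², which pushes the total attention each region receives over all decoding steps towards 1. The second rescales a gradient vector when its L2 norm exceeds the threshold. Both are plain-Python illustrations with hypothetical names, not the authors' training code.

```python
import math


def doubly_stochastic_penalty(alphas, weight=1.0):
    """alphas[t][i] is the attention on region i at decoding step t."""
    num_regions = len(alphas[0])
    total = 0.0
    for i in range(num_regions):
        attention_over_time = sum(step[i] for step in alphas)
        total += (1.0 - attention_over_time) ** 2
    return weight * total


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale grads so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


balanced = [[1.0, 0.0], [0.0, 1.0]]  # each region attended exactly once
skewed = [[1.0, 0.0], [1.0, 0.0]]    # second region never attended
print(doubly_stochastic_penalty(balanced), doubly_stochastic_penalty(skewed))
print(clip_by_global_norm([6.0, 8.0]))
```

The penalty is zero when every region receives a total attention of 1 and grows when some regions are ignored, which is the sense in which it encourages coverage of all image regions.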

4. Results

4.1. Quantitative Results

4.1.1. RSICD Dataset Performance

This subsection reports the training performance of the proposed remote sensing image captioning model, which uses DINOv3 for feature extraction and a ViT-based transformer architecture with an LSTM layer, on the RSICD dataset. The results are shown in Figure 6, covering training loss, top-5 accuracy, the evolution of BLEU scores, and a comparison of advanced metrics (initial vs. final epoch).

Training Loss: The loss fluctuated between 0.5 and 2.5 with no apparent decreasing trend, suggesting possible instability or overfitting during training (Figure 6a).
Top-5 Accuracy: After some early variation, this measure leveled off at 62.5–65%, which means that the model ranks correct captions among its top 5 predictions with fair consistency (Figure 6b).
BLEU Scores: BLEU-1, BLEU-2, BLEU-3, and BLEU-4 rose steadily, with BLEU-1 approaching 0.9 near epoch 70, indicating that n-gram precision improved over time (Figure 6c).
Advanced Metrics (Initial vs. Final): In the initial epoch, the scores were low (e.g., METEOR: 0.487, ROUGE-L: 0.629, CIDEr: 0.829), but by epoch 99 they had improved substantially (METEOR: 1.140, ROUGE-L: 1.487, CIDEr: 2.129), meaning that the quality and relevance of the captions improved (Figure 6d).

4.1.2. UCM Dataset Performance

The training dynamics of the proposed remote sensing image captioning model, which uses DINOv3 for feature extraction and a ViT-based transformer architecture with an LSTM layer, were analyzed over 100 epochs on the UCM-Captions dataset. The performance trends are shown in Figure 7, covering training loss, top-5 accuracy, the evolution of BLEU scores, and a comparison of advanced metrics (initial vs. final epoch).

Training Loss: The training loss decreased gradually from an initial value of about 6 to almost 0 by epoch 100, which implies model convergence (Figure 7a).
Top-5 Accuracy: The accuracy leveled off at around 90% after about 20 epochs, indicating that the model rapidly learned to rank correct captions among its top 5 predictions (Figure 7b).
BLEU Scores Evolution: BLEU-1, BLEU-2, BLEU-3, and BLEU-4 increased over time, approaching 0.9, 0.8, 0.75, and 0.7, respectively, by the final epoch, reflecting improved n-gram precision (Figure 7c).
Advanced Metrics (Initial vs. Final): In the first epochs, the scores were low (e.g., METEOR: 0.496, ROUGE-L: 0.712, CIDEr: 0.839), but by the last epoch they had increased significantly (METEOR: 3.180, ROUGE-L: 2.189, CIDEr: 1.309), reflecting better caption quality and relevance (Figure 7d).

The training loss decreases quickly and the top-5 accuracy stabilizes, which shows the model’s early ability to fit the training data. The steady increase in BLEU scores across n-gram orders demonstrates the hybrid ViT-LSTM architecture’s ability to capture sequential and contextual dependencies, which may, in turn, be supported by the rich feature extraction enabled by DINOv3. The significant improvement in the advanced metrics between the first and final epochs indicates that the model learns to refine caption semantics: the METEOR and ROUGE-L gains indicate higher fluency and content overlap, whereas the CIDEr improvement indicates greater agreement with human consensus.

The plateau in top-5 accuracy after about 20 epochs shows that the model reached its ranking performance limit early, which may be due to the limitations of the dataset or saturation of the feature space. The smaller improvement in CIDEr relative to METEOR and ROUGE-L may suggest that, at higher CIDEr scores, the model cannot capture the subtle term weighting present in the datasets (e.g., the complexity of RSICD vs. the uniformity of UCM-Captions). The final BLEU-4 (70.619%) and CIDEr (5.139) on RSICD are higher than the previously reported benchmarks of 54.12% and 3.057, respectively, and are consistent with the training trends. This implies that longer training benefits the model, and that CIDEr could be improved further through techniques such as data augmentation and attention refinement, particularly on heterogeneous datasets such as RSICD.

4.2. Human Evaluation and Weighted Score Analysis

In addition to automated metrics, a small-scale human evaluation was conducted on 30 randomly selected images from each dataset. Three independent evaluators assessed caption quality based on fluency and semantic relevance using a 5-point Likert scale. The proposed approach achieved consistently higher fluency and relevance scores compared to baseline methods, confirming its superior qualitative performance and improved semantic alignment with ground-truth descriptions. To provide a unified performance indicator, a Weighted Score (WS) was introduced by averaging four widely used evaluation metrics: BLEU-4, METEOR, ROUGE-L, and CIDEr. Table 2 shows the weighted average of evaluation metrics. This composite score enables balanced performance comparison by jointly considering n-gram precision, semantic similarity, recall-based matching, and consensus-based evaluation. The weighted score is computed as:

\text{Weighted Score} = \frac{\text{BLEU-4} + \text{METEOR} + \text{ROUGE-L} + \text{CIDEr}}{4} \qquad (4)
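Equation (4) is a plain arithmetic mean with equal weights, which a few lines make explicit. The input values below are placeholders chosen for easy arithmetic, not results from the paper.

```python
def weighted_score(bleu4, meteor, rouge_l, cider):
    """Equal-weight average of the four metrics, as in Equation (4)."""
    return (bleu4 + meteor + rouge_l + cider) / 4.0


# Placeholder inputs (not reported results).
example = weighted_score(bleu4=0.5, meteor=0.25, rouge_l=0.75, cider=2.5)
print(example)
```

Because CIDEr is not bounded by 1 in the way the other three metrics conventionally are, it contributes most of the total in this example (2.5 of 4.0), so differences in the composite score tend to follow CIDEr.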

5. Discussion

5.1. Qualitative Analysis

5.1.1. Attention Visualizations

In Figure 8, the attention heatmaps show that the model sequentially prioritizes important areas of the image and captures the spatial relationship between the “wide road” and the parking lot. The gradual shift of attention from the general scene to particular features (road and parking lot) indicates that the ViT transformer encoder, together with the LSTM decoder, uses the DINOv3 features to guide caption generation. High-intensity bands in the heatmaps for “wide” and “road” suggest a high level of localization, and the heatmap for “next to” reveals the model’s ability to process relational semantics, which is vital for remote sensing captioning.

The changing intensity and occasionally diffuse focus may, however, reflect uncertainty at the start of decoding, or the model’s use of global context to anchor its attention before narrowing its focus. RSICD is more complex (e.g., overlapping objects, different resolutions) than simpler datasets such as UCM-Captions, so it may challenge the model. Still, the heatmaps are consistent with the final caption results (e.g., BLEU-4: 70.619%, CIDEr: 5.139), indicating that the model is robust.

The training curves on the RSICD data (Figure 6) show complex training dynamics. The non-uniform training loss could indicate that the model has difficulty converging completely, perhaps because the data are so diverse and complex (e.g., different resolutions and overlapping objects) that DINOv3 feature extraction and ViT-LSTM decoding are impaired. The stabilization of top-5 accuracy at 62.5–65% shows a decent but not high ability to rank correct captions, perhaps because the model struggles with the complex semantics of the dataset. The gradual improvement in BLEU scores is consistent with the reported results (BLEU-4: 70.619%), confirming that the model improves n-gram precision over epochs, which can be explained by the transformer encoding and the LSTM’s sequential processing.

The significant rise in the advanced metrics between the initial and final epochs (e.g., METEOR from 0.487 to 1.140 and CIDEr from 0.829 to 2.129) indicates that longer training yields more fluent and relevant captions. Still, these values are lower than those in the final table (CIDEr: 5.139). This difference may mean that the 70-epoch training presented there is part of a more extended training run, and that the final measurements reflect additional optimization. The final results of the model, compared with previous literature reporting a BLEU-4 of 54.12% and a CIDEr of 3.057 on RSICD, indicate a competitive advantage. Still, the unsteady loss curve shows that improvements could be made. Regularization methods or dataset-specific fine-tuning might be employed in the future to stabilize the training process and perhaps bring the training-time CIDEr closer to the reported 5.139, taking advantage of the abundance of RSICD captions.

5.1.2. Success and Failure Cases

The performance of the model differs across the three examples, which highlights its strengths and weaknesses on the RSICD dataset. In Figure 9a, the prediction perfectly matches the reference caption and is supported by a focused heatmap, in line with the high BLEU-4 (0.706) and CIDEr (5.139) scores. This demonstrates the model’s ability to recognize and localize objects with high accuracy in clear scenes. In Figure 9b, the model captures the general meaning, namely that cars occupy the lot, but it does not capture the empty spaces, implying limitations in modeling fine-grained spatial features. This may be due to unresolved limitations in DINOv3 features or ViT attention when representing empty regions. This behavior is partially reflected in the moderate METEOR (0.487) and ROUGE-L (0.829) scores, which indicate reasonable but not strong content overlap. The complete mismatch in the third prediction, shown in Figure 9c, indicates a significant failure, likely caused by confusion between a parking lot and a natural landscape (chaparral). The diffuse heatmap indicates that the attention mechanism was unable to anchor to relevant features, which could be due to dataset diversity or training instability, as indicated by the fluctuating loss values (0.5–2.5). Although the model still outperforms benchmark methods, these errors indicate the need for further improvements in contextual awareness. Enhancements such as multi-scale feature integration or more advanced attention mechanisms may reduce false predictions and improve performance in challenging scenes, which would also be reflected in improved evaluation metrics.

5.2. Comparative Analysis with Baseline Models

The experiments were initially conducted on two standard remote sensing datasets (UCM-Captions and RSICD), using baseline methods, starting with VGG16 as the feature extractor and an LSTM for caption generation. The baseline models (VGG16 + LSTM and ResNet50 + LSTM + Transformer) were implemented using the same training procedure, hyperparameters, and data splits as the proposed model to ensure a fair comparison. All models were trained from scratch on the same hardware using identical optimization settings. Two incremental modifications were then introduced. First, VGG16 was replaced with ResNet50 as the feature extractor, which is considered a state-of-the-art feature extractor in many studies, such as [3]. Second, Transformer layers were added alongside the LSTM to capture stronger temporal dependencies. Finally, the latest DINOv3 model was adopted as the feature extractor and combined with Transformer layers, as discussed in the methodology section. The results of all three techniques are compared in Table 3, which shows that the proposed DINOv3-based model, leveraging self-supervised feature extraction and a hybrid transformer–LSTM decoder, delivers robust performance in remote sensing image captioning.

To ensure a fair comparison with prior works, the same test split, preprocessing steps, vocabulary size, and tokenization strategy were used, and BLEU scores were computed using the standard COCO evaluation script. The observed performance improvements therefore reflect the effectiveness of our model’s enhanced feature representations and attention mechanisms, rather than differences in evaluation protocols or dataset handling.

Across both datasets, higher BLEU scores indicate closer agreement at the word and phrase levels between the generated and reference captions, suggesting the model’s ability to retrieve fine-grained spatial and semantic information from aerial images. This improvement is likely due to the inclusion of Transformer encoding and decoding layers, in which multi-head attention captures global image context, while the LSTM layer models temporal dependencies during caption generation. This combination yields more coherent and fluent captions, as evidenced by higher METEOR and ROUGE-L scores.

Comparing the two datasets, the model performs better on UCM-Captions than on RSICD, as measured by BLEU (1–4), METEOR, and ROUGE-L. This difference can be attributed to the characteristics of the datasets. UCM-Captions comprises 2100 images across 21 land-use categories, with relatively simple, repetitive scene structures (e.g., agricultural fields, urban areas), which enables DINOv3 features and the Vision Transformer architecture to generalize more effectively. In contrast, RSICD contains 10,921 images with greater diversity and complexity, including varying resolutions, overlapping objects, and abstract landforms, which pose additional challenges for feature extraction and caption decoding. Despite this, the model’s CIDEr score is significantly higher on RSICD (5.139 vs. 3.379 on UCM-Captions), indicating greater consistency and relevance when weighting more diverse captions. This gap indicates that CIDEr is sensitive to dataset-specific linguistic consensus: the richer descriptive vocabulary of RSICD can inflate the measure if the model retains important terms, whereas the more limited annotations of UCM-Captions do not, so the model’s high precision on UCM-Captions does not translate into similarly high consensus scores. The proposed approach demonstrates significant gains over state-of-the-art approaches. In general, these findings highlight the effectiveness of self-supervised pretrained features, such as DINOv3, in combination with transformer-based captioning for solving remote sensing problems.

To further assess robustness, the proposed model was also evaluated on three advanced datasets. These datasets are primarily designed for cross-modal retrieval, where evaluation emphasizes retrieval accuracy (i.e., how well images and text are matched) rather than classical image captioning metrics. BLEU, ROUGE, METEOR, and CIDEr are typically used for caption quality, not retrieval benchmarking. The results for these three datasets are reported in Table 4. The results demonstrate that the proposed method achieves consistent performance across all datasets. On DisasterM3, which contains complex disaster-related scenes with high intra-class variability, the model achieves high BLEU scores (up to 0.91 for BLEU-4) and the highest CIDEr score (5.97), indicating accurate and semantically rich caption generation. This highlights the model’s ability to capture fine-grained contextual details in challenging real-world scenarios. For GeoChat, despite its conversational and diverse annotation style, the proposed approach maintains competitive performance, achieving a CIDEr score of 4.33 along with strong ROUGE-L and METEOR values. These results suggest effective generalization even when caption structures differ from traditional descriptive annotations. Similarly, for RSITMD, the model again demonstrates robust performance, achieving balanced BLEU scores across all n-gram levels and a high CIDEr score of 5.85. This confirms that the proposed framework effectively handles multi-object scenes and varied land use categories commonly present in remote sensing imagery.

5.3. Comparison with State of the Art

The results of standard remote sensing datasets are compared with existing methods in the literature to further assess the robustness of the proposed model. Table 5 presents a comparative analysis against several recently published methods for remote sensing image captioning using the RSICD caption dataset. Evaluation metrics include BLEU scores (BLEU-1 to BLEU-4), CIDEr, and ROUGE-L, which are standard benchmarks for assessing caption fluency, accuracy, and relevance. For advanced datasets, direct comparison is not feasible, as these datasets are relatively recent and are primarily evaluated using retrieval-based metrics such as Recall@K (R@1, R@5, R@10) and mean Recall (mR), which measure ranking accuracy rather than language quality [42]. As shown in the table, the proposed model achieves the highest CIDEr score of 5.14 and BLEU-4 score of 0.706, indicating strong alignment between generated captions and ground-truth references for standard datasets. The model scores 0.8814 in BLEU-1 and 0.829 in ROUGE-L, reflecting its ability to maintain both content accuracy and linguistic structure.
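For readers less familiar with the retrieval metrics mentioned above, the sketch below computes Recall@K and a mean Recall for a single retrieval direction. It assumes that a query counts as a hit when at least one relevant item appears in its top K results, and that mR averages R@1, R@5, and R@10; retrieval benchmarks often average over both image-to-text and text-to-image directions. The function names and the rankings are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_ids)


def mean_recall(ranked_ids, relevant_ids, ks=(1, 5, 10)):
    """Average of Recall@K over the given cut-offs (single retrieval direction)."""
    return sum(recall_at_k(ranked_ids, relevant_ids, k) for k in ks) / len(ks)


# Two toy queries: the first finds its match at rank 1, the second at rank 6.
ranked = [[3, 1, 2, 7, 8, 9], [5, 6, 7, 8, 9, 4]]
relevant = [[3], [4]]
for k in (1, 5, 10):
    print(k, recall_at_k(ranked, relevant, k))
print(mean_recall(ranked, relevant))
```

These scores depend only on where the correct item lands in a ranked list, which is why they are not interchangeable with caption-quality metrics such as BLEU or CIDEr.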

Recently published methods, such as GCN-based knowledge embedding, dual feature enhancement, and TextGCN with LSTM, show comparatively lower performance than the proposed approach, particularly in the CIDEr and BLEU-4 metrics. Methods based on positional channel semantic fusion and the region-aware multi-label framework come close to the proposed approach but still fall short, as their architectures focus primarily on region-level or channel-based enhancements without fully integrating global scene context, sequential language modeling, and cross-modal attention. The proposed hybrid architecture, combining DINOv3, Transformer layers, and LSTM layers, captures both object-level details and holistic scene relationships, resulting in captions that are more contextually accurate and linguistically fluent.

Table 6 shows results on the UCM-Captions dataset, compared against the same state-of-the-art models as for RSICD. On BLEU-1, BLEU-2, BLEU-3, and BLEU-4, the proposed model consistently outperforms all compared approaches. The proposed hybrid model components collectively enhance both object-level recognition and global scene understanding, enabling more accurate and contextually relevant caption generation. For the ROUGE-L score, the enhanced transformer approach slightly outperforms the proposed approach by 0.11, and the CIDEr score of the proposed model also leaves room for improvement. These small differences indicate comparable semantic matching across these metrics, while the proposed model maintains overall superiority across the majority of evaluation criteria.

Although the proposed model achieved high BLEU and CIDEr scores on UCM, such scores on a small dataset can be a sign of overfitting. To mitigate this risk, we employed early stopping (patience = 10 epochs) and dropout regularization (rate = 0.3) in the decoder, and monitored validation metrics throughout training. The validation learning curves in Figure 7 show consistent improvement without divergence from training performance, further supporting generalization rather than overfitting.

The consistent performance across both validation and test sets, along with meaningful attention visualizations, indicates that the model has learned generalizable representations rather than memorizing the training data. Some evaluation metrics, such as CIDEr and BLEU-4, are sensitive to caption diversity in the dataset. For instance, the UCM dataset contains many images with highly similar or identical captions, which can lead to lower absolute CIDEr scores even for strong models. Similarly, BLEU-4 scores on datasets with limited caption variability may appear lower than on more diverse datasets like RSICD. These lower absolute scores do not indicate poor model performance; rather, they reflect the inherent characteristics of the dataset. Across datasets with higher caption diversity, our model consistently outperforms prior approaches, demonstrating its effectiveness. The observed performance gains across BLEU, METEOR, ROUGE-L, and CIDEr confirm the effectiveness of the proposed design. In particular, improvements in CIDEr and METEOR indicate better semantic alignment and richer descriptive content, validating our expectation that DINOv3-derived features improve caption quality beyond conventional supervised backbones. The use of DINOv3 enhances the practicality of the proposed system by eliminating the need for extensive labeled training data, thereby reducing annotation costs and improving adaptability across diverse geographic regions and sensor types.

By leveraging self-supervised pretraining, the proposed approach eliminates the need for large-scale manual annotation during representation learning and requires labeled samples only for downstream fine-tuning. This results in an estimated 70–80% reduction in annotation cost compared to fully supervised pretraining pipelines, consistent with prior self-supervised learning studies. The proposed framework can support real-world digital transformation (DX) use cases, including automated disaster damage assessment, rapid post-event situation awareness, and urban growth monitoring in data-scarce regions. For example, near real-time satellite captioning can assist municipal authorities in infrastructure monitoring and emergency response planning. Such applications enable faster decision-making and improved situational awareness in operational environments.

5.4. Lightweight Deployment Pipeline for Real-Time RS Captioning

To enable real-time deployment, we propose a lightweight web-based inference pipeline. As illustrated in Figure 10, the pipeline consists of four main components:

(i): A web-based user interface for satellite image upload.
(ii): A preprocessing module for resizing, normalization, and format conversion.
(iii): A trained DINOv3-based encoder–decoder inference engine running on GPU/CPU.
(iv): A visualization module for real-time caption display.

The modular architecture enables efficient inference with low latency and supports flexible deployment on local servers, edge devices, or cloud platforms, making it suitable for time-critical operational remote sensing applications.
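The flow through the preprocessing and inference components can be outlined with a toy sketch. Everything here is an assumption for illustration: images are nested Python lists, resizing is nearest-neighbour, the normalization constants are placeholders, and `stub_caption_model` stands in for the trained encoder–decoder, which is not reproduced here.

```python
def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2-D list image to size x size."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def normalize(image, mean=0.5, std=0.5):
    """Shift and scale pixel values (placeholder constants)."""
    return [[(pixel - mean) / std for pixel in row] for row in image]


def preprocess(image, size=224):
    return normalize(resize_nearest(image, size))


def stub_caption_model(tensor):
    """Placeholder for the trained captioning model; returns a fixed string."""
    return "<caption from trained model>"


def run_pipeline(upload, model=stub_caption_model, size=224):
    """Upload -> preprocessing -> inference -> payload for the display module."""
    tensor = preprocess(upload, size)
    caption = model(tensor)
    return {"caption": caption, "input_size": (len(tensor), len(tensor[0]))}


result = run_pipeline([[0.0, 1.0], [1.0, 0.0]], size=4)
print(result)
```

Keeping each stage as a separate function mirrors the modular layout described above, so the inference engine can be swapped between CPU and GPU back ends without touching the upload or display code.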

6. Conclusions

The proposed remote sensing image captioning model (i.e., a DINOv3 feature extractor with a ViT-LSTM hybrid decoder) is shown to perform well on the traditional remote sensing datasets, namely RSICD and UCM-Captions. The model achieves both high n-gram accuracy and semantic coherence, showing improved BLEU, METEOR, and ROUGE-L scores and surpassing previous benchmarks. Qualitative analyses using attention heatmaps confirm the model’s ability to recognize and describe key spatial features (e.g., wide roads adjacent to parking lots or dense urban blocks near water bodies). The model also performs well on advanced datasets, including RSITMD, GeoChat, and DisasterM3, which are primarily designed for cross-modal image–text retrieval rather than image caption generation. These findings further support the effectiveness of self-supervised Vision Transformers as strong encoders for multi-modal understanding in remote sensing applications. To the best of our knowledge, this study is the first to employ DINOv3 as a feature extractor for both general image analysis and remote sensing imagery.

The proposed model also enables the automatic generation of spatially grounded textual descriptions from remote sensing images by combining DINOv3 self-supervised visual representations with a ViT–LSTM decoder. This design reduces dependence on labeled training data while preserving fine-grained spatial semantics, making the model suitable for large-scale remote sensing analysis and decision-support systems.

Nevertheless, the paper identifies several limitations, such as training instability and semantic drift in complex scenes, which may result in misclassification of visually similar land cover types. Stability and contextual accuracy could be further improved by addressing these challenges through increased regularization, multi-scale feature integration, and more context-sensitive decoding strategies. Overall, the results provide a solid foundation for future research on transformer-based, self-supervised models for automated semantic interpretation of remote sensing images. Future research will explore integrating multi-modal inputs, such as textual metadata, spatial coordinates, and temporal sequences, to enhance contextual understanding. Moreover, incorporating multi-scale ViT backbones or pyramid feature fusion could improve performance on small-object and fine-grained scenes. Future work includes extending the framework to multi-agent systems for cross-domain decision support.

Author Contributions

Conceptualization, M.M., M.U., and A.S.; methodology, M.M., A.S., L.A.C.-N. and F.H.; validation, M.M., A.S. and F.H.; formal analysis, M.M. and F.H.; investigation, M.M.; resources, M.U.; writing—original draft preparation, M.M., A.S., F.H., L.A.C.-N., and M.U.; writing—review and editing, A.S. and M.U.; supervision, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available. The Remote Sensing Image Captioning Dataset (RSICD) can be accessed at https://github.com/201528014227051/RSICD_optimal (accessed on 1 March 2026), or https://mega.nz/folder/EOpjTAwL#LWdHVjKAJbd3NbLsCvzDGA (accessed on 1 March 2026) and the UC Merced Land Use Dataset (UCM-Captions) is available at https://mega.nz/folder/wCpSzSoS#RXzIlrv–TDt3ENZdKN8JA (accessed on 1 March 2026). The preprocessed data splits, trained model weights, and code for reproducing the results presented in this study are available from the corresponding author upon reasonable request and will be made publicly available upon acceptance of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BLEU	Bilingual Evaluation Understudy
CIDEr	Consensus-based Image Description Evaluation
CNN	Convolutional Neural Network
DINOv3	Distillation with No Labels version 3
FFN	Feed-Forward Network
HOG	Histogram of Oriented Gradients
LSTM	Long Short-Term Memory
METEOR	Metric for Evaluation of Translation with Explicit Ordering
RNNs	Recurrent Neural Networks
ROUGE-L	Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)
RSICD	Remote Sensing Image Captioning Dataset
SIFT	Scale-Invariant Feature Transform
SSL	Self-Supervised Learning
UCM	UC Merced
ViT	Vision Transformer

References

1. Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and applications. Math. Probl. Eng. 2022, 2022, 5880959.
2. Saha, S.; Xu, L. Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies. Neurocomputing 2025, 643, 130417.
3. Mehmood, M.; Hussain, F.; Shahzad, A.; Ali, N. Classification of Remote Sensing Datasets with Different Deep Learning Architectures. Earth Sci. Res. J. 2024, 28, 409–419.
4. Zhao, A.; Yang, W.; Chen, D.; Wei, F. Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion. Electronics 2024, 13, 3605.
5. Li, Z.; Zhao, W.; Du, X.; Zhou, G.; Zhang, S. Cross-modal retrieval and semantic refinement for remote sensing image captioning. Remote Sens. 2024, 16, 196.
6. Guo, J.; Li, Z.; Song, B.; Chi, Y. TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning. Remote Sens. 2024, 16, 1843.
7. Li, Y.; Zhang, X.; Zhang, T.; Wang, G.; Wang, X.; Li, S. A patch-level region-aware module with a multi-label framework for remote sensing image captioning. Remote Sens. 2024, 16, 3987.
8. Abdal Hafeth, D.; Kollias, S. Insights into object semantics: Leveraging transformer networks for advanced image captioning. Sensors 2024, 24, 1796.
9. Elharrouss, O.; Himeur, Y.; Mahmood, Y.; Alrabaee, S.; Ouamane, A.; Bensaali, F.; Bechqito, Y.; Chouchane, A. ViTs as backbones: Leveraging vision transformers for feature extraction. Inf. Fusion 2025, 118, 102951.
10. Yang, S.; Wang, H.; Xing, Z.; Chen, S.; Zhu, L. Segdino: An efficient design for medical and natural image segmentation with dino-v3. arXiv 2025, arXiv:2509.00833.
11. Garg, M.; Dhiman, G. A novel content-based image retrieval approach for classification using GLCM features and texture fused LBP variants. Neural Comput. Appl. 2021, 33, 1311–1328.
12. Hu, W.S.; Li, H.C.; Pan, L.; Li, W.; Tao, R.; Du, Q. Spatial–spectral feature extraction via deep ConvLSTM neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4237–4250.
13. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
14. Ren, X.; Wei, W.; Xia, L.; Huang, C. A comprehensive survey on self-supervised learning for recommendation. ACM Comput. Surv. 2025, 58, 22.
15. Zhang, F.; Zhou, L.; Yu, X.; Gong, Z. CVT-SimCLR: Contrastive visual representation learning with Conditional Random Fields and cross-modal fusion. Inf. Fusion 2025, 127, 103651.
16. Wang, Z.; Zhou, S.; Wang, Y.; Yang, L.; Xing, M.; Wen, P. A Coherence-oriented Fast Time Domain Algorithm for UAV Swarm SAR Imaging with Trajectory Difference Correction and Data-Driven MOCO. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5213518.
17. Zhang, B.; Guo, X.; Yang, S. A Remote Sensing Semantic Self-Supervised Segmentation Model Integrating Local Sensitivity and Global Invariance. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13691–13700.
18. Liu, C.; Sun, H.; Xu, Y.; Kuang, G. Multi-source remote sensing pretraining based on contrastive self-supervised learning. Remote Sens. 2022, 14, 4632.
19. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 9650–9660.
20. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
21. Zeid, K.A.; Yilmaz, K.; de Geus, D.; Hermans, A.; Adrian, D.; Linder, T.; Leibe, B. DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation. arXiv 2025, arXiv:2503.18944.
22. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. Dinov3. arXiv 2025, arXiv:2508.10104.
23. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
24. Lu, N.; Wu, Y.; Feng, L.; Song, J. Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data. IEEE J. Biomed. Health Inform. 2018, 23, 314–323.
25. Lu, Y.; Wang, W.; Xue, L. A hybrid CNN-LSTM architecture for path planning of mobile robots in unknow environments. In Proceedings of the 2020 Chinese Control And Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2020; pp. 4775–4779.
26. Kandadi, T.; Shankarlingam, G. Drawbacks of Lstm Algorithm: A Case Study. 2025. Available online: https://ssrn.com/abstract=5080605 (accessed on 1 March 2026).
27. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 4055–4064.
28. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1860.
29. Saini, N.; Dubey, A.; Das, D.; Chattopadhyay, C. Advancing open-set object detection in remote sensing using multimodal large language model. In Proceedings of the Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 451–458.
30. Silva, J.D.; Magalhães, J.; Tuia, D.; Martins, B. Large language models for captioning and retrieving remote sensing images. arXiv 2024, arXiv:2402.06475.
31. Imam, M.F.; Marew, R.F.; Hassan, J.; Fiaz, M.; Aji, A.F.; Cholakkal, H. CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections. arXiv 2024, arXiv:2411.19346.
32. Feng, X.; Huang, J.; Chen, X.; Zhou, H.; Zhang, M.; Zhang, C.; Ye, F. Research on Hyperspectral Remote Sensing Alteration Mineral Mapping Using an Improved ViT Model. Comput. Geosci. 2025, 206, 106037.
33. Bu, D.; Xie, Z.; Wan, G. Optical remote sensing image caption research progress. In Proceedings of the International Conference on Image Processing and Artificial Intelligence (ICIPAI 2024); SPIE: Nuremberg, Germany, 2024; Volume 13213, pp. 149–159.
34. Tabbakh, A.; Barpanda, S.S. A deep features extraction model based on the transfer learning model and vision transformer “tlmvit” for plant disease classification. IEEE Access 2023, 11, 45377–45392.
35. Zeng, B.; Lu, R.; Mao, G. Multimodal knowledge retrieval of layout image text based on CLIP and ViT. Signal Image Video Process. 2025, 19, 1001.
36. Weng, X.; Pang, C.; Xia, G.S. Vision-Language Modeling Meets Remote Sensing: Models, datasets, and perspectives. IEEE Geosci. Remote Sens. Mag. 2025, 13, 276–323.
37. Yin, C.; Ye, Q.; Luo, J. A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 48, 821–829.
38. Zhang, B.; Yu, S.; Xiao, J.; Wei, Y.; Zhao, Y. Frozen CLIP-DINO: A Strong Backbone for Weakly Supervised Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4198–4214.
39. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
40. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72.
41. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2015; pp. 4566–4575.
42. Cheng, Q.; Zhou, Z.; Yuan, L.; Du, Y. Efficient Yet Effective: A Dynamic Self-Distillation Framework for Remote Sensing Image-Text Retrieval. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8002605.
43. Cheng, K.; Cambria, E.; Liu, J.; Chen, Y.; Wu, Z. KE-RSIC: Remote Sensing Image Captioning Based on Knowledge Embedding. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 4286–4304.
44. Zhao, W.; Yang, W.; Chen, D.; Wei, F. DFEN: Dual feature enhancement network for remote sensing image caption. Electronics 2023, 12, 1547.
45. Das, S.; Sharma, R. A textgcn-based decoding approach for improving remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2024, 22, 8000405.
46. Yang, C.; Li, Z.; Zhang, L. Bootstrapping interactive image–text alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607512.
47. Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS J. Photogramm. Remote Sens. 2025, 221, 64–77.

Figure 1. Working of DINOv3 as a feature extractor.

Figure 2. Architecture of DINOv3 used for feature extraction.

Figure 3. The graphical representation of the proposed image captioning model.

Figure 4. Sample images from RSICD and UCM datasets.

Figure 5. Sample images from DisasterM3, RSITMD and GeoChat Datasets.

Figure 6. Performance analysis of the proposed transformer-based captioning model on the RSICD dataset. (a) Training loss; (b) Top-5 accuracy; (c) BLEU scores; (d) ROUGE-L, METEOR and CIDEr scores.

Figure 7. Performance evaluation of the proposed transformer-based captioning model on the UCM-Captions dataset. (a) Training loss analysis; (b) Top-5 accuracy analysis; (c) BLEU score analysis; (d) METEOR, CIDEr and ROUGE-L analysis.

Figure 8. Word-level attention maps showing how the proposed hybrid model grounds each generated token in a semantically relevant image region.

Figure 9. Model predictions: (a) Accurate prediction, (b) Predicting the right sense, and (c) Wrong prediction.

Figure 10. Proposed Deployment Pipeline for Real-Time RS Captioning.

Table 1. Specifications of the DINOv3-B/16 backbone.

Parameter	Value	Description
Backbone	ViT-B/16	Vision Transformer for feature extraction
Params	85M	Trainable parameters in encoder
Blocks	12	Transformer encoder layers
Patch Size	16	$16 \times 16$ input patches
Pos. Embeddings	RoPE	Rotary positional embedding (adapted)
Activation	SwiGLU	FFN non-linear activation
FFN Dim.	3072	Inner FFN dimension ( $4 \times$ embedding)
Embed Dim.	768	Token/hidden dimension
Heads	12	Multi-head self-attention
Head Dim.	64	Per-head dimension ( $768 / 12$ )
Tokens/img	197	196 patches + 1 CLS token
DINO Head MLP	768–512–256	Projection head for embedding
DINO Prototypes	65K	SSL clustering prototypes
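
The derived entries in Table 1 (per-head dimension, FFN width, and tokens per image) can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the 224 × 224 input resolution is an assumption (it is the size at which 16 × 16 patches yield the 196 patches listed), and the function name `vit_derived_dims` is hypothetical.

```python
def vit_derived_dims(image_size: int, patch_size: int, embed_dim: int,
                     num_heads: int, ffn_ratio: int = 4) -> dict:
    """Derive per-head width, FFN width, and token count for a plain ViT."""
    if image_size % patch_size or embed_dim % num_heads:
        raise ValueError("image/patch and embed/head sizes must divide evenly")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "head_dim": embed_dim // num_heads,       # 768 / 12
        "ffn_dim": ffn_ratio * embed_dim,         # 4 x embedding width
        "num_patches": num_patches,               # 14 x 14 grid at 224 px
        "tokens_per_image": num_patches + 1,      # patches plus one CLS token
    }


# Values taken from Table 1; image_size=224 is assumed, not stated in the table.
dims = vit_derived_dims(image_size=224, patch_size=16, embed_dim=768, num_heads=12)
print(dims)
```

Under these assumptions, the computed values agree with the Head Dim., FFN Dim., and Tokens/img rows of the table.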

Table 2. Performance of the proposed model on the RSICD and UCM-Captions datasets.

Dataset	BLEU-4	METEOR	ROUGE-L	CIDEr	Weighted Score
RSICD	0.706	0.487	0.829	5.140	1.790
UCM	0.751	0.494	0.859	3.380	1.371
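
The Weighted Score column in Table 2 is numerically consistent with an equal-weight average of the four metrics in the same row. The sketch below reproduces that arithmetic. The equal weighting is inferred from the tabulated values and is an assumption, not a definition quoted from the paper; the function name `equal_weight_score` is hypothetical.

```python
def equal_weight_score(bleu4: float, meteor: float, rouge_l: float, cider: float) -> float:
    """Arithmetic mean of the four caption metrics (assumed equal weights)."""
    metrics = (bleu4, meteor, rouge_l, cider)
    return sum(metrics) / len(metrics)


# (BLEU-4, METEOR, ROUGE-L, CIDEr) rows copied from Table 2.
table2 = {
    "RSICD": (0.706, 0.487, 0.829, 5.140),
    "UCM": (0.751, 0.494, 0.859, 3.380),
}

scores = {name: equal_weight_score(*row) for name, row in table2.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Because CIDEr is reported on a larger numeric scale than the other three metrics, it dominates an unnormalized average of this kind, which is worth keeping in mind when comparing weighted scores across datasets.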

Table 3. Performance comparison of different methods on RSICD and UCM-Captions Datasets.

Methods	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
UCM Captions
VGG16 + LSTM	0.875	0.819	0.729	0.730	0.455	0.815	3.75
ResNet50 + Transformer + LSTM	0.923	0.852	0.795	0.744	0.429	0.854	3.32
DINOv3 + Transformer + LSTM	0.930	0.861	0.804	0.751	0.494	0.859	3.38
RSICD Captions
VGG16 + LSTM	0.692	0.549	0.496	0.420	0.275	0.466	1.11
ResNet50 + Transformer + LSTM	0.702	0.623	0.574	0.541	0.402	0.627	3.49
DINOv3 + Transformer + LSTM	0.881	0.807	0.750	0.706	0.487	0.829	5.14
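
To make the encoder comparison in Table 3 concrete, the following sketch computes the absolute and relative differences between the DINOv3 and ResNet50 rows (both with the Transformer + LSTM decoder) on RSICD. The numbers are copied from the table; the helper name `metric_gains` is hypothetical.

```python
METRICS = ("BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr")

# RSICD rows from Table 3.
resnet50_rsicd = (0.702, 0.623, 0.574, 0.541, 0.402, 0.627, 3.49)
dinov3_rsicd = (0.881, 0.807, 0.750, 0.706, 0.487, 0.829, 5.14)


def metric_gains(baseline, proposed):
    """Return {metric: (absolute gain, relative gain in percent)}."""
    return {
        name: (p - b, 100.0 * (p - b) / b)
        for name, b, p in zip(METRICS, baseline, proposed)
    }


gains = metric_gains(resnet50_rsicd, dinov3_rsicd)
for name, (abs_gain, rel_gain) in gains.items():
    print(f"{name:8s} {abs_gain:+.3f} ({rel_gain:+.1f}%)")
```

Reporting both forms is useful because a relative gain on a small baseline value (for example METEOR) can look large even when the absolute difference is modest.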

Table 4. Performance comparison on DisasterM3, GeoChat, and RSITMD datasets using standard captioning evaluation metrics.

Dataset	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
DisasterM3	0.89	0.83	0.83	0.91	0.63	0.96	5.97
GeoChat	0.78	0.62	0.47	0.84	0.53	0.89	4.33
RSITMD	0.89	0.85	0.81	0.87	0.60	0.92	5.85

Table 5. Benchmarking the proposed model: A performance comparison with recent models on the RSICD dataset.

Ref.	Method Used	BLEU-1	BLEU-2	BLEU-3	BLEU-4	CIDEr	ROUGE-L
Cheng et al. [43]	KE-RSIC	0.791	0.679	0.591	0.517	2.832	0.691
Zhao et al. [44]	DFEN	0.766	0.636	0.538	0.463	2.605	0.685
Das et al. [45]	TextGCN + LSTM	0.651	0.482	0.375	0.308	0.82	0.480
Zhao et al. [4]	Enhanced Transformer	0.803	0.7026	0.617	0.544	3.01	0.702
Li et al. [7]	Patch level Regional aware module	0.802	0.692	0.601	0.525	2.94	0.691
Yang et al. [46]	BITA	0.773	0.665	0.576	0.503	3.04	0.717
Zhan et al. [47]	FRIC + IRIC	0.867	0.766	0.673	0.599	0.83	0.353
Proposed Model	DINOv3 + Transformer + LSTM	0.881	0.807	0.750	0.706	5.14	0.829

Table 6. Benchmarking the proposed model: A performance comparison with recent models on the UCM-Captions dataset.

Ref.	Method Used	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
Cheng et al. [43]	KE-RSIC	0.899	0.829	0.786	0.738	0.495	0.843	3.766
Zhao et al. [44]	DFEN	0.851	0.784	0.728	0.677	0.459	0.805	3.177
Das et al. [45]	TextGCN + LSTM	0.846	0.784	0.739	0.693	0.487	0.807	3.408
Zhao et al. [4]	PCS FTr	0.904	0.862	0.823	0.784	0.499	0.860	3.98
Li et al. [7]	Patch level Regional aware module	0.855	0.801	0.756	0.716	0.475	0.815	3.695
Yang et al. [46]	BITA	0.888	0.831	0.773	0.718	0.468	0.837	3.84
Zhan et al. [47]	FRIC + IRIC	0.907	0.856	0.815	0.784	0.462	0.794	2.36
Proposed Model	DINOv3 + Transformer + LSTM	0.930	0.861	0.804	0.751	0.494	0.859	3.38

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mehmood, M.; Shahzad, A.; Hussain, F.; Caceres-Najarro, L.A.; Usman, M. Remote Sensing Image Captioning via Self-Supervised DINOv3 and Transformer Fusion. Remote Sens. 2026, 18, 846. https://doi.org/10.3390/rs18060846

