Review

Transformers in Remote Sensing: A Survey

1
Computer Vision Faculty, Mohamed bin Zayed University of Artificial Intelligence, Building 1B, Masdar City, Abu Dhabi P.O. Box 5224, United Arab Emirates
2
School of Computer Science, Wuhan University, Wuchang District, Wuhan 430072, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2023, 15(7), 1860; https://doi.org/10.3390/rs15071860
Submission received: 7 February 2023 / Revised: 17 March 2023 / Accepted: 20 March 2023 / Published: 30 March 2023
(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Classification II)

Abstract

Deep learning-based algorithms have gained massive popularity in different areas of remote sensing image analysis over the past decade. Recently, transformer-based architectures, originally introduced in natural language processing, have pervaded the computer vision field, where the self-attention mechanism has been utilized as a replacement for the popular convolution operator to capture long-range dependencies. Inspired by recent advances in computer vision, the remote sensing community has also witnessed an increased exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision in general, to the best of our knowledge we are the first to present a systematic review of recent advances based on transformers in remote sensing. Our survey covers more than 60 recent transformer-based methods for different remote sensing problems in sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI) and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing different challenges and open issues of transformers in remote sensing.

1. Introduction

Remote sensing imaging technology has significantly advanced in the last decades. Modern airborne sensors provide a large coverage of the Earth’s surface with improved spatial, spectral and temporal resolutions, thereby playing a crucial role in numerous research areas, including ecology, environmental science, soil science, water contamination, glaciology, land surveying and analysis of the crust of the Earth. Automatic analysis of remote sensing imagery brings unique challenges: the data are generally multi-modal (e.g., optical or synthetic aperture radar sensors), located in the geographical space (geo-located) and typically on a global scale with ever-growing data volumes.
Deep learning, especially convolutional neural networks (CNNs), has dominated many areas of computer vision, including object recognition, detection and segmentation. These networks typically take an RGB image as an input and perform a series of convolution, local normalization and pooling operations. CNNs typically rely on a large amount of training data, and the resulting pre-trained models are then utilized as generic feature extractors for a variety of downstream applications. The success of deep learning-based techniques in computer vision has also inspired the remote sensing community with significant advances being made in many remote sensing tasks, including hyperspectral image classification, change detection and very high-resolution satellite instance segmentation.
One of the main building blocks in CNNs is the convolution operation, which captures local interactions between elements (e.g., contour and edge information) in the input image. CNNs encode biases, such as spatial connectivity and translation equivariance. These characteristics aid in constructing generalizable and efficient architectures. However, the local receptive field in CNNs limits modeling long-range dependencies in an image (e.g., distant part relationships). Moreover, convolutions are content-independent, as the convolutional filter weights are stationary, with the same weights applied to all inputs regardless of their nature. Recently, vision transformers (ViTs) [1] have demonstrated impressive performance across a variety of tasks in computer vision. ViTs are based on the self-attention mechanism that effectively captures global interactions by learning the relationships between the elements of a sequence. Recent works [2,3] have shown that ViTs possess content-dependent long-range interaction modeling capabilities and can flexibly adjust their receptive fields to counter nuisances in data and learn effective feature representations. As a result, ViTs and their variants have been successfully utilized for many computer vision tasks, including classification, detection and segmentation.
Following the success of ViTs in computer vision, the remote sensing community has also witnessed significant growth (see Figure 1) in the employment of transformer-based frameworks in many tasks, such as very high-resolution image classification, change detection, pan sharpening, building detection and image captioning. This has started a new wave of promising research in remote sensing with different approaches utilizing either ImageNet pre-training [4,5,6] or performing remote sensing pre-training [7] with vision transformers. Similarly, there exist approaches in the literature that are based on a pure transformer design [8,9] or utilize a hybrid approach [10,11,12] based on both transformers and CNNs. It is, therefore, becoming increasingly challenging to keep pace with the recent progress due to the rapid influx of transformer-based methods for different remote sensing problems. In this work, we review these advances and present an account of recent transformer-based approaches in the popular field of remote sensing. To summarize, our main contributions are the following:
  • We present a holistic overview of applications of transformer-based models in remote sensing imaging. To the best of our knowledge, we are the first to present a survey on transformers in remote sensing, thereby bridging the gap between recent advances in computer vision and remote sensing in this rapidly growing and popular area.
  • We present an overview of both CNNs and transformers, discussing their respective strengths and weaknesses.
  • We present a review of more than 60 transformer-based research works in the literature to discuss the recent progress in the field of remote sensing.
  • Based on the presented review, we discuss different challenges and research directions on transformers in remote sensing.
The rest of the paper is organized as follows: Section 2 discusses other related surveys on remote sensing imaging. In Section 3, we present an overview of different imaging modalities in remote sensing, whereas Section 4 provides a brief overview of CNNs and vision transformers. Afterwards, we review advances with respect to transformer-based approaches in very high-resolution (VHR) imaging (Section 5), hyperspectral image analysis (Section 6) and synthetic aperture radar (SAR) in Section 7. In Section 8, we conclude our survey and discuss potential future research directions.

2. Related Work

In the literature, several works have performed a review of machine learning techniques for remote sensing imaging in the past decade. Tuia et al. [13] compare and evaluate different active learning algorithms for the supervised remote sensing image classification task. The work of [14] focuses on the problem of hyperspectral image classification and reviews recent advances in relation to machine learning and vision techniques. Zhu et al. [15] present a comprehensive review of utilizing deep learning techniques for remote sensing image analysis. Their work provides a comprehensive review of the existing approaches along with describing a list of resources about deep learning in remote sensing. Ma et al. [16] review major deep learning concepts in remote sensing with respect to image resolution and study area. To this end, their work studies different remote sensing tasks, such as image registration, fusion, scene classification and object segmentation.
Recently, transformer-based approaches have witnessed a significant surge within the computer vision community, following the breakthrough from transformer-based models [17] in natural language processing (NLP). Khan et al. [18] present an overview of the transformer models in vision with emphasis on recognition, generative modeling, multi-modal, video processing and low-level vision tasks. Shamshad et al. [19] survey the use of transformer models in medical imaging, focusing on different medical imaging tasks, such as segmentation, detection, reconstruction, registration and clinical medical report generation. The work of [20] presents an overview of the growing trend of using transformers to model video data. Their work also compares the performance of vision transformers on different video tasks, such as action recognition.
Different from the aforementioned surveys, our work presents a review of recent advances of transformer-based approaches in the popular area of remote sensing. To the best of our knowledge, this is the first survey presenting a comprehensive account of transformers in remote sensing, particularly dedicated to progress in very high-resolution, hyperspectral and synthetic aperture radar image analysis.

3. Remote Sensing Imaging Data

Remote sensing imagery is generally acquired from a range of sources, using a range of data collection techniques. Remote sensing image data can typically be characterized by their spatial, spectral, radiometric and temporal resolutions. Spatial resolution refers to the size of each pixel within an image along with the area of the surface of the Earth represented by that pixel. Spatial resolution characterizes the small and fine-detailed features in an imaging scene that can be separated. Spectral resolution refers to the capability of the sensor to collect information about the scene by discerning finer wavelengths with narrower bands (e.g., 10 nm). On the other hand, radiometric resolution characterizes the extent of the information in each pixel, where a larger dynamic range for a sensor implies that more details can be discerned in the image. The temporal resolution refers to the time between consecutive images of the same location on the ground acquired by the sensor. Here, we briefly discuss commonly utilized remote sensing imaging types with examples shown in Figure 2.
Very High-resolution Imagery: In recent years, the emergence of very high-resolution (VHR) satellite sensors has paved the way towards higher spatial resolution imagery, which is beneficial for land use change detection, object-based image analysis (object detection and instance segmentation), precision agriculture farming (e.g., management of crops, soil and pests) and emergency responses. Furthermore, these recent advances in sensor technology, along with new deep learning-based techniques, allow the use of VHR remote sensing imagery to analyze the biophysical and biogeochemical processes both in coastal and inland waters. Nowadays, optical sensors produce panchromatic and multispectral imagery of the Earth’s surface at much finer spatial resolutions (e.g., 10 to 100 cm/pixel).
Hyperspectral Imagery: Here, each pixel in the scene is captured using a continuous spectrum of light with fine wavelength resolutions. The continuous spectrum extends wavelengths beyond the visible spectrum and includes wavelengths from ultraviolet (UV) to infrared (IR). Generally, the spectral resolution of hyperspectral images is expressed using the wave number along with the nanometers (nm). The most popular continuous spectrum used for measuring the pixels is mid-infrared, which is near infrared and visible wavelength bands. In order to acquire hyperspectral imagery, there are different electromagnetic measurements, such as Raman spectroscopy, X-ray spectroscopy, Terahertz spectroscopy, 3D ultrasonic imaging, magnetic resonance and confocal laser microscopy scanners, that can measure the entire emission spectrum for each pixel at a specific excitation wavelength. Hyperspectral images have high dimensionality and strong resolving power for fine spectra. The imagery offers a wide range of applications, including in environmental science [21] and mining [22]. Different from regular images that contain only the primary colors (red, green and blue) within the visible spectrum, hyperspectral images are rich in spectral information that can reflect the physical structure and chemical composition of the item of interest. In remote sensing, automatically analyzing hyperspectral imagery is an active research topic.
Synthetic Aperture Radar Imagery: A large number of synthetic aperture radar (SAR) images are produced by Earth observation satellites every day through emission and reception of electromagnetic signals. In the past decades, SAR images have gained popularity due to their higher spatial resolution, all-weather capability and de-speckling tools, such as CAESAR, along with recent advances in SAR-specific image processing. SAR imagery can be used for numerous applications, including geographical localization, object detection, functionalities of basic radars and geophysical feature estimation of complex settings, such as roughness, moisture content and density. Furthermore, SAR imagery can be used for disaster management (oil slick detection and ice tracking), forestry and hydrology.

4. From CNNs to Vision Transformers

In this section, we first present a brief overview of CNNs and then provide a brief description of vision transformers recently utilized for different vision tasks.

4.1. Convolutional Neural Networks

Convolutional neural networks (CNNs) have dominated a variety of computer vision tasks, including image classification [23] and object detection [24]. CNNs are typically made up of a series of two main parts: convolutional and pooling layers. The convolutional layer produces feature maps by convolving the local region in the input with a set of kernels. These features are subjected to a non-linear function, with the same process repeated for each convolutional layer. In CNNs, the pooling layer carries out a downsampling operation (typically utilizing the max or mean operation) on the feature maps. In different existing CNN architectures, the convolutional and pooling layers are followed by a set of fully connected layers, where the last fully connected layer feeds a softmax that computes the score of each object category.
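The convolution, non-linearity and pooling steps described above can be illustrated with a minimal pure-Python sketch. The function names, the toy image and the hand-set vertical-edge kernel are assumptions chosen for illustration; in a real CNN the kernels are learned and the operations run over many channels.

```python
def conv2d_valid(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise non-linearity applied to the feature map."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2, stride=2):
    """Downsample by taking the maximum over size x size windows."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5 x 5 image with a vertical step edge (dark left, bright right).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical hand-set kernel that responds to vertical edges.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

conv_out = conv2d_valid(image, vertical_edge)   # 3 x 3 feature map
features = max_pool2d(relu(conv_out))           # 1 x 1 after 2 x 2 pooling
```

Each output value depends only on a 3 × 3 neighborhood of the input, which is the local receptive field referred to in the text.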
Popular CNN Backbones: Here, we briefly discuss different popular CNN backbone architectures in the literature.
AlexNet: Krizhevsky et al. [23] propose a CNN architecture, named AlexNet, for the image classification task. AlexNet comprises five convolutional layers followed by three fully-connected layers. The proposed network architecture utilizes Rectified Linear Units (ReLU) for training efficiency. The network contains 60 million parameters and 650,000 neurons with network training performed on the large-scale ImageNet dataset [25]. Different data augmentation techniques are employed to increase the training set. In the ImageNet 2012 competition, AlexNet achieved a competitive performance with top-1 and top-5 error rates of 39.7% and 18.9%, respectively.
VGGNet: Different from AlexNet, Simonyan and Zisserman [26] introduced an architecture named VGGNet that comprises 16 layers in total. The network takes an input image of 224 × 224 size and has around 138 million parameters. It uses different data augmentation techniques, including scale jittering, during network training. The VGGNet architecture comprises convolution layers of 3 × 3 filter, where the receptive fields are convolved at each pixel with a stride of one pixel. The VGGNet contains multiple pooling layers, performing spatial pooling over 2 × 2 windows with a stride of two pixels. Furthermore, VGGNet contains two fully connected layers followed by a softmax for yielding output predictions. The VGG architecture achieved top classification accuracy on the 2014 ImageNet classification challenge.
ResNet: Different from AlexNet and VGGNet, He et al. [27] introduced residual neural networks (ResNet) that stack residual blocks to build a network. ResNet provides a residual learning approach for training networks that are much deeper than their previously utilised counterparts. Instead of learning unreferenced functions, it explicitly reformulates the layers as learning residual functions with reference to the layer inputs. Extensive empirical evidence demonstrates that residual networks are easier to optimize with improved accuracy from higher depth.
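The residual reformulation can be summarized as y = F(x) + x, where F is the residual function learned by the stacked layers and x is the block input. The following is a minimal sketch on plain vectors, assuming the residual function preserves the input dimension; `residual_block` and the two toy residual functions are hypothetical names used only for illustration.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, i.e. the learned residual added to the block input."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual_fn must preserve the input dimension")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Toy residual function that outputs zeros."""
    return [0.0 for _ in x]


def half_residual(x):
    """Toy residual function that outputs half of the input."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0]
identity_out = residual_block(x, zero_residual)   # block reduces to identity
shifted_out = residual_block(x, half_residual)    # x + 0.5 * x
```

When the residual function outputs zeros, the block reduces to the identity mapping, which gives some intuition for why very deep stacks of such blocks remain comparatively easy to optimize.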
The development of CNN-based architectures has led to the rise of novel techniques, improved hardware (e.g., GPUs and TPUs), better optimization methods and many open-source libraries. Interested readers can go through the survey papers related to CNN methods for remote sensing [15,16]. Previous works have analyzed that CNNs are able to capture image-specific inductive bias, which increases their effectiveness in learning better feature representations. However, CNNs do not capture long-range dependencies that aid enhanced expressivity of the representations. Next, we briefly present vision transformers that are capable of modelling long-range dependencies in the images.

4.2. Vision Transformers

Recently, transformer-based models have achieved promising results across many computer vision and natural language processing (NLP) tasks. Vaswani et al. [17] first introduced transformers as an attention-driven model for machine translation applications. To capture the long-range dependencies, transformers use self-attention layers instead of the traditional recurrent neural network that struggles to encode such dependencies between the elements of a sequence.
To effectively capture the long-range dependencies within an input image, the work of [1] introduces vision transformers (ViTs) for the image recognition task, as shown in Figure 3. ViTs [1] interpret an image as a sequence of patches and process it via a conventional transformer encoder similar to those used in NLP tasks. The success of ViTs on generic visual data has sparked interest not only in different areas of computer vision, but also in the remote sensing community, where a number of ViT-based techniques have been explored in recent years for various tasks.
Next, we briefly describe the key component of self-attention within transformers.
Self-Attention: The self-attention mechanism is an integral component of transformers, as it captures long-range dependencies and encodes the interactions between all tokens (patch embeddings) of the sequence. The key idea of self-attention is to learn self-alignment, that is, to update each token by aggregating global knowledge from all the other tokens in the sequence [28]. Given a 2D image $x \in \mathbb{R}^{H \times W \times C}$, the process starts with flattening the image into a series of 2D patches $x_{pat} \in \mathbb{R}^{M \times (P^2 C)}$, where $C$ represents the number of channels, $H$ and $W$ represent the height and width of the image, respectively, $P \times P$ is the dimension of each individual patch and $M = HW/P^2$ represents the total number of patches. A learnable linear projection layer of dimension $E$ is used to project these flattened patches, and the result can be written as a matrix $X \in \mathbb{R}^{M \times E}$. The aim of self-attention is to capture the interaction among all the $M$ embeddings, which is achieved by introducing three learnable weight matrices that transform the input $X$ into queries (via $W^Q \in \mathbb{R}^{E \times E_q}$), keys (via $W^K \in \mathbb{R}^{E \times E_k}$) and values (via $W^V \in \mathbb{R}^{E \times E_v}$), where $E_q = E_k$. The sequence $X$ is first projected onto these weight matrices to obtain $Q = XW^Q$, $K = XW^K$ and $V = XW^V$. The relative attention matrix $A = \mathrm{softmax}\left(QK^T/\sqrt{E_q}\right) \in \mathbb{R}^{M \times M}$ then weights the values, so that the output of the self-attention layer is
$$Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{E_q}}\right)V$$
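As a worked instance of the patch count, a 224 × 224 image with P = 16 yields M = (224 · 224)/16² = 196 patches. The scaled dot-product attention itself can be sketched in pure Python as follows; the helper names (`num_patches`, `matmul`, `softmax_rows`, `self_attention`) and the toy matrices are assumptions for illustration, and a practical implementation would use an array library and learned weights.

```python
import math


def num_patches(H, W, P):
    """M = HW / P^2, assuming P divides both H and W."""
    if H % P or W % P:
        raise ValueError("patch size must divide the image size")
    return (H // P) * (W // P)


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(X, W_Q, W_K, W_V):
    """Return (Z, A) with A = softmax(Q K^T / sqrt(E_q)) and Z = A V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    E_q = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(E_q) for s in row] for row in matmul(Q, K_T)]
    A = softmax_rows(scores)          # M x M attention matrix
    return matmul(A, V), A            # Z has shape M x E_v


# Toy example: M = 3 tokens with E = 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z, A = self_attention(X, I2, I2, I2)
```

Every row of `A` sums to one, so each output token is a convex combination of all value vectors, which is how a single layer aggregates information from the whole sequence.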
Masked Self-Attention: In the standard self-attention layer, every entity attends to all entities. The self-attention blocks used in the decoder of the transformer model [17], which is trained to predict the next entity in the sequence, are masked to prevent attending to subsequent entities. This is performed by an element-wise multiplication with a mask $\mathbf{M} \in \mathbb{R}^{n \times n}$, where $\mathbf{M}$ is an upper-triangular matrix and $n$ is the sequence length. Here, masked self-attention is represented by
$$\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_q}} \circ \mathbf{M}\right)$$
where $\circ$ represents the Hadamard product and $d_q$ is the query dimension (denoted $E_q$ above). In masked self-attention, the attention scores of future entities are set to zero when predicting an entity in the sequence.
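A minimal sketch of causal masking is given below. Note the swapped-in technique: instead of the Hadamard formulation written above, it normalizes the softmax over the visible (current and past) positions only, which is equivalent to the common practice of setting future scores to negative infinity before the softmax and yields exactly zero attention weight on future entities. The function name is hypothetical, and rows are assumed to index the queries, so the retained weights form a lower-triangular pattern; with the transposed layout the pattern is upper-triangular.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where query i only sees keys 0..i.

    `scores` is an n x n list of scaled dot products (rows = queries).
    Future positions receive an attention weight of exactly zero.
    """
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # current and past keys only
        mx = max(visible)
        exps = [math.exp(v - mx) for v in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# With identical scores, each query spreads its weight uniformly
# over the positions it is allowed to see.
uniform = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```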
Multi-Head Attention: Multi-head attention (MHA) comprises multiple self-attention blocks (heads) whose outputs are concatenated channel-wise in order to capture different complex interactions between the sequence embeddings. Each head of the multi-head self-attention has its own learnable weight matrices, represented as $W^{Q_i}$, $W^{K_i}$ and $W^{V_i}$, where $i = 0, \ldots, (h-1)$ and $h$ denotes the number of heads in multi-head self-attention. Hence, we can express
$$\mathrm{MHA}(Q, K, V) = [Z_0, \ldots, Z_{h-1}]\, W^O$$
where the output of each head is concatenated to form a single matrix $B \in \mathbb{R}^{M \times h \cdot E_v}$, whereas $W^O \in \mathbb{R}^{h \cdot E_v \times E}$ computes the linear transformation of the heads.
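The concatenate-then-project structure of multi-head attention can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the helper names are hypothetical, each head is given as a tuple of its own query, key and value matrices, and the output projection is assumed to map the concatenated heads back to the embedding dimension E, as in the standard transformer formulation of [17].

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(X, W_Q, W_K, W_V):
    """Single head: Z_i = softmax(Q K^T / sqrt(E_q)) V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    scale = math.sqrt(len(Q[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / scale for s in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)


def multi_head_attention(X, heads, W_O):
    """Concatenate the h head outputs channel-wise, then project with W_O."""
    outputs = [attention_head(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
    # B has shape M x (h * E_v): token t gets [Z_0[t], ..., Z_{h-1}[t]].
    B = [[v for Z in outputs for v in Z[t]] for t in range(len(X))]
    return matmul(B, W_O)


# Toy example: M = 2 tokens, E = 2, h = 2 heads with E_q = E_k = E_v = 1.
X = [[1.0, 2.0], [3.0, 4.0]]
zero_col = [[0.0], [0.0]]                  # zero queries/keys: uniform attention
head_0 = (zero_col, zero_col, [[1.0], [0.0]])   # values = first channel
head_1 = (zero_col, zero_col, [[0.0], [1.0]])   # values = second channel
W_O = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head_attention(X, [head_0, head_1], W_O)
```

In this toy setting each head averages one input channel over all tokens, which makes it easy to check by hand that the heads operate independently before being mixed by the output projection.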
Popular Transformers Backbones: Here, we briefly discuss some recent transformer-based backbones.
ViT: The work of [1] introduces an architecture, where a pure transformer is applied directly to a sequence of image patches for the task of image classification. The ViT architecture design does not employ image-specific inductive biases (e.g., translation equivariance and locality), and the pre-training is performed on the large-scale ImageNet-21k or JFT-300M datasets.
Swin: Liu et al. [29] improved the ViT design by introducing an architecture that produces hierarchical feature representation. The Swin transformer has linear computational complexity with respect to input image size, where the efficiency is achieved by restricting the self-attention computation to non-overlapping local windows while enabling cross-window connection.
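The linear-complexity claim can be made concrete with the cost estimates reported in the Swin paper [29] for a feature map of h × w patches with C channels: global multi-head self-attention costs about 4hwC² + 2(hw)²C operations, whereas window-based self-attention with windows of size m × m costs about 4hwC² + 2m²hwC (softmax cost omitted). The sketch below simply evaluates these two expressions; the function names are hypothetical, the window size is called `window` here to avoid a clash with the patch count M used earlier, and the example values (56 × 56 patches, 96 channels, window size 7) are assumptions corresponding to a typical first stage.

```python
def global_msa_cost(h, w, C):
    """Approximate cost of global self-attention: quadratic in h * w."""
    return 4 * h * w * C ** 2 + 2 * (h * w) ** 2 * C


def window_msa_cost(h, w, C, window=7):
    """Approximate cost of window attention: linear in h * w for fixed window."""
    return 4 * h * w * C ** 2 + 2 * window ** 2 * h * w * C


small_global = global_msa_cost(56, 56, 96)
small_window = window_msa_cost(56, 56, 96)
# Doubling the image height doubles the number of patches.
large_global = global_msa_cost(112, 56, 96)
large_window = window_msa_cost(112, 56, 96)
```

Doubling the number of patches exactly doubles the window-attention cost, while the attention term of the global variant quadruples, which is why the windowed design scales better to large remote sensing images.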
PVT: The work of [30] introduces a pyramid vision transformer (PVT) architecture to perform pixel-level dense prediction tasks. The PVT architecture utilizes a progressively shrinking pyramid and a spatial-reduction attention layer for producing high-resolution multi-scale feature maps. The PVT backbone has been shown to achieve impressive performance on object detection and segmentation tasks compared to its CNN counterpart with a similar number of parameters.
Transformers offer unique characteristics that are useful for different vision tasks. Compared to the convolution operation in CNNs, where static filters are computed, filters in self-attention are dynamically calculated. Furthermore, permutations and changes in the number of input points have little effect on self-attention. Recent studies [2,3] have explored different interesting properties of vision transformers and compared them with CNNs. For instance, the recent work of [2] shows that vision transformers are more robust to severe occlusions, domain shifts and perturbations. Next, we present a review of transformers in remote sensing based on the taxonomy shown in Figure 4.

5. Transformers in VHR Imagery

Here, we review transformer-based approaches utilized to address different problems in very high-resolution (VHR) imagery.

5.1. Scene Classification

Remote sensing scene classification is a challenging problem, where the task is to automatically associate a semantic category label to a given high-resolution image comprising ground objects and different land cover types. Among the existing vision transformer-based VHR scene classification approaches, Bazi et al. [4] explore the impact of the standard vision transformer architecture of [1] (ViT) and investigate different data augmentation strategies for generating additional data. In addition, their work also evaluates the impact of compressing the network by pruning the layers while maintaining the classification accuracy. The work of [31] introduces a joint CNN-transformer framework, where there is one CNN stream and another ViT stream, as shown in Figure 5. The features from the two streams are concatenated and the entire framework is trained using a joint loss function, comprising cross-entropy and center losses, to optimize the two-stream architecture. Zhang et al. [32] introduce a framework, called Remote Sensing Transformer (TRS), that strives to combine the merits of CNNs and transformers by replacing the spatial convolutions with multi-head self-attention. The resulting multi-head self-attention bottleneck has fewer parameters and is shown to be effective compared to other bottlenecks. The work of [5] introduces a two-stream Swin transformer network (TSTNet) that comprises two streams: original and edge. The original stream extracts standard image features, whereas the edge stream contains a differentiable edge Sobel operator module and provides edge information. Further, a weighted feature fusion module is introduced to effectively fuse the features from the two streams for boosting the classification performance. The work of [6] introduces a transformer-based framework with a patch generation module designed to generate homogeneous and heterogeneous patches. The patch generation module generates the heterogeneous patches directly, whereas the homogeneous patches are obtained using a superpixel segmentation method.
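To give an idea of the kind of edge information a Sobel operator provides, the following sketch applies the classical 3 × 3 Sobel kernels to a toy grayscale image and returns the gradient magnitude. It is not the edge module of TSTNet [5], whose details are not reproduced here; the function names and the toy image are assumptions for illustration only.

```python
import math

# Classical Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel, i, j):
    """Response of a 3 x 3 kernel with its top-left corner at (i, j)."""
    return sum(image[i + a][j + b] * kernel[a][b]
               for a in range(3) for b in range(3))


def sobel_edge_map(image):
    """Gradient magnitude sqrt(gx^2 + gy^2) at every valid position."""
    rows, cols = len(image) - 2, len(image[0]) - 2
    return [
        [
            math.hypot(filter3x3(image, SOBEL_X, i, j),
                       filter3x3(image, SOBEL_Y, i, j))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Toy 4 x 5 image with a vertical boundary between dark and bright regions.
image = [[0, 0, 1, 1, 1] for _ in range(4)]
edges = sobel_edge_map(image)
```

The response is large where the 3 × 3 window straddles the boundary and zero inside the uniform bright region, which is the type of cue an edge stream can contribute alongside standard image features.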
Remote Sensing Pre-training: Different from the aforementioned approaches that either use only transformers or hybrid CNN-transformer designs with backbone networks pretrained on ImageNet datasets, the recent work of [7] investigates training vision transformer backbones, such as Swin, from scratch on the large-scale Million-AID remote sensing dataset [33]. The resulting trained backbone models are then fine-tuned for different tasks, including scene classification. Figure 6 shows the response maps, obtained using Grad-CAM++ [34], of different ImageNet (IMP) and remote sensing pre-trained (RSP) models. It can be observed that RSP models learn better semantic representations by paying more attention to the important targets compared to their IMP counterparts. Furthermore, the transformer-based backbones, such as Swin-T, better capture the contextual information due to the self-attention mechanism. Moreover, backbones, such as ViTAEv2-S, that combine the merits of CNNs and transformers along with RSP can achieve better recognition performance.
Table 1 shows a comparison of the aforementioned classification approaches on one of the most commonly used VHR classification benchmarks: AID [35]. The AID dataset contains images acquired from multi-source sensors. The dataset possesses a high degree of intra-class variation since the images are collected from different countries at different times and seasons with variable imaging conditions. There are in total 10,000 images in the dataset and 30 categories. The performance is measured in terms of mean classification accuracy over all the categories. For more details on AID, we refer to [35]. Other than RSP that performs an initial pre-training on the Million-AID dataset, all approaches here utilize models pre-trained on the ImageNet benchmark.

5.2. Object Detection

Localizing objects in VHR imaging is a challenging problem due to extreme scale variations and the diversity of different object classes. Here, the task is to simultaneously recognize and localize (with either rectangular or oriented bounding boxes) all instances belonging to different object categories in an image. Most existing approaches employ a hybrid strategy by combining the merits of CNNs and transformers within existing two-stage and single-stage detectors. Other than the hybrid strategy, a few recent works explore the DETR-based transformer object detection paradigm [36].
Hybrid CNN-Transformers based Methods: The work of [37] introduces a local perception Swin transformer (LPSW) backbone to improve the standard transformers for detecting small-sized objects in VHR imagery. The proposed LPSW strives to combine the merits of transformers and CNNs to improve the local perception capabilities for better detection performance. The proposed approach is evaluated with different detectors, such as Mask RCNN [38]. The work of [39] introduces a transformer-based detection architecture, where a pre-trained CNN is used to extract features and a transformer is adapted to process a feature pyramid of a remote sensing image. Zhang et al. [40] introduce a detection framework where an efficient transformer is utilized as a branch network to improve CNN’s ability to encode global features. Additionally, a generative model is employed to expand the input remote sensing aerial images ahead of the backbone network. The work of [41] proposes a detection framework based on RetinaNet, where a feature pyramid transformer (FPT) is utilized between the backbone network and the post-processing network to generate semantically meaningful features. The FPT enables the interaction among features at different levels across the scale. The work of [42] introduces a framework where transformers are adopted to model the relationship of sampled features in order to group them appropriately. Consequently, better grouping and bounding box predictions are obtained without any post-processing operations. The proposed approach effectively eliminates the background information, which helps in achieving improved detection performance.
Zhang et al. [43] introduce a hybrid architecture that combines the local characteristics of depth separable convolutions with the global (channel) characteristics of MLP. The work of [44] introduces a two-stage angle-free detector, where both the RPN and regression are angle-free. Their work also evaluates the proposed detector with a transformer-based backbone (Swin-Tiny). Liu et al. [45] propose a hybrid network architecture, called TransConvNet, that aims at combining the advantages of CNNs and transformers by aggregating both global and local information to address the rotation invariability of CNNs with a better contextual attention. Furthermore, an adaptive feature fusion network is designed to capture information from multiple resolutions. The work of [46] introduces a detection framework, called Oriented Rep-Points, that utilizes flexible adaptive points as a representation. The proposed anchor-free approach learns to select the point samples from classification, localization and orientation. Specifically, to learn geometric features for arbitrarily-oriented aerial objects, a quality assessment and sample assignment scheme is introduced that measures and identifies high-quality sample points for training, as shown in Figure 7. Furthermore, their approach utilizes a spatial constraint for penalizing the sample points that are outside the oriented box for robust learning of the points.
DETR-based Detection Methods: A few recent approaches have investigated adapting the transformer-based DETR detection framework [36] for oriented object detection in VHR imaging. The work of [47] adapts the standard DETR for oriented object detection. In their approach, an efficient encoder is designed for transformers by replacing the standard attention mechanism with a depthwise separable convolution. Dai et al. [48] propose a transformer-based detector, called AO2-DETR, where an oriented proposal generation scheme is employed to explicitly produce oriented object proposals. Furthermore, their approach comprises an adaptive oriented proposal refinement module that is designed to compute rotation-invariant features by eliminating the misalignment between region features and objects. In addition, a rotation-aware matching loss is utilized to perform a matching process for direct set prediction without duplicated predictions.
Table 2 shows a comparison of the aforementioned detection approaches on the most commonly used VHR detection benchmark, DOTA [49]. The dataset comprises 2806 large aerial images of 15 different object categories: plane, baseball diamond, basketball court, soccer-ball field, bridge, ground track field, small vehicle, ship, large vehicle, tennis court, roundabout, swimming pool, harbor, storage tank and helicopter. The detection performance accuracy is measured in terms of mean average precision (mAP). For more details on DOTA, we refer to [49]. The results show that most of these recent methods obtain similar detection accuracy with a slight improvement in performance obtained when using the Swin-T backbone.

5.3. Image Change Detection

In remote sensing, image change detection is an important task for detecting changes on the surface of the Earth with numerous applications in agriculture [50,51], urban planning [52] and map revision [53]. Here, the task is to generate change maps by comparing multi-temporal or bi-temporal images, with each pixel in the resulting binary change map having a value of either zero or one depending on whether the corresponding position has changed or not. Among the recent transformer-based change detection approaches, Chen et al. [54] propose a bi-temporal image transformer encapsulated in a deep feature differencing-based framework that is designed to model the spatio-temporal contextual information. Within the proposed framework, the encoder is employed to capture context in token-based space-time. The resulting contextualized tokens are then fed to the decoder where the features are refined in the pixel-space. Guo et al. [55] propose a deep multi-scale Siamese architecture, called MSPSNet, that utilizes a parallel convolutional structure (PCS) and self-attention. The proposed MSPSNet performs feature integration of different temporal images via PCS and then feature refinement based on self-attention to further enhance the multi-scale features. The work of [56] introduces a Swin transformer-based network with a Siamese U-shaped structure, called SwinSUNet, for change detection. The proposed SwinSUNet comprises three modules: encoder, fusion and decoder. The encoder transforms the input image into tokens and produces multi-scale features by employing a hierarchical Swin transformer. The resulting features are concatenated in the fusion module, which comprises linear projection and Swin transformer blocks. The decoder contains upsampling and merging within Swin transformer blocks to progressively generate change predictions.
Wang et al. [57] introduce an architecture, called UVACD, that combines CNNs and transformers for change detection. Within UVACD, the high-level semantic features are extracted via a CNN backbone, whereas transformers are utilized to generate better change features by capturing the temporal information interaction. The work of [58] introduces a hybrid architecture, TransUNetCD, that strives to combine the merits of transformers and UNet. Here, the encoder takes features extracted from CNNs and enriches them with global contextual information. The corresponding features are then upsampled and combined with multi-scale features to obtain global-local features for localization. The work of [59] introduces a hybrid multi-scale transformer, called Hybrid-TransCD, that captures both fine-grained and large object features by utilizing heterogeneous tokens via multiple receptive fields.
Table 3 shows a comparison of the aforementioned change detection approaches on the most commonly used benchmarks: WHU [60] and LEVIR [61]. The WHU dataset comprises a single pair of high-resolution (0.075 m) images. Here, the images are of size 32,507 × 15,354. The LEVIR dataset comprises 637 pairs of high-resolution (0.5 m) images. The images are of size 1024 × 1024. The performance is measured in terms of the F1 score with respect to the change category. Figure 8 presents a qualitative comparison of different methods with SwinSUNet on example images from the WHU-CD dataset.
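As an illustration of the evaluation metric, the following minimal Python sketch computes the F1 score of the change category from a predicted and a reference binary change map. The function name `change_f1` and the toy maps are assumptions made for this example; benchmark evaluation scripts may aggregate counts over a whole test set rather than per image.

```python
def change_f1(pred, target):
    """F1 score of the 'changed' class (label 1) for two binary maps
    given as lists of rows with values 0 or 1."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1      # changed pixel correctly detected
            elif p == 1 and t == 0:
                fp += 1      # false alarm
            elif p == 0 and t == 1:
                fn += 1      # missed change
    if tp == 0:
        return 0.0           # no correct detections: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy 2 x 3 change maps (hypothetical values).
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(change_f1(pred, target))
```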

5.4. Image Segmentation

In remote sensing, automatically segmenting an image into semantic categories by performing pixel-level classification is a challenging problem with a wide range of applications, including geological surveys, urban resources management, disaster management and monitoring. Most existing transformer-based remote sensing image segmentation approaches typically employ a hybrid design with an aim to combine the merits of CNNs and transformers. The work of [65] introduces a light-weight transformer-based framework, Efficient-T, that comprises an implicit edge enhancement technique. The proposed Efficient-T employs hierarchical Swin transformers along with the MLP head. A coupled CNN-transformer framework, called CCTNet, is introduced in [66], which is aimed at combining the local details, such as edges and texture, captured by the CNNs along with the global contextual information obtained via transformers for crop segmentation in remote sensing images. Furthermore, different modules, such as test time augmentation and post-processing steps, are introduced in order to remove holes and small objects at the inference for restoring the complete segmented images. A CNN-transformer framework, named STransFuse, is introduced in [67], where both coarse-grained and fine-grained feature representations at multiple scales are extracted and later combined adaptively by utilizing a self-attentive mechanism. The work of [68] proposes a hybrid architecture, where the Swin transformer backbone that captures long-range dependencies is combined with a U-shaped decoder, which employs an atrous spatial pyramid pooling block based on depth-wise separable convolution along with an SE block to better preserve local details in an image. The work of [69] utilizes a pre-trained Swin Transformer backbone along with three decoder designs, namely U-Net, feature pyramid network and pyramid scene parsing network, for semantic segmentation in aerial images.
We present in Table 4 a quantitative comparison of the aforementioned approaches on the two most commonly used semantic segmentation datasets: Potsdam [70] and Vaihingen [71]. The Potsdam dataset comprises 38 patches, where each patch has a resolution of 6000 × 6000 pixels, collected over the city of Potsdam with a ground sampling distance of 5 cm. The dataset has six categories. The Vaihingen dataset comprises 33 samples, where each sample has a resolution from 1996 × 1995 to 3816 × 2550 pixels. Here, the ground sampling distance is 9 cm. This dataset contains the same categories as Potsdam. The performance is measured in terms of overall accuracy (OA) computed using true positives, false positives, false negatives and true negatives. Figure 9 presents a qualitative comparison between Trans-CNN and other approaches on the Potsdam dataset.
Building Extraction: Transformer-based techniques have also been recently explored for the problem of building extraction, where the task is to automatically identify building and non-building pixels in a remote sensing image. A dual-pathway transformer framework is introduced in [72] that strives to learn long-range dependencies both in spatial and channel directions. The work of [73] proposes a transformer framework, STEB-UNet, comprising a Swin transformer-based encoding booster that captures semantic information from multi-level features generated from different scales. The encoder booster is further integrated in a U-shaped network design that fuses local and large-scale semantic features. A transformer-based architecture, called BuildFormer, comprising a window-based linear attention, a convolutional MLP and a batch normalization, is introduced in [74]. The work of [75] explores the problem of generalizability of building extraction models to different areas and proposes a transfer learning approach to fine-tune models from one area to a subset of another unseen area.
Other than semantic image segmentation and building extraction with transformers, a recent study by [37] explores the problem of instance segmentation, where the task is to automatically classify each pixel into an object class within an image while also differentiating multiple object instances. Their approach aims at combining the advantages of CNNs and transformers by designing a local perception Swin transformer backbone to enhance both local and global feature information.

5.5. Others

Apart from the problems discussed above, transformer-based techniques are also explored for other VHR remote sensing tasks, such as image captioning and super-resolution (Table 5).
Image Captioning: Image captioning in remote sensing images is a challenging problem, where the task is to generate a semantically natural description of a given image. A few recent works have explored using transformers for image captioning. The work of [97] introduces a framework where standard transformers are adapted for remote sensing image caption generation by integrating residual connections, dropout layers and fusing features adaptively. Moreover, a reinforcement learning technique is utilized to further improve the caption generation process. An encoder–decoder architecture is introduced in [98], where the multi-scale features are first extracted from different layers of CNNs in the encoder and then a multi-layer aggregated transformer is utilized in the decoder to effectively exploit the multi-scale features for generating sentences. The work of [99] introduces a topic token-based mask transformer framework, where a topic token is integrated into the encoder and serves as a prior in the decoder for capturing improved global semantic relationships.
Image Super Resolution: Remote sensing image super-resolution is the task of recovering high-resolution images from their low-resolution counterparts. A few recent works have explored transformers for this task. A transformer-based multi-stage enhancement structure is introduced in [100] that leverages features from different stages. The proposed multi-stage structure can be combined with conventional super-resolution techniques in order to fuse multi-resolution low- and high-dimension features. Ref. [101] proposes a CNN-transformer hybrid architecture to integrate both local and global feature information for super-resolution. The work of [102] explores the problem of multi-image super-resolution, where the task is to merge multiple low-resolution remote sensing images of the same scene into a high-resolution one. Here, a transformer-based approach is introduced comprising an encoder having residual blocks, a fusion module and a super-pixel convolution-based decoder.
To summarize the review of transformers in VHR imagery, we present a holistic overview of different techniques in the literature in Table 6.

6. Transformers in Hyperspectral Imaging

As discussed earlier, hyperspectral images are represented by several spectral bands, and analyzing hyperspectral data is crucial in a wide range of problems. Here, we present a review of recent transformer-based approaches for different hyperspectral imaging (HSI) tasks.

6.1. Image Classification

Here, the task is to automatically classify and assign a category label to each pixel in an image acquired through hyperspectral sensors. We first review recent works that are either based on the pure transformer design or utilize a hybrid CNN-transformer approach. Afterwards, we discuss a few recent transformer-based approaches fusing different modalities for hyperspectral image classification.
Pure Transformer-based Methods: Among existing works, the approach of [114] introduces a bi-directional encoder representation from transformers, called HSI-BERT, that strives to capture global dependencies. The proposed architecture is flexible and can be generalized across different regions without the need to perform re-training. A transformer-based backbone, called SpectralFormer, is introduced in [8], which can take pixel-wise or patch-wise inputs and is designed to capture spectrally local sequence knowledge from nearby hyperspectral bands. SpectralFormer utilizes cross-layer skip connections to circulate information from shallow to deep layers by learning soft residuals across layers, thereby producing group-wise spectral embeddings. To circumvent the problem of the fixed geometric structure of convolution kernels, a spectral–spatial transformer network is proposed in [115], comprising a spatial attention and a spectral association module. While the spatial attention aims at connecting the local regions through aggregation of all input feature channels with spatial kernel weights, the spectral association is achieved through the integration of all spatial locations of the corresponding masked feature maps. Transformers are also explored in the spatial and spectral dimensions in [9]. Here, a framework is introduced comprising spectral self-attention that learns to capture interactions along the spectral dimension, and a spatial self-attention designed to pay attention to features along the spatial dimension. The resulting features from both spectral and spatial self-attention are then combined and input to the classifier.
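The distinction between spectral and spatial self-attention can be sketched with a small, dependency-free Python example. It is a simplified illustration and not the implementation of [9]: the query, key and value projections are taken to be the identity (a real transformer learns them), and the toy cube values as well as the names `softmax` and `self_attention` are assumptions of this sketch.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.
    tokens: list of equal-length feature vectors. Returns (outputs, weights)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the value vectors (here the tokens themselves).
        outputs.append([sum(a * tok[c] for a, tok in zip(attn, tokens))
                        for c in range(dim)])
    return outputs, weights

# Toy hyperspectral cube: 3 bands, each flattened over 2 pixels (hypothetical values).
cube = [[0.1, 0.2], [0.4, 0.1], [0.3, 0.9]]
spectral_tokens = cube                              # one token per band
spatial_tokens = [list(px) for px in zip(*cube)]    # one token per pixel
spectral_out, spectral_w = self_attention(spectral_tokens)
spatial_out, spatial_w = self_attention(spatial_tokens)
```

The only difference between the two branches in this sketch is which axis of the cube is treated as the token sequence: bands attend to bands in the spectral case, and pixels attend to pixels in the spatial case.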
Hybrid CNN-Transformers based Methods: Several works recently have explored combining the merits of CNNs and transformers to better capture both the local information as well as long-range dependencies for hyperspectral image classification. To this end, a convolutional transformer network, named CTN, is introduced in [10], which utilizes center position encoding to generate spatial position features by combining pixel positions with spectral features as well as a convolutional transformer to further obtain local-global features, as shown in Figure 10. A hyperspectral image transformer (HiT) classification approach is proposed in [11], where convolutions are embedded into transformer architecture to further integrate local spatial contextual information. The proposed approach comprises two main modules, where one module, called spectral-adaptive 3D convolution projection, is designed to generate spatial–spectral local information via spectral adaptive 3D convolution layers from hyperspectral images. The other module, named Conv-Permutator, employs depthwise convolutions to capture spatial–spectral representations separately along the spectral, height and width dimensions. The work of [12] introduces a multi-scale convolutional transformer that effectively captures spatial–spectral information, which can be integrated with the transformer network. Furthermore, a self-supervised pre-task is defined that masks the token of the central pixel in the encoder, whereas remaining tokens are input to the decoder in order to reconstruct the spectral information corresponding to the central pixel. In [116], a spectral–spatial feature tokenization transformer, called SSFTT, is proposed that generates spectral–spatial and semantic features. The SSFTT comprises a feature extraction module that produces low-level spectral and spatial features by employing a 3D and a 2D convolution layer. 
Furthermore, a Gaussian weighted feature tokenizer is utilized in SSFTT for feature transformation, and the transformed features are then input to a transformer encoder for feature representation. Finally, a linear layer is employed to generate the sample label. Zhao et al. [10] propose a convolutional transformer network (CTN) that employs center position encoding to combine spectral features with pixel positions. The proposed architecture introduces convolutional transformer blocks that effectively integrate local and global features from hyperspectral image patches. Yang et al. [11] introduce a hyperspectral image transformer (HiT) framework where convolution operations are embedded within the transformer design to also integrate local spatial contextual information. The HiT framework comprises a spectral-adaptive 3D convolution projection to capture local spatial–spectral information. Additionally, the HiT framework employs a Conv-Permutator module that uses depthwise convolution for explicitly capturing the spatial–spectral information along different dimensions: height, width and spectral. The work of [116] introduces a spectral–spatial feature tokenization transformer, named SSFTT, that consists of a spectral–spatial feature extraction scheme for encoding shallow spectral–spatial features and a feature transformation module that produces transformed features used as input to the encoder.
Multi-modal Fusion Transformers based Methods: A few recent transformer-based works also explore fusing different modalities, such as hyperspectral, SAR and LiDAR, for hyperspectral image classification. A multi-modal fusion transformer, MFT, is introduced in [117] and comprises a data fusion scheme to derive class tokens in the transformers from multi-modal data (e.g., LiDAR and SAR) along with the standard hyperspectral patch tokens. Furthermore, the attention mechanism within MFT fuses information from tokens of hyperspectral and other modalities into a new token of integrated features. The work of [118] introduces an approach where a spectral sequence transformer is utilized to extract features from hyperspectral images along the spectral dimension and a spatial hierarchical transformer to generate spatial features in a hierarchical manner from both hyperspectral and LiDAR data.
Table 7 shows a comparison of some representative CNN-based approaches with both pure transformers and hybrid CNN-transformers-based methods on two popular hyperspectral image classification benchmarks: Indian Pines and Pavia. The Indian Pines dataset is acquired through airborne visible/infrared imaging spectrometer (AVIRIS) sensors in Northwestern Indiana, USA. Here, the images comprise 145 × 145 pixels in the spatial dimension at a ground sampling distance (GSD) of 20 m with 220 spectral bands that cover the wavelength range of 400–2500 nm. After the removal of noisy bands, 200 spectral bands are retained. The original dataset contains 16 classes, where several methods discard the small classes. For the remaining categories, the number of training samples is 200 per class. The Pavia dataset comprises images acquired through the reflective optics system imaging spectrometer (ROSIS) sensor over Pavia, Italy. Here, the images consist of 610 × 340 pixels in the spatial dimension at a GSD of 1.3 m with 103 spectral bands covering from 430 to 860 nm. The dataset contains nine categories, where the number of training samples is 200 per class. Generally, three metrics are used to evaluate the performance of methods quantitatively: overall accuracy, average accuracy and kappa coefficient. The overall accuracy (OA) denotes the proportion of correctly classified test samples, whereas average accuracy (AA) reflects the average recognition accuracy for each category. The kappa coefficient refers to the consistency between the generated classification maps from the model and the available ground truth. Figure 11 presents a qualitative comparison between HSI-BERT [114] and other existing CNN-based methods on the Pavia dataset.
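The three metrics can be computed from a confusion matrix, as in the following minimal Python sketch. The function name `classification_metrics` and the toy two-class matrix are assumptions made for illustration, and the sketch assumes that every class has at least one test sample.

```python
def classification_metrics(confusion):
    """Compute (OA, AA, kappa) from a confusion matrix where
    confusion[i][j] counts samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: proportion of correctly classified samples.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (recalls).
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Kappa: agreement corrected for the agreement expected by chance.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(row[j] for row in confusion) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa

# Toy two-class confusion matrix (hypothetical counts).
confusion = [[18, 2], [3, 7]]
oa, aa, kappa = classification_metrics(confusion)
print(oa, aa, kappa)
```

In this toy example, OA and AA differ because the two classes have different sizes, which is why both are usually reported on imbalanced hyperspectral benchmarks.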

6.2. Hyperspectral Pansharpening

In the hyperspectral pansharpening problem, the task is to spatially enhance a low-resolution hyperspectral image using the spatial information from a registered panchromatic image, while preserving the spectral information of the low-resolution image. Pansharpening plays an important role in a variety of tasks in remote sensing, including classification and change detection. Previously, CNN-based approaches have shown promising results for this task. Recently, transformer-based methods have performed favorably for this problem by also utilizing the useful global contextual information. A multi-scale spatial–spectral interaction transformer, MSIT, is proposed by [121] that comprises a convolution–transformer encoder to extract multi-scale local and global features from low-resolution and panchromatic images. The work of [122] introduces an architecture where global features are constructed using transformers and local features are computed using a shallow CNN. These multi-scale features extracted in a pyramidal fashion are learned simultaneously. The proposed approach further introduces a loss formulation in which spatial and spectral losses are used simultaneously for training on real data. Liang et al. [123] propose a framework, named PMACNet, where both the region-of-interest from the low-resolution image and the residuals for regression to the high-resolution image are learned in a parallel CNN structure. Afterwards, a pixel-wise attention module is utilized to adapt the residuals based on the learned region-of-interest.
A transformer-based regression network is introduced by [124], where the feature extraction of spatial and spectral information is performed by utilizing a Swin transformer model. The work of [125] introduces a transformer-based approach, where multi-spectral and panchromatic features are formulated as keys and queries for enabling joint learning of features across the modalities. Furthermore, this work employs an invertible neural module to perform effective fusion of the features for generating the pansharpened images. Bandara et al. [126] propose a framework comprising separate feature extractors for panchromatic and hyperspectral images, a soft attention mechanism and a spectral-spatial fusion module. The pansharpened image quality is improved by learning cross-feature space dependencies of the different features.
To summarize the review of transformers in hyperspectral imaging, we provide a holistic overview of the existing techniques in literature in Table 8.

7. Transformers in SAR Imagery

As discussed earlier, SAR images are constructed from the signals of electromagnetic waves transmitted by a sensor platform to the surface of the Earth. SAR possesses unique characteristics because it is unaffected by different environmental conditions, such as day, night and fog. Here, we review recent transformer-based approaches for SAR imaging tasks.

7.1. SAR Image Interpretation

Classification: Accurately classifying the target categories within SAR images is a challenging problem with numerous real-world applications. Recently, transformers have been explored for automatic interpretation and target recognition in SAR imagery. The work of [141] explores vision transformers for polarimetric SAR (PolSAR) image classification. In this framework, the pixel values of the image patches are considered as tokens and the self-attention mechanism is employed to capture long-range dependencies followed by multi-layer perceptron (MLP) and learnable class tokens to integrate features. A contrastive learning technique is utilized within the framework to reduce the redundancies and perform the classification task. Figure 12 shows the overview of the framework and a qualitative comparison in terms of supervised classification is presented in Figure 13.
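The tokenization step described above, where image patches are treated as tokens, can be sketched in a few lines of Python. This is a simplified single-channel illustration and not the code of [141] (PolSAR data carry several polarimetric channels per pixel); the function name `image_to_patch_tokens` and the toy image are assumptions of this sketch.

```python
def image_to_patch_tokens(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches
    and flatten each patch into one token, in row-major patch order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the pixel values of this patch into a single vector.
            token = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            tokens.append(token)
    return tokens

# Toy 4 x 4 single-channel image split into 2 x 2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, 2)
print(len(tokens), tokens[0])
```

In a vision transformer, each such flattened patch would then be linearly projected and combined with a positional embedding before entering the self-attention layers.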
Other than the aforementioned pure transformer-based approach, hybrid methods utilizing both CNNs and transformers also exist in the literature. The work of [142] introduces a global–local network structure (GLNS) framework that combines the merits of CNNs and transformers for SAR image classification. The proposed GLNS employs a lightweight CNN along with an efficient vision transformer to capture both local and global features, which are later fused to perform the classification task. Other than standard fully-supervised learning, transformers are also explored in the limited supervision regime, such as few-shot SAR image classification. Cai et al. [143] introduce a few-shot SAR classification approach, named ST-PN, where a spatial transformer network is utilized for performing spatial alignment on CNN-based features.
Segmentation and Detection: Detection and segmentation in SAR imagery are vital for different applications, such as crop identification, target detection and terrain mapping. In SAR imagery, segmentation can be challenging due to the appearance of speckle, which is a type of multiplicative noise that increases with the back-scattering radar magnitude. Among recent transformer-based approaches, the work of [144] introduces a framework, named GCBANet, for SAR ship instance segmentation. Within the GCBANet framework, a global contextual block is employed to encode spatial holistic long-range dependencies. Furthermore, a boundary-aware box prediction technique is introduced to predict the boundaries of the ship. Xia et al. [145] introduce an approach, named CRTransSar, that combines the benefits of CNNs and transformers to capture both local and global information for SAR object detection. The proposed CRTransSar works by constructing a backbone with attention and convolutional blocks. A geospatial transformer framework is introduced in [146], comprising the steps of image decomposition, multi-scale geo-spatial contextual attention and recomposition for detecting aircraft in SAR imagery. A feature relation enhancement framework is proposed in [147] for aircraft detection in SAR imagery. The proposed framework adopts a fusion pyramid structure to combine features of different levels and scales. Further, a context attention enhancement technique is employed to improve the positioning accuracy in complex backgrounds.
Other than ship and aircraft detection, the recent work of [148] introduces a transformer-based framework for 3D detection of oil tank targets in SAR imagery. In this framework, the incidence angle is input to the transformer as a prior token followed by a feature description operator that utilizes scattering centers for refining the predictions.

7.2. Others

Apart from SAR image classification, detection and segmentation, a few works exist exploring transformers for other SAR imaging problems, such as image despeckling.
SAR Image Despeckling: The aforementioned interpretation of SAR imaging is made challenging due to the degradation of images caused by a multiplicative noise known as speckle. Recently, transformers have been explored for SAR image despeckling. The work of [149] introduces a transformer-based framework comprising an encoder that learns global dependencies among various SAR image regions. The transformer-based network is trained in an end-to-end fashion with synthetic speckled data by utilizing a composite loss function.
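To illustrate what synthetic speckled training data can look like, the following minimal Python sketch multiplies a clean intensity image by unit-mean gamma-distributed noise, which is the standard model for fully developed multi-look intensity speckle. The data generation procedure of [149] is not detailed here, so this is an assumption-based sketch; the function name `add_speckle`, the number of looks and the toy image are hypothetical.

```python
import random

def add_speckle(intensity, looks=1, seed=None):
    """Multiply each pixel of an intensity image (list of rows) by
    gamma noise with shape `looks` and scale 1/looks, i.e. mean 1."""
    rng = random.Random(seed)
    return [[pixel * rng.gammavariate(looks, 1.0 / looks) for pixel in row]
            for row in intensity]

# Toy example: a constant 200 x 200 image corrupted with 4-look speckle.
clean = [[1.0] * 200 for _ in range(200)]
speckled = add_speckle(clean, looks=4, seed=0)
mean = sum(map(sum, speckled)) / (200 * 200)
print(round(mean, 2))  # stays close to the clean mean because the noise has unit mean
```

Because the noise is multiplicative, its variance scales with the underlying intensity, and a larger number of looks yields weaker speckle.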
Change Detection in SAR Images: SAR images can be affected by imaging noise, which presents challenges when detecting changes in high-resolution (HR) SAR data. Recently, a self-supervised contrastive representation learning technique has been proposed by [150], where hierarchical representations are constructed using a convolution-enhanced transformer to distinguish the changes from HR SAR images. A convolution-based module is introduced to enable interactions across windows when performing self-attention computations within local windows.
SAR Image Registration: Several applications, such as change detection, involve joint analysis and processing of multiple SAR images that are likely acquired in different imaging conditions. Thus, accurate SAR image registration is desired, where the reference and the sensed images are registered. The recent work of [151] explores transformers for large-size SAR dense-matching registration. Here, a hybrid CNN-transformer is employed to register images under weak texture conditions. First, coarse registration is performed via the down-sampled original SAR image. Then, cluster centers of registration points are selected from the previous coarse registration step. Afterwards, the registration of image pairs is performed using a CNN-transformer module. Lastly, the resulting point pair subsets are integrated to achieve the final global transformation through RANSAC.
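The final RANSAC step can be sketched in plain Python as follows. For brevity, the sketch estimates only a 2D translation, which is a deliberate simplification: the transformation model used in [151] is not specified here and is likely richer (e.g., affine). The function name `ransac_translation`, the threshold and the toy point pairs are assumptions of this example.

```python
import random

def ransac_translation(pairs, threshold=1.0, iterations=100, seed=0):
    """Estimate a 2D shift (tx, ty) from matched point pairs
    ((x, y), (x2, y2)) while ignoring outlier matches."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # A translation is fully determined by a single correspondence.
        (x, y), (x2, y2) = rng.choice(pairs)
        tx, ty = x2 - x, y2 - y
        # Collect all pairs that agree with this hypothesis.
        inliers = [((ax, ay), (bx, by)) for (ax, ay), (bx, by) in pairs
                   if (ax + tx - bx) ** 2 + (ay + ty - by) ** 2 <= threshold ** 2]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the shift on the largest consensus set.
    n = len(best_inliers)
    tx = sum(b[0] - a[0] for a, b in best_inliers) / n
    ty = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (tx, ty), n

# Toy matches: eight pairs shifted by (5, -3) plus two gross outliers.
pairs = [((i, 2 * i), (i + 5, 2 * i - 3)) for i in range(8)]
pairs += [((0, 0), (40, 40)), ((1, 1), (-20, 7))]
shift, n_inliers = ransac_translation(pairs)
print(shift, n_inliers)
```

The same hypothesize-and-verify loop applies to richer models; only the minimal sample size and the fitting step change.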
In summary, we present a holistic overview of the existing transformers techniques in SAR imagery in Table 9.

8. Conclusions

In this work, we presented a broad overview of transformers in remote sensing imaging: very high-resolution (VHR), hyperspectral and synthetic aperture radar (SAR). Within these different types of remote sensing imagery, we further discussed transformer-based approaches on a variety of tasks, such as classification, detection and segmentation. Our survey covers more than 60 transformer-based remote sensing research works in the literature. We observed transformers to obtain favorable performance on different remote sensing tasks, likely due to their capability to capture long-range dependencies along with their representation flexibility. Further, the public availability of several standard transformer architectures and backbones makes it easier to explore their applicability in remote sensing imaging problems.
Open Research Directions: As discussed earlier, most existing transformer-based recognition approaches employ backbones pre-trained on the ImageNet dataset. One exception is the work of [7], which explores pre-training vision transformers on a large-scale remote sensing dataset. However, in both cases the pre-training is performed in a supervised fashion. An open direction is to explore large-scale pre-training in a self-supervised fashion by taking into account an abundant amount of unlabeled remote sensing imaging data.
Our survey also shows that most existing approaches typically utilize a hybrid architecture where the aim is to combine the merits of convolutions and self-attention. However, transformers are typically known to have a higher computational cost to compute global self-attention. Several recent works have explored different improvements in the transformer design, such as reduced computational overhead [165], efficient hybrid CNN-transformer backbones [166] and unified architectures for image and video classification [167]. Moreover, because transformers typically require more training data, there is a need to construct larger-scale datasets in remote sensing imaging. For most problems discussed in this work, and especially in the case of object detection, heavy backbones are typically utilized to achieve better detection accuracy. However, this significantly slows down the speed of the aerial detector. An interesting open direction is to design light-weight transformer-based backbones to classify and detect oriented targets in remote sensing imagery. Another open research direction is to explore the adaptability of the transformer-based models to heterogeneous sources of images, such as SAR and UAV (e.g., change detection).
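The cost of global self-attention can be made concrete with a short back-of-the-envelope calculation. The following Python sketch counts the number of query-key pairs for global attention and for attention restricted to non-overlapping local windows, in the spirit of the Swin-style backbones mentioned earlier. The image size, patch size and window size are hypothetical values chosen so that the divisions are exact, and the function names are assumptions of this sketch.

```python
def attention_pairs_global(height, width, patch):
    """Number of query-key pairs when every token attends to every token."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens

def attention_pairs_windowed(height, width, patch, window):
    """Number of query-key pairs when attention is restricted to
    non-overlapping windows of window x window tokens."""
    rows, cols = height // patch, width // patch
    n_windows = (rows // window) * (cols // window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window ** 2

# Hypothetical setting: a 1024 x 1024 image, 4 x 4 patches, 8 x 8 token windows.
global_pairs = attention_pairs_global(1024, 1024, 4)
window_pairs = attention_pairs_windowed(1024, 1024, 4, 8)
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

Under these assumptions, global attention involves 1024 times more query-key pairs than windowed attention, which illustrates why large aerial images are costly to process with global self-attention.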
In this survey, we also observe that several existing approaches utilize transformers in a plug-and-play fashion for remote sensing. This leads to the need to design effective domain-specific architectural components and loss formulations to further boost the performance. Moreover, it is intriguing to study the adversarial feature space of vision transformer models that are pre-trained on remote sensing benchmarks and their transferability.
In the future, it is expected that more sophisticated pure transformer architectures with specifically designed self-attention mechanisms for remote sensing problems will be explored. Another potential future research direction is to investigate new hybrid CNN-transformer architectures that leverage the capabilities of convolutions and self-attention in the context of remote sensing tasks.
Additionally, we intend to frequently update and maintain the latest transformers in remote sensing papers with their respective code at https://github.com/VIROBO-15/Transformer-in-Remote-Sensing, accessed on 6 February 2023.

Author Contributions

Conceptualization, A.A.A., F.S.K. and A.K.; methodology, A.A.A., F.S.K. and A.K.; validation, A.A.A., F.S.K. and A.K.; formal analysis, A.A.A., F.S.K. and A.K.; investigation, A.A.A., F.S.K. and A.K.; resources, A.A.A., F.S.K. and A.K.; writing—original draft preparation, A.A.A., A.K.; writing—review and editing, A.A.A., F.S.K. and A.K.; supervision, R.M.A., S.K., H.C., F.S.K. and G.-S.X.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank and express their deepest gratitude to Mohamed bin Zayed University of Artificial Intelligence for its constant and helpful support throughout the research journey.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the ICLR, Virtual-Only, 3–7 May 2021. [Google Scholar]
  2. Naseer, M.; Ranasinghe, K.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Intriguing Properties of Vision Transformers. In Proceedings of the NeurIPS, Virtual-Only, 7–10 December 2021. [Google Scholar]
  3. Park, N.; Kim, S. How Do Vision Transformers Work? In Proceedings of the ICLR, Virtual-Only, 25 April 2022. [Google Scholar]
  4. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  5. Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-Stream Swin Transformer with Differentiable Sobel Operator for Remote Sensing Image Classification. Remote Sens. 2022, 14, 1507. [Google Scholar] [CrossRef]
  6. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo–Heterogenous Transformer Learning Framework for RS Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2223–2239. [Google Scholar] [CrossRef]
  7. Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens. 2022. [Google Scholar] [CrossRef]
  8. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615. [Google Scholar] [CrossRef]
  9. Liu, B.; Yu, A.; Gao, K.; Tan, X.; Sun, Y.; Yu, X. DSS-TRM: Deep spatial–spectral transformer for hyperspectral image classification. Eur. J. Remote Sens. 2022, 55, 103–114. [Google Scholar] [CrossRef]
  10. Zhao, Z.; Hu, D.; Wang, H.; Yu, X. Convolutional Transformer Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  11. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral Image Transformer Classification Networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5528715. [Google Scholar] [CrossRef]
  12. Jia, S.; Wang, Y. Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification. arXiv 2022, arXiv:2203.04771. [Google Scholar]
  13. Tuia, D.; Volpi, M.; Copa, L.; Kanevski, M.; Munoz-Mari, J. A survey of active learning algorithms for supervised remote sensing image classification. IEEE J. Sel. Top. Signal Process. 2011, 5, 606–617. [Google Scholar] [CrossRef]
  14. Camps-Valls, G.; Tuia, D.; Bruzzone, L.; Benediktsson, J.A. Advances in hyperspectral image classification: Earth monitoring with statistical learning methods. IEEE Signal Process. Mag. 2013, 31, 45–54. [Google Scholar] [CrossRef] [Green Version]
  15. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef] [Green Version]
  16. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. NeurIPS 2017, 30, 600–610. [Google Scholar]
  18. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2021, 54, 1–41. [Google Scholar] [CrossRef]
  19. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. arXiv 2022, arXiv:2201.09873. [Google Scholar]
  20. Selva, J.; Johansen, A.; Escalera, S.; Nasrollahi, K.; Moeslund, T.; Clapes, A. Video Transformers: A Survey. arXiv 2022, arXiv:2201.05991. [Google Scholar] [CrossRef]
  21. Teng, M.Y.; Mehrubeoglu, R.; King, S.A.; Cammarata, K.; Simons, J. Investigation of epifauna coverage on seagrass blades using spatial and spectral analysis of hyperspectral images. In Proceedings of the 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Gainesville, FL, USA, 26–28 June 2013; pp. 1–4. [Google Scholar]
  22. Notesco, G.; Dor, E.B.; Brook, A. Mineral mapping of makhtesh ramon in israel using hyperspectral remote sensing day and night LWIR images. In Proceedings of the 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Lausanne, Switzerland, 24–27 June 2014; pp. 1–4. [Google Scholar]
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. NeurIPS 2012, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. NeurIPS 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  25. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the CVPR, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  26. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  28. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  31. Deng, P.; Xu, K.; Huang, H. When CNNs meet vision transformer: A joint framework for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  32. Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143. [Google Scholar] [CrossRef]
  33. Long, Y.; Xia, G.S.; Li, S.; Yang, W.; Yang, M.Y.; Zhu, X.X.; Zhang, L.; Li, D. On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances and Million-AID. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4205–4230. [Google Scholar] [CrossRef]
  34. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839–847. [Google Scholar]
  35. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef] [Green Version]
  36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  37. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens. 2021, 13, 4779. [Google Scholar] [CrossRef]
  38. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  39. Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Liu, X.; Wa, S.; Chen, S.; Ma, Q. GANsformer: A Detection Network for Aerial Images with High Performance Combining Convolutional Network and Transformer. Remote Sens. 2022, 14, 923. [Google Scholar] [CrossRef]
  41. Zheng, Y.; Sun, P.; Zhou, Z.; Xu, W.; Ren, Q. ADT-Det: Adaptive Dynamic Refined Single-Stage Transformer Detector for Arbitrary-Oriented Object Detection in Satellite Optical Imagery. Remote Sens. 2021, 13, 2623. [Google Scholar] [CrossRef]
  42. Tang, J.; Zhang, W.; Liu, H.; Yang, M.; Jiang, B.; Hu, G.; Bai, X. Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 4563–4572. [Google Scholar]
  43. Dai, Y.; Yu, J.; Zhang, D.; Hu, T.; Zheng, X. RODFormer: High-Precision Design for Rotating Object Detection with Transformers. Sensors 2022, 22, 2633. [Google Scholar] [CrossRef]
  44. Zhou, Q.; Yu, C. Point RCNN: An Angle-Free Framework for Rotated Object Detection. Remote Sens. 2022, 14, 2605. [Google Scholar] [CrossRef]
  45. Liu, X.; Ma, S.; He, L.; Wang, C.; Chen, Z. Hybrid Network Model: TransConvNet for Oriented Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 2090. [Google Scholar] [CrossRef]
  46. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for Aerial Object Detection. In Proceedings of the IEEE/CVF, Nashville, TN, USA, 20–25 June 2021; pp. 1829–1838. [Google Scholar]
  47. Ma, T.; Mao, M.; Zheng, H.; Gao, P.; Wang, X.; Han, S.; Ding, E.; Zhang, B.; Doermann, D. Oriented Object Detection with Transformer. arXiv 2021, arXiv:2106.03146. [Google Scholar]
  48. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. arXiv 2022, arXiv:2205.12785. [Google Scholar] [CrossRef]
  49. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  50. Muzein, B.S. Remote Sensing & GIS for Land Cover, Land Use Change Detection and Analysis in the Semi-Natural Ecosystems and Agriculture Landscapes of the Central Ethiopian Rift Valley. Ph.D. Thesis, Institute of Photogrammetry and Remote Sensing, Technology University of Dresden, Dresden, Germany, 2006. [Google Scholar]
  51. Haack, B.; Wolf, J.; English, R. Remote sensing change detection of irrigated agriculture in Afghanistan. Geocarto Int. 1998, 13, 65–75. [Google Scholar] [CrossRef]
  52. Bolorinos, J.; Ajami, N.K.; Rajagopal, R. Consumption change detection for urban planning: Monitoring and segmenting water customers during drought. Water Resour. Res. 2020, 56, e2019WR025812. [Google Scholar] [CrossRef]
  53. Metternicht, G. Change detection assessment using fuzzy sets and remotely sensed data: An application of topographic map revision. ISPRS J. Photogramm. Remote Sens. 1999, 54, 221–233. [Google Scholar] [CrossRef]
  54. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
  55. Guo, Q.; Zhang, J.; Zhu, S.; Zhong, C.; Zhang, Y. Deep multiscale Siamese network with parallel convolutional structure and self-attention for change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 3131993. [Google Scholar] [CrossRef]
  56. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713. [Google Scholar] [CrossRef]
  57. Wang, G.; Li, B.; Zhang, T.; Zhang, S. A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection. Remote Sens. 2022, 14, 2228. [Google Scholar] [CrossRef]
  58. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622519. [Google Scholar] [CrossRef]
  59. Ke, Q.; Zhang, P. Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation. Int. J. Geo-Inform. 2022, 11, 263. [Google Scholar] [CrossRef]
  60. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  61. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  62. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the ICIP, Athens, Greece, 7 October 2018; pp. 4063–4067. [Google Scholar]
  63. Alcantarilla, P.F.; Stent, S.; Ros, G.; Arroyo, R.; Gherardi, R. Street-view change detection with deconvolutional networks. Auton. Robot. 2018, 42, 1301–1322. [Google Scholar] [CrossRef]
  64. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206. [Google Scholar] [CrossRef]
  65. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  66. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images. Remote Sens. 2022, 14, 1956. [Google Scholar] [CrossRef]
  67. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003. [Google Scholar] [CrossRef]
  68. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20. [Google Scholar] [CrossRef]
  69. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
  70. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 27 August 2022).
  71. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 27 August 2022).
  72. Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
  73. Xiao, X.; Guo, W.; Chen, R.; Hui, Y.; Wang, J.; Zhao, H. A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction. Remote Sens. 2022, 14, 2611. [Google Scholar] [CrossRef]
  74. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 14, 2611. [Google Scholar] [CrossRef]
  75. Qiu, C.; Li, H.; Guo, W.; Chen, X.; Yu, A.; Tong, X.; Schmitt, M. Transferring transformer-based models for cross-area building extraction from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4104–4116. [Google Scholar] [CrossRef]
  76. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the SIGSPATIAL, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  77. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1155–1167. [Google Scholar] [CrossRef]
  78. Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1735–1739. [Google Scholar] [CrossRef]
  79. Li, Y.; Zhu, Z.; Yu, J.G.; Zhang, Y. Learning deep cross-modal embedding networks for zero-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10590–10603. [Google Scholar] [CrossRef]
  80. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. Isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 28–37. [Google Scholar]
  81. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the ICPRAM, Porto, Portugal, 24–26 February 2017. [Google Scholar]
  82. Lebedev, M.; Vizilter, Y.V.; Vygolov, O.; Knyaz, V.; Rubis, A.Y. Change Detection in remote sensing images using conditional adversarial networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, 42, 324–331. [Google Scholar] [CrossRef] [Green Version]
  83. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  84. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  85. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  86. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the ICIP, Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar]
  87. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef] [Green Version]
  88. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 11207–11216. [Google Scholar]
  89. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar]
  90. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the ICDAR, Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
  91. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proceedings of the ICDAR, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459. [Google Scholar]
  92. Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the CVPR, Providence, RI, USA, 16–21 June 2012; pp. 1083–1090. [Google Scholar]
  93. He, M.; Liu, Y.; Yang, Z.; Zhang, S.; Luo, C.; Gao, F.; Zheng, Q.; Wang, Y.; Zhang, X.; Jin, L. ICPR2018 contest on robust reading for multi-type web images. In Proceedings of the ICPR, Beijing, China, 20–24 August 2018; pp. 7–12. [Google Scholar]
  94. Ch’ng, C.K.; Chan, C.S. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the ICDAR, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 935–942. [Google Scholar]
  95. Yuliang, L.; Lianwen, J.; Shuaitao, Z.; Sheng, Z. Detecting curve text in the wild: New dataset and new solution. arXiv 2017, arXiv:1712.02170. [Google Scholar]
  96. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  97. Shen, X.; Liu, B.; Zhou, Y.; Zhao, J. Remote sensing image caption generation via transformer and reinforcement learning. Multimed. Tools Appl. 2020, 79, 26661–26682. [Google Scholar] [CrossRef]
  98. Liu, C.; Zhao, R.; Shi, Z. Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506605. [Google Scholar] [CrossRef]
  99. Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning. Remote Sens. 2022, 14, 2939. [Google Scholar] [CrossRef]
  100. Lei, S.; Shi, Z.; Mo, W. Transformer-Based Multistage Enhancement for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615611. [Google Scholar] [CrossRef]
  101. Ye, C.; Yan, L.; Zhang, Y.; Zhan, J.; Yang, J.; Wang, J. A Super-resolution Method of Remote Sensing Image Using Transformers. IDAACS 2021, 2, 905–910. [Google Scholar]
  102. An, T.; Zhang, X.; Huo, C.; Xue, B.; Wang, L.; Pan, C. TR-MISR: Multiimage Super-Resolution Based on Feature Fusion with Transformers. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1373–1388. [Google Scholar] [CrossRef]
  103. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816. [Google Scholar] [CrossRef]
  104. Daudt, R.C.; Le Saux, B.; Boulch, A.; Gousseau, Y. Urban change detection for multispectral earth observation using convolutional neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2115–2118. [Google Scholar]
  105. Daudt, R.C.; Le Saux, B.; Boulch, A.; Gousseau, Y. Multitask learning for large-scale semantic change detection. Comput. Vis. Image Underst. 2019, 187, 102783. [Google Scholar] [CrossRef] [Green Version]
  106. Shen, L.; Lu, Y.; Chen, H.; Wei, H.; Xie, D.; Yue, J.; Chen, R.; Lv, S.; Jiang, B. S2Looking: A satellite side-looking dataset for building change detection. Remote Sens. 2021, 13, 5094. [Google Scholar] [CrossRef]
  107. Barley Remote Sensing Dataset. Available online: https://tianchi.aliyun.com/dataset/dataDetail?dataId=74952 (accessed on 27 August 2022).
  108. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The inria aerial image labeling benchmark. In Proceedings of the IGARSS, Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229. [Google Scholar]
  109. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef] [Green Version]
  110. MEGA. Available online: https://mega.nz/folder/wCpSzSoS#RXzIlrv–TDt3ENZdKN8JA (accessed on 27 August 2022).
  111. MEGA. Available online: https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ (accessed on 27 August 2022).
  112. Märtens, M.; Izzo, D.; Krzic, A.; Cox, D. Super-resolution of PROBA-V images using convolutional neural networks. Astrodynamics 2019, 3, 387–402. [Google Scholar] [CrossRef]
  113. Available online: http://weegee.vision.ucmerced.edu/datasets/landuse.html (accessed on 27 August 2022).
  114. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Trans. Geosci. Remote Sens. 2019, 58, 165–178. [Google Scholar] [CrossRef]
  115. Zhong, Z.; Li, Y.; Ma, L.; Li, J.; Zheng, W.S. Spectral-spatial transformer network for hyperspectral image classification: A factorized architecture search framework. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5514715. [Google Scholar] [CrossRef]
  116. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  117. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. arXiv 2022, arXiv:2203.16952. [Google Scholar]
  118. Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process. 2022, 31, 3095–3110. [Google Scholar] [CrossRef]
  119. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef] [Green Version]
  120. Li, W.; Wu, G.; Zhang, F.; Du, Q. Hyperspectral Image Classification Using Deep Pixel-Pair Features. IEEE Trans. Geosci. Remote Sens. 2017, 2, 844–853. [Google Scholar] [CrossRef]
  121. Zhang, F.; Zhang, K.; Sun, J. Multiscale Spatial–Spectral Interaction Transformer for Pan-Sharpening. Remote Sens. 2022, 14, 1736. [Google Scholar] [CrossRef]
  122. Li, S.; Guo, Q.; Li, A. Pan-Sharpening Based on CNN+ Pyramid Transformer by Using No-Reference Loss. Remote Sens. 2022, 14, 624. [Google Scholar] [CrossRef]
  123. Liang, Y.; Zhang, P.; Mei, Y.; Wang, T. PMACNet: Parallel Multiscale Attention Constraint Network for Pan-Sharpening. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5512805. [Google Scholar] [CrossRef]
  124. Su, X.; Li, J.; Hua, Z. Transformer-Based Regression Network for Pansharpening Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5407423. [Google Scholar] [CrossRef]
  125. Zhou, M.; Huang, J.; Fang, Y.; Fu, X.; Liu, A. Pan-Sharpening with Customized Transformer and Invertible Neural Network. AAAI 2022, 36, 3553–3561. [Google Scholar] [CrossRef]
  126. Bandara, W.; Patel, V. HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 1767–1777. [Google Scholar]
  127. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3. Available online: https://purr.purdue.edu/publications/1947/1 (accessed on 27 August 2022).
  128. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_and_University (accessed on 27 August 2022).
  129. Available online: https://hyperspectral.ee.uh.edu/?page_id=459 (accessed on 27 August 2022).
  130. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Salinas (accessed on 27 August 2022).
  131. Gader, P.; Zare, A.; Close, R.; Aitken, J.; Tuell, G. Muufl Gulfport Hyperspectral and Lidar Airborne Data Set; Technical Report REP-2013-570; University of Florida: Gainesville, FL, USA, 2013. [Google Scholar]
  132. Hyperspectral Image Analysis Lab. Available online: https://hyperspectral.ee.uh.edu/?page_id=1075 (accessed on 27 August 2022).
  133. Pavia Centre Scene. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_scene (accessed on 27 August 2022).
  134. Zhou, H.; Liu, Q.; Wang, Y. PanFormer: A Transformer Based Model for Pan-sharpening. arXiv 2022, arXiv:2203.02916. [Google Scholar]
  135. WorldView-2 Full Archive and Tasking. Available online: https://earth.esa.int/eogateway/catalog/worldview-2-full-archive-and-tasking (accessed on 27 August 2022).
  136. WorldView-3 Full Archive and Tasking. Available online: https://earth.esa.int/eogateway/catalog/worldview-3-full-archive-and-tasking (accessed on 27 August 2022).
  137. Botswana. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Botswana (accessed on 27 August 2022).
  138. Yokoya, N.; Iwasaki, A. Airborne Hyperspectral Data over Chikusei; Technical Report; Space Application Laboratory, University of Tokyo: Tokyo, Japan, 2016; Volume 5. [Google Scholar]
  139. Pleiades. Available online: https://pleiades.stoa.org/downloads (accessed on 27 August 2022).
  140. QuickBird Full Archive. Available online: https://earth.esa.int/eogateway/catalog/quickbird-full-archive (accessed on 27 August 2022).
  141. Dong, H.; Zhang, L.; Zou, B. Exploring Vision Transformers for Polarimetric SAR Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5219715. [Google Scholar] [CrossRef]
  142. Liu, X.; Wu, Y.; Liang, W.; Cao, Y.; Li, M. High Resolution SAR Image Classification Using Global-Local Network Structure Based on Vision Transformer and CNN. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4505405. [Google Scholar] [CrossRef]
  143. Cai, J.; Zhang, Y.; Guo, J.; Zhao, X.; Lv, J.; Hu, Y. ST-PN: A Spatial Transformed Prototypical Network for Few-Shot SAR Image Classification. Remote Sens. 2022, 14, 2019.
  144. Ke, X.; Zhang, X.; Zhang, T. GCBANet: A Global Context Boundary-Aware Network for SAR Ship Instance Segmentation. Remote Sens. 2022, 14, 2165.
  145. Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection. Remote Sens. 2022, 14, 1488.
  146. Chen, L.; Luo, R.; Xing, J.; Li, Z.; Yuan, Z.; Cai, X. Geospatial transformer is what you need for aircraft detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
  147. Zhang, P.; Xu, H.; Tian, T.; Gao, P.; Tian, J. SFRE-Net: Scattering Feature Relation Enhancement Network for Aircraft Detection in SAR Images. Remote Sens. 2022, 14, 2076.
  148. Ma, C.; Zhang, Y.; Guo, J.; Hu, Y.; Geng, X.; Li, F.; Lei, B.; Ding, C. End-to-End Method with Transformer for 3D Detection of Oil Tank from Single SAR Image. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5217619.
  149. Perera, M.; Bandara, W.; Valanarasu, J.; Patel, V. Transformer-based SAR Image Despeckling. arXiv 2022, arXiv:2201.09355.
  150. Dong, H.; Ma, W.; Jiao, L.; Liu, F.; Shang, R.; Li, Y.; Bai, J. A Contrastive Learning Transformer for Change Detection in High-Resolution SAR Images; SSRN 4169439; SSRN: Rochester, NY, USA, 2022.
  151. Fan, Y.; Wang, F.; Wang, H. A Transformer-Based Coarse-to-Fine Wide-Swath SAR Image Registration Method under Weak Texture Conditions. Remote Sens. 2022, 14, 1175.
  152. Norikane, L.; Broek, B.; Freeman, A. Application of modified VICAR/IBIS GIS to analysis of July 1991 Flevoland AIRSAR data. In Proceedings of the AIRSAR Workshop, Pasadena, CA, USA, 1–5 June 1992; Volume 3.
  153. E-SAR—The Airborne SAR System of DLR. Available online: https://www.dlr.de/hr/en/desktopdefault.aspx/tabid-2326/3776_read-5679/ (accessed on 27 August 2022).
  154. Available online: https://ietr-lab.univ-rennes1.fr/polsarpro-bio/san-francisco/dataset/SAN_FRANCISCO_AIRSAR.zip (accessed on 27 August 2022).
  155. Use Data. Available online: https://www.eorc.jaxa.jp/ALOS/en/alos-2/a2_data_e.htm (accessed on 27 August 2022).
  156. GF-3 (Gaofen-3). Available online: https://directory.eoportal.org/web/eoportal/satellite-missions/g/gaofen-3 (accessed on 27 August 2022).
  157. F-SAR—The New Airborne SAR System. Available online: https://www.dlr.de/hr/en/desktopdefault.aspx/tabid-2326/3776_read-5691/ (accessed on 27 August 2022).
  158. MSTAR Overview. Available online: https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 27 August 2022).
  159. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. In Proceedings of the BIGSARDATA, Beijing, China, 3–14 November 2017; pp. 1–6.
  160. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation. IEEE Access 2020, 8, 120234–120254.
  161. CryoSat Products. Available online: https://earth.esa.int/eogateway/catalog/cryosat-products (accessed on 27 August 2022).
  162. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the ICCV, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423.
  163. TerraSAR-X ESA Archive. Available online: https://earth.esa.int/eogateway/catalog/terrasar-x-esa-archive (accessed on 27 August 2022).
  164. Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
  165. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 12124–12134.
  166. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the ICLR, Virtual-Only, 25 April 2022.
  167. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 4804–4814.
Figure 1. Recent transformer-based techniques in remote sensing imaging. Left and middle: pie charts showing statistics of the articles covered in this survey, broken down by remote sensing imaging problem and by data type representation. Right: a plot illustrating the consistent recent increase in the number of papers.
Figure 2. Example hyperspectral images from Pavia, Indian Pines and Kennedy Space Center datasets (a), example VHR images (b) and SAR images of the L band E-SAR dataset (c).
Figure 3. The vision transformer’s architecture is shown on the left and the encoder block’s specifications are shown on the right. The input image is first divided into patches. These are then flattened and projected into a feature space, where a transformer encoder processes them to create the classification output. * indicates the extra learnable [class] embedding. Adapted with permission from [1,19].
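The tokenization step described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from [1,19]: the function names `patchify` and `embed_patches`, the image size, the patch size and the embedding width are all hypothetical, the weights are random rather than learned, and the position embedding follows the standard ViT design rather than anything stated in the caption.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, proj, cls_token, pos_embed):
    """Project flattened patches, prepend the [class] token, add positions."""
    tokens = patchify(image, patch_size) @ proj                 # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], 0)    # (N + 1, D)
    return tokens + pos_embed

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patch_size, dim = 16, 32
num_patches = (64 // patch_size) ** 2
proj = rng.normal(size=(patch_size * patch_size * 3, dim)) * 0.02
cls_token = np.zeros(dim)                                       # learnable in practice
pos_embed = rng.normal(size=(num_patches + 1, dim)) * 0.02      # learnable in practice

tokens = embed_patches(image, patch_size, proj, cls_token, pos_embed)
print(tokens.shape)
```

The resulting token sequence is what the transformer encoder consumes; the output at the [class] position is typically fed to the classification head.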
Figure 4. The taxonomy of transformers in VHR, hyperspectral and SAR imagery with a variety of tasks, such as classification, detection, segmentation, pan sharpening and change detection.
Figure 5. The CTNet architecture comprising two modules: the ViT stream (T-stream) and the CNN stream (C-stream). The T-stream and C-stream are designed to capture semantic features and local structural information, respectively. Figure is from [31]. Best viewed zoomed in.
Figure 6. Comparison in terms of response maps obtained using different models on example VHR images. The original images are shown in (a), whereas the evaluated models are: (b) IMP-ResNet-50, (c) SeCo-ResNet-50, (d) RSP-ResNet-50, (e) IMP-Swin-T, (f) RSP-Swin-T, (g) IMP-ViTAEv2-S and (h) RSP-ViTAEv2-S. Here, IMP denotes ImageNet pre-training and RSP refers to remote sensing pre-training. In the response map, the warmer color indicates a higher response. Figure is from [7].
Figure 7. Overview of the anchor-free Oriented RepPoints detection architecture [46], which learns to select point samples for classification, regression and orientation. RepPoints utilizes the same shared-head structure as in [46], except that a quality assessment and sample assignment strategy (APAA) is employed to select high-quality sample points for training. Figure is adapted with permission from [46]. Best viewed zoomed in.
Figure 8. Visualized results of different CD methods, including FC-EF [62], FC-Siam-Conc [62], FC-Siam-Diff [62], CDNet [63], DASNet [64], STANet [61] and SwinSUNet [56], on sample image sets (a–d), such as the WHU-CD [60] test set. Colors denote the outcome for each pixel: white represents true positive, black represents true negative, red represents false positive and green represents false negative. Figure is from [56].
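The color coding used in Figure 8, and the F1 score reported for change detection in this survey, both derive from the same four pixel-wise outcomes. The following is a minimal NumPy sketch, assuming binary prediction and ground-truth masks; the function names `error_map` and `f1_score` are illustrative and are not taken from [56].

```python
import numpy as np

def error_map(pred, gt):
    """Render a change map: white = TP, black = TN, red = FP, green = FN."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)   # black (TN) by default
    rgb[pred & gt] = (255, 255, 255)                    # true positive
    rgb[pred & ~gt] = (255, 0, 0)                       # false positive
    rgb[~pred & gt] = (0, 255, 0)                       # false negative
    return rgb

def f1_score(pred, gt):
    """F1 of the 'changed' class: 2TP / (2TP + FP + FN)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy 2x2 example with one pixel of each outcome.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
colors = error_map(pred, gt)
score = f1_score(pred, gt)
print(score)  # 0.5
```

Because true negatives do not enter the F1 formula, the metric is not inflated by the large unchanged background that dominates most change detection scenes.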
Figure 9. A qualitative comparison of the hybrid Trans-CNN with other existing segmentation approaches. The examples are from the Potsdam dataset. Every two rows present the results as a group. Here, from left to right and top to bottom are: (a) the corresponding ground-truth, (b) results obtained from AFNet + TTA, (c) results of ResUNet, (d) results of CASIA2, (e) results achieved using Trans-CNN and (f) the RGB image. The incorrect classification results from AFNet + TTA, ResUNet, CASIA2 and Trans-CNN are presented in (g–j), respectively. Figure is from [68].
Figure 10. Overview of the CTN framework [10] for hyperspectral image classification. Given the HSI data patches, CTN passes them through the center position encoding (CPE), convolutional transformer and classification modules. Here, the output represents the category label. Figure is from [10]. Best viewed zoomed in.
Figure 11. A qualitative comparison, in terms of visualization of classification maps between HSI-BERT and several CNN-based methods on the Pavia dataset. Here, (a) CNN, (b) CNN-PPF, (c) CDCNN, (d) DRCNN and (e) HSI-BERT. Figure is from [114].
Figure 12. Overview of the ViT-PolSAR framework [141] for supervised polarimetric SAR image classification. Here, the pixel values of the SAR image patches are considered as tokens and the self-attention mechanism is then utilized to encode long-range dependencies, followed by an MLP. Figure is from [141]. Best viewed zoomed in.
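The "self-attention followed by MLP" pattern mentioned in the Figure 12 caption can be illustrated with a single-head sketch in NumPy. This is not the ViT-PolSAR implementation from [141]: the weights are random, layer normalization and multiple heads are omitted, ReLU stands in for the GELU commonly used in ViT, and all names and sizes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, N) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

def mlp(x, w1, b1, w2, b2):
    """Two-layer feed-forward block (ReLU used here for brevity)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# Hypothetical sizes: 10 tokens of width 8, hidden width 16.
rng = np.random.default_rng(0)
n, d, hidden = 10, 8, 16
tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, d)) * 0.1, np.zeros(d)

attended, weights = self_attention(tokens, w_q, w_k, w_v)
x = tokens + attended                 # residual connection around attention
out = x + mlp(x, w1, b1, w2, b2)      # residual connection around the MLP
print(out.shape, weights.shape)
```

Every token attends to every other token through the (N, N) weight matrix, which is how long-range dependencies across the image are encoded in a single layer.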
Figure 13. A visual comparison in terms of supervised classification of the entire map on the ALOS2 San Francisco dataset. Here, (a–h) show the results obtained from Wishart, RBF-SVM, CV-CNN, 3D-CNN, PSENet, SF-CNN and ViT-PolSAR, respectively. Figure is from [141].
Table 1. Performance, in terms of classification accuracy, of different transformer-based methods on the popular AID dataset with a 20:80 train-test ratio.

| Method | Venue | Backbone | AID (20%) |
| --- | --- | --- | --- |
| V16-21K [4] | Remote Sensing | ViT | 94.97 |
| CTNet [31] | GRSL | ResNet34 + ViT | 96.35 |
| TRS [32] | Remote Sensing | TRS | 95.54 |
| TSTNet [5] | Remote Sensing | Swin-T | 97.20 |
| RSP [7] | TGRS | RSP-Swin-T-E300 | 96.83 |
Table 2. Comparison in terms of detection accuracy (mAP) of different detectors utilizing a hybrid CNN-transformer design, a transformer pre-trained backbone or a DETR-based transformer architecture on the DOTA benchmark. The results are presented for the oriented bounding-box task of the DOTA benchmark.

| Method | Venue | Backbone | DOTA |
| --- | --- | --- | --- |
| ADT-Det [41] | Remote Sensing | ResNet50 | 76.89 |
| RBox [42] | CVPR | ResNet50 | 79.59 |
| Rodformer [43] | Sensors | ResNet50 | 63.89 |
| Rodformer [43] | Sensors | ViT-B4 | 75.60 |
| PointRCNN [44] | Remote Sensing | Swin-T | 80.14 |
| Hybrid Network [45] | Remote Sensing | TransC-T | 78.41 |
| Oriented RepPoints [46] | arXiv | ResNet50 | 75.97 |
| Oriented RepPoints [46] | arXiv | Swin-T | 77.63 |
| O2DETR [47] | arXiv | ResNet50 | 79.66 |
| AO2-DETR [48] | arXiv | ResNet50 | 79.22 |
Table 3. Comparison, in terms of F1 score, of different transformer-based change detection methods on two popular benchmarks: WHU and LEVIR.

| Method | Venue | WHU | LEVIR |
| --- | --- | --- | --- |
| CD-Trans [54] | TGRS | 83.98 | 89.31 |
| MSPSNet [55] | TGRS | - | 89.18 |
| UVACD [57] | Remote Sensing | 92.84 | 91.30 |
| SwinSUNet [56] | TGRS | 93.8 | - |
| TransUNetCD [58] | TGRS | 93.59 | 91.1 |
| HybridTransCD [59] | IJGI | - | 90.06 |
Table 4. Performance comparison, in terms of overall accuracy (OA), of different transformer-based semantic segmentation methods on two popular benchmarks: Potsdam and Vaihingen.

| Method | Venue | Potsdam | Vaihingen |
| --- | --- | --- | --- |
| Efficient-T [65] | Remote Sensing | 90.08 | 88.41 |
| STransFuse [67] | JSTAR | 86.71 | 86.07 |
| Trans-CNN [68] | TGRS | 91.0 | 90.40 |
| SwinTF [69] | Remote Sensing | - | 90.97 |
Table 5. Overview of transformer-based approaches in VHR remote sensing imaging. Here, we highlight transformer-based methods for different VHR remote sensing tasks.

Transformers in Very-High Resolution (VHR) Satellite Imagery

| Method | Task | Datasets | Metrics | Highlights |
| --- | --- | --- | --- | --- |
| V16-21K [4] | Classification | Merced [76], AID [35], Optimal31 [77], NWPU [78] | Overall classification accuracy | Explores vision transformers along with a combination of data augmentation techniques for boosting accuracy. |
| TRS [32] | Classification | Merced [76], AID [35], Optimal31 [77], NWPU [78] | Overall classification accuracy | Integrates transformers into CNNs by replacing the last three ResNet bottlenecks with encoders having a multi-head self-attention bottleneck. |
| TSTNet [5] | Classification | Merced [76], AID [35], NWPU [78] | Overall classification accuracy | A Swin transformer-based two-stream architecture that uses both deep features from the image and edge features from the edge stream. |
| CTNet [31] | Classification | AID [35], NWPU [78] | Overall classification accuracy | Comprises a ViT stream that mines semantic features and a CNN stream that captures local structural features. |
| HHTL [6] | Classification | Merced [76], AID [35], RSSDIVCS [79], NWPU [78] | Overall classification accuracy | Explores integrating heterogeneous non-overlapping patches and homogeneous patches obtained using superpixel segmentation. |
| RSP [7] | Classification, Segmentation, Detection | MillionAID [33], Potsdam [70], iSAID [80], HRSC2016 [81], DOTA [49], CCD [82], LEVIR [61] | Overall classification accuracy, mAP, F1 score | Investigates pre-training transformers on a large-scale remote sensing dataset. |
| SAIEC [37] | Detection, Segmentation | DIOR [83], HRRSD [84], NWPU VHR-10 [85] | mAP | Introduces a local perception Swin transformer backbone that aims to combine the merits of transformers and CNNs for improving the local perception capabilities. |
| T-TRD-DA [39] | Detection | DIOR [83], NWPU VHR-10 [85] | mAP | Proposes a transformer-based detector utilizing a pre-trained CNN for feature extraction and multiple-layer transformers for multi-scale feature aggregation at global spatial positions. |
| GANsformer [40] | Detection | DIOR [83], NWPU VHR-10 [85] | mAP | Introduces an efficient transformer, with reduced parameters, as a branch network to capture global features along with a generative model to expand the input image ahead of the backbone. |
| ADT-Det [41] | Detection | DIOR [83], HRSC2016 [81] | mAP | Introduces a RetinaNet-based framework with a feature pyramid transformer integrated between the backbone and post-processing network for generating multi-scale semantic features. |
| PointRCNN [44] | Detection | DOTA [49], HRSC2016 [81] | mAP | Introduces a two-stage angle-free detection framework, which is also evaluated using the transformer-based Swin backbone. |
| HybridNetwork22 [45] | Detection | DOTA [49], UCAS-AOD [86], VEDAI [87] | mAP | Integrates multi-scale global and local information from transformers and CNNs through an adaptive feature fusion network. |
| Oriented RepPoints [46] | Detection | DOTA [49], UCAS-AOD [86], HRSC2016 [81] | mAP | Proposes an anchor-free detector and learns flexible adaptive points as representations through a quality assessment and sample assignment scheme. |
| O2DETR [47] | Detection | DOTA [49], SKU110K-R [88], HRSC2016 [81] | mAP | Extends the standard DETR for oriented detection by introducing an encoder employing depthwise separable convolution. |
| AO2DETR [48] | Detection | DOTA [49] | mAP | Introduces a DETR-based detector with an oriented proposal generation scheme, a refine module to compute rotation-invariant features and a rotation-aware matching loss for performing the matching process for direct set predictions. |
| RBox [42] | Detection | SynthText [89], ICDAR 2015 (IC15) [90], MLT-2017 (MLT17) [91], MSRA-TD500 [92], MTWI [93], Total-Text [94], CTW1500 [95] | mAP | Proposes a framework employing transformers to model the relationship of sampled features for better grouping and box prediction without requiring a post-processing operation. |
| Rodformer [43] | Detection | DOTA [49] | mAP | A hybrid detection architecture integrating the local characteristics of depth-separable convolutions with the global characteristics of MLP. |
| CD-Trans [54] | Change Detection | WHU [60], LEVIR [61], DSIFN [96] | F1 score | Introduces a bi-temporal image transformer designed to model the spatio-temporal contextual information. The encoder captures context in token-based space-time, which is then fed to a decoder where feature refinement is performed in the pixel-space. |
Table 6. Overview of transformer-based approaches in VHR remote sensing imaging. Here, we highlight transformer-based methods for different VHR remote sensing tasks.

Transformers in Very-High Resolution (VHR) Satellite Imagery

| Method | Task | Datasets | Metrics | Highlights |
| --- | --- | --- | --- | --- |
| MSPSNet [55] | Change Detection | SYSU-CD [103], LEVIR [61] | F1 score | Introduces a multi-scale Siamese framework employing a parallel convolutional structure for feature integration of different temporal images and self-attention for feature refinement. |
| SwinSUNet [56] | Change Detection | CCD [82], WHU [60], OSCD [104], HRSCD [105] | F1 score | Introduces a Swin transformer-based network with a Siamese U-shaped structure having encoder, fusion and decoder modules. |
| TransUNetCD [58] | Change Detection | WHU [60], LEVIR [61], CCD [82], DSIFN [96], OSCD [104], S2Looking [106] | F1 score | Introduces a framework integrating the merits of transformers and UNet by capturing enriched contextualized features, which are upsampled and fused with multi-scale features to generate global-local features. |
| Hybrid-TransCD [59] | Change Detection | LEVIR [61], SYSU-CD [103] | F1 score | Introduces a multi-scale transformer that encodes both fine-grained and large object features through heterogeneous tokens via multiple receptive fields. |
| CCTNet [66] | Segmentation | Barley Remote Sensing Dataset [107] | F1 score, overall accuracy | Proposes a hybrid CNN-transformer framework to combine local details and global contextual information for crop segmentation. |
| STransFuse [67] | Segmentation | Potsdam [70], Vaihingen [71] | F1 score, overall accuracy | Introduces a framework that encodes both coarse-grained and fine-grained features at multiple scales, which are fused using a self-attentive mechanism. |
| Trans-CNN [68] | Segmentation | Potsdam [70], Vaihingen [71] | F1 score, overall accuracy | Introduces a framework with a Swin transformer backbone to capture long-range dependencies and a U-shaped decoder with depth-wise separable convolution to encode local details. |
| SwinTF [69] | Segmentation | Vaihingen [71], Thailand North Landsat-8 corpus (private), Thailand Isan Landsat-8 corpus (private) | F1 score, overall accuracy | Introduces a framework with a pre-trained Swin backbone along with a U-Net, a feature pyramid network and a pyramid scene parsing network for segmentation. |
| Efficient-T [65] | Segmentation | Potsdam [70], Vaihingen [71] | F1 score, overall accuracy | Proposes a light-weight framework consisting of an implicit edge enhancement scheme along with a Swin transformer. |
| STT [72] | Building Extraction | WHU [60], INRIA [108] | IoU, overall accuracy, F1 score | Introduces a transformer framework to learn long-range dependencies in both the spatial and channel directions. |
| STEB-UNet [73] | Building Extraction | WHU [60], Massachusetts [108] | IoU, F1 score | Introduces a transformer framework capturing semantic information from multi-scale features, which are further fused with local features. |
| BuildFormer [74] | Building Extraction | WHU [60], Massachusetts [108], INRIA [108] | IoU, F1 score | Introduces an architecture consisting of a window-based linear attention and a convolutional MLP. |
| T-Trans [75] | Building Extraction | Massachusetts [108], INRIA [108] | IoU, F1 score | Explores the generalizability of building extraction models to different areas and introduces a transfer learning method to fine-tune models from one area to a subset of another unseen area. |
| TRL [97] | Image Captioning | RSICD [109], UCM-captions [110], Sydney-Caption [111] | BLEU, ROUGE, METEOR and CIDEr | Proposes an approach adapting transformers by integrating residual connections, dropout and adaptive feature fusion for remote sensing image caption generation. |
| MLAT [98] | Image Captioning | RSICD [109], UCM-captions [110], Sydney-Caption [111] | BLEU, ROUGE, METEOR and CIDEr | Introduces an architecture where multi-scale features from CNN layers are extracted in the encoder and a multi-layer aggregated transformer in the decoder uses those features for sentence generation. |
| Ren et al. [99] | Image Captioning | RSICD [109], UCM-captions [110], Sydney-Caption [111] | BLEU, ROUGE, METEOR and CIDEr | Proposes a topic token-based mask transformer with the topic token being integrated into the encoder while serving as a prior in the decoder for capturing global semantic relationships. |
| TR-MISR [102] | Image Super Resolution | RSICD [109], UCM-captions [110], PROBA-V [112] | cPSNR, cSSIM | Introduces a transformer-based architecture with an encoder having residual blocks, a fusion module and a super-pixel convolution-based decoder for multi-image super-resolution. |
| MSE-Net [100] | Image Super Resolution | UCMerced [113], AID [35] | cPSNR, cSSIM | Proposes a multi-stage enhancement framework that utilizes features from different stages and further integrates them with a standard super-resolution technique to combine multi-resolution low- and high-dimension feature representations. |
| SRT [101] | Image Super Resolution | UCMerced [113] | cPSNR, cSSIM | Introduces a hybrid framework that integrates local features from CNNs and global features from transformers. |
Table 7. Comparison in terms of overall accuracy (OA) of some representative CNN-based methods with pure transformer and hybrid CNN-transformer-based hyperspectral image classification methods on two popular benchmarks: Indian Pines and Pavia. Here, the results are reported using 200 training samples for each category.

| Method | Venue | Type | Indian Pines | Pavia |
| --- | --- | --- | --- | --- |
| CNN [119] | Sensors | CNNs | 87.01 | 92.27 |
| CNN-PPF [120] | TGRS | CNNs | 93.90 | 96.48 |
| HSI-BERT [114] | TGRS | Pure | 99.56 | 99.75 |
| DSS-TRM [9] | EJRS | Pure | 99.43 | 98.50 |
| CTN [10] | GRSL | Hybrid | 99.11 | 97.48 |
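The classification tables in this survey report overall accuracy (OA), and several methods additionally report average accuracy (AA) and the kappa coefficient. As a reminder of how these relate, the sketch below computes all three from a confusion matrix using the standard definitions. The function names are illustrative, the toy labels are invented for the example, and every class is assumed to occur at least once in the ground truth.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of all samples that are classified correctly."""
    return np.trace(cm) / cm.sum()

def average_accuracy(cm):
    """Mean of per-class accuracies (assumes every class has samples)."""
    return np.mean(np.diag(cm) / cm.sum(axis=1))

def kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred, num_classes=2)
print(overall_accuracy(cm), average_accuracy(cm), kappa(cm))  # 0.75 0.75 0.5
```

OA can be dominated by large classes, which is why AA and kappa are often reported alongside it on imbalanced land-cover benchmarks.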
Table 8. Overview of transformer-based approaches in hyperspectral and multispectral imaging. Here, we highlight methods for different hyperspectral remote sensing tasks.

Transformers in Hyperspectral Imagery

| Method | Task | Datasets | Metrics | Highlights |
| --- | --- | --- | --- | --- |
| SpectralFormer [8] | Classification | Indian Pines [127], Pavia University [128], Houston 2013 [129] | Overall classification accuracy, kappa | Introduces a transformer-based backbone to capture spectrally local information from nearby hyperspectral bands by generating group-wise spectral embeddings. |
| MCT [12] | Classification | Salinas [130], Yellow River Estuary | Overall classification accuracy, kappa | Proposes a multi-scale convolutional transformer to encode spatial-spectral information that is integrated with a transformer network. |
| MFT [117] | Classification | University of Houston [129], Trento, MUUFL Gulfport [131], Augsburg scenes | Overall classification accuracy, kappa | Proposes a multi-modal transformer that derives class tokens from multi-modal data along with the standard hyperspectral patch tokens. |
| CTN [10] | Classification | Indian Pines [127], Pavia University [128] | Overall classification accuracy, kappa | Introduces a convolutional transformer network with dedicated blocks that integrate local and global features from hyperspectral image patches. |
| DHViT [118] | Classification | Trento, Houston 2013 [129], Houston 2018 [132] | Overall classification accuracy, kappa | Introduces an approach comprising a spectral sequence transformer to encode features along the spectral dimension and a spatial hierarchical transformer to produce hierarchical spatial features for hyperspectral and LiDAR data. |
| DSS-TRM [9] | Classification | Pavia University [128], Salinas [130], Indian Pines [127] | Overall classification accuracy, kappa | Introduces a transformer-based approach consisting of spectral self-attention and spatial self-attention to capture interactions along the spectral and spatial dimensions, respectively. |
| HiT [11] | Classification | Indian Pines [127], Pavia University [128], Houston2013 [129], Xiongan | Overall classification accuracy, kappa | Proposes a hyperspectral image transformer consisting of a 3D convolution projection module to encode local spatial-spectral details and a conv-permutator module to capture the information along the height, width and spectral dimensions. |
| HSI-BERT [114] | Classification | Indian Pines [127], Pavia University [128], Salinas [130] | Overall classification accuracy | Proposes a transformer-based method that captures global dependencies using a bidirectional encoder representation. |
| SSFTT [116] | Classification | Indian Pines [127], Pavia University [128], Houston 2013 [129] | Overall classification accuracy, kappa | Proposes a spectral–spatial feature tokenization transformer that utilizes both spectral-spatial shallow and semantic features for representation and learning. |
| SSTN [115] | Classification | Pavia University [128], Kennedy Space Center, Indian Pines [127], University of Houston [129], Pavia Center [133] | Overall classification accuracy, kappa | Introduces a spectral–spatial transformer with a spatial attention and a spectral association module. The two modules perform spectral and spatial association through the integration of spectral and spatial locations, respectively. |
| CTIN [134] | Pan-Sharpening | WorldView II [135], WorldView III [136], GaoFen-2 | IQA, ERGAS, PSNR, SAM | A transformer-based approach is introduced, where multi-spectral and panchromatic features are captured for joint feature learning across modalities. Further, an invertible neural module performs feature fusion to generate pansharpened images. |
| HyperTransformer [126] | Pan-Sharpening | Pavia Center [133], Botswana [137], Chikusei [138] | Cross-correlation (CC), spectral angle mapping (SAM), RSNR, ERGAS, PSNR | Introduces a transformer-based framework with separate feature extractors for panchromatic and hyperspectral images and a spectral-spatial fusion module to learn cross-feature space dependencies of features. |
| PMACNet [123] | Pan-Sharpening | WorldView II [135], WorldView III [136] | Spatial correlation coefficient (SCC), spectral angle mapper (SAM) | Introduces a framework with a parallel CNN structure to learn ROIs from the low-resolution image and residuals from the high-resolution image. It also contains a pixel-wise attention module to adapt residuals on the learned ROIs. |
| CPT-noRef [122] | Pan-Sharpening | Gaofen-1, WorldView II [135], Pleiades [139] | IQA, ERGAS, SAM, correlation coefficient (CC) | A CNN-transformer framework where global features are generated using transformers and local features are constructed using a shallow CNN. The features are combined and a loss formulation having spatial and spectral losses is utilized for training. |
| MSIT [121] | Pan-Sharpening | GeoEye-1, QuickBird [140] | ERGAS, SAM, Q4 | Introduces a multi-scale spatial–spectral interaction transformer with a convolution-transformer encoder for generating multi-scale global and local features from both low-resolution and panchromatic images. |
| Su et al. [124] | Pan-Sharpening | WorldView II [135], QuickBird [140], GaoFen-2 | Spatial correlation coefficient (SCC), ERGAS, RMSE, SAM, Q4 | A transformer-based approach with spatial and spectral feature extraction performed using a Swin model. |
Table 9. Overview of transformer-based approaches in SAR imaging. Here, we highlight methods for different SAR remote sensing tasks.
Table 9. Overview of transformer-based approaches in SAR imaging. Here, we highlight methods for different SAR remote sensing tasks.
Transformers in Hyperspectral Imagery
MethodTaskDatasetsMetricsHighlights
ViT-PolSAR [141]ClassificationAIRSAR Flevoland [152],
ESAR Oberpfaffenhofen [153],
AIRSAR San Francisco [154],
ALOS2 San Francisco [155]
AA,
OA,
kappa
Explores transformers, where self-attention is used to capture long-range dependencies followed by MLP for polarimetric SAR image classification.
GLNS [142]ClassificationGaofen-3 SAR [156],
F-SAR [157]
AA,
OA,
kappa
Introduces a global–local network structure to exploit the merits of CNNs and transformers with local and global features that are fused to perform classification.
ST-PN [143]ClassificationMSTAR [158]AccuracyProposes a spatial transformer network for spatial alignment of features extracted from CNNs for few-shot SAR classification.
GCBANet [144]SegmentationSSDD [159],
HRSID [160]
APIntroduces a transformer-based approach with a global contextual block for capturing spatial holistic long-range dependencies and a boundary-aware prediction scheme for estimating the boundaries of ship.
CRTransSar [145]DetectionSMCDD [145],
SSDD [159]
Accuracy,
recall,
mAP,
F1
Proposes a backbone based on convolutional and attention blocks for capturing both local and global features.
Geospatial Transformers [146] | Detection | Gaofen-3 [156] | DR, FAR | Introduces a framework with multi-scale geo-spatial attention for aircraft detection in SAR imaging.
SFRE-Net [147] | Detection | Gaofen-3 [156] | Precision, recall, F1 | Introduces a feature relation enhancement architecture consisting of a fusion pyramid structure and a context attention enhancement technique.
3DET-ViT [148] | Detection | L1B SAR [161] | AP, AR, mean offset | Proposes a transformer-based framework that takes incidence angle as a prior token with a feature description operator employing scattering centers for prediction refinement.
ID-ViT [149] | Despeckling | Berkeley Segmentation Dataset [162] | PSNR, SSIM | Proposes a framework comprising an encoder to learn global dependencies among SAR image regions, where the network is trained using synthetic speckled data.
CLT [150] | Change Detection | Brazil and Namibia datasets [163], simulation data [150] | KC | Introduces a self-supervised contrastive representation learning method with a convolution-enhanced transformer to generate hierarchical representations for distinguishing changes from HR SAR images.
CF-ViT [151] | Image Registration | MegaDepth [164] | KC | A CNN-transformer framework that first performs coarse registration on the down-sampled image and then registers image pairs via a CNN-transformer module; the resulting point-pair subsets are integrated to obtain the final global registration.
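The classification rows above report overall accuracy (OA), average accuracy (AA), and the kappa coefficient. As a hedged illustration of how these three scores relate, the sketch below derives them from a confusion matrix using their standard definitions. The function name `classification_scores` and the nested-list matrix layout are assumptions for this example and are not taken from any surveyed method.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    Assumed layout: confusion[i][j] is the number of samples whose true
    class is i and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    # Overall accuracy: fraction of all samples that are correctly labeled.
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total

    # Average accuracy: mean of per-class recall, skipping empty classes.
    recalls = [confusion[i][i] / row_sums[i]
               for i in range(n_classes) if row_sums[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(row_sums[i] * col_sums[i]
                   for i in range(n_classes)) / (total ** 2)
    kappa = (oa - expected) / (1.0 - expected) if expected != 1.0 else 1.0
    return oa, aa, kappa
```

OA can be dominated by large classes, which is why AA and kappa are usually reported alongside it for imbalanced land-cover datasets.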
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. https://doi.org/10.3390/rs15071860