Effect of Attention Mechanism in Deep Learning-Based Remote Sensing Image Processing: A Systematic Literature Review

Abstract: Machine learning, particularly deep learning (DL), has become a central and state-of-the-art method for several computer vision applications and remote sensing (RS) image processing. Researchers are continually trying to improve the performance of DL methods by developing new architectural designs of the networks and/or new techniques, such as attention mechanisms. Since the attention mechanism was proposed, regardless of its type, it has been increasingly used for diverse RS applications to improve the performance of existing DL methods. However, these methods are scattered over different studies, impeding the selection and application of feasible approaches. This study provides an overview of the developed attention mechanisms and how to integrate them with different deep learning neural network architectures. In addition, it aims to investigate the effect of the attention mechanism on deep learning-based RS image processing. We identified and analyzed the advances in the corresponding attention mechanism-based deep learning (At-DL) methods. A systematic literature review was performed to identify the trends in publications, publishers, improved DL methods, data types used, attention types used, and overall accuracies achieved using At-DL methods, and to extract the current research directions, weaknesses, and open problems in order to provide insights and recommendations for future studies. For this, five main research questions were formulated to extract the required data and information from the literature. Furthermore, we categorized the papers regarding the addressed RS image processing tasks (e.g., image classification, object detection, and change detection) and discussed the results within each group. In total, 270 papers were retrieved, of which 176 papers were selected according to the defined exclusion criteria for further analysis and detailed review.
The results reveal that most of the papers reported an increase in overall accuracy when using the attention mechanism within the DL methods for image classiﬁcation, image segmentation, change detection, and object detection using remote sensing images.


Introduction
Remotely sensed images have been employed as the main data sources in many fields, such as agriculture [1][2][3][4], urban planning [5][6][7], and disaster risk management [8][9][10], and have proven an effective and critical tool to provide information. Accordingly, processing remote sensing (RS) images is crucial to extract useful information from them for such applications. RS image processing tasks include image classification, object detection, change detection, and image fusion [11]. Different processing methods were developed to address these tasks, with the aim of improving the performance and accuracy of RS image processing. Machine learning methods, such as support vector machines and ensemble classifiers (e.g., random forest and gradient boosting), have been widely applied for this purpose. The human visual system concentrates on the relevant parts of a scene and suppresses the irrelevant ones, giving them lower weights. This allows the brain to process and focus on the most important parts precisely and efficiently, rather than processing the entire view space. This characteristic of human vision inspired researchers to develop the attention mechanism. It was initially developed in 2014 for natural language processing applications [20]; since then, it has been widely used for different applications [30], in particular, computer vision tasks [21,31]. Its potential to enhance mostly CNN-based methods has been reported [32]. In addition, it has been used in conjunction with recurrent neural network models [33][34][35][36] and graph neural networks [37,38]. The main idea behind the attention mechanism is to give different weights to different information; giving higher weights to relevant information attracts the attention of the DL model to it [39]. Attention mechanism approaches can be grouped based on four criteria (Figure 1) [21]: (i) The softness of attention: the initial attention mechanism proposed by [20] is a soft version, which is also known as deterministic attention.
This network considers all input elements (computing an average for each weight) to compute the final context vector. The context vector is a high-dimensional vector representation of the input elements or sequences of input elements; in general, the attention mechanism aims to add more contextual information when computing the final context vector. In contrast, hard attention, which is also known as stochastic attention, randomly selects from the sample elements to compute the final context vector [40], which reduces the computational time. Furthermore, there is another categorization that is frequently used in computer vision tasks and RS image processing, i.e., global and local attention [41,42]. Global attention is similar to soft attention, since it also considers all input elements; however, it simplifies soft attention by using the output of the current time step rather than the prior one. Local attention is a combination of soft and hard attention: it considers a subset of input elements at a time, and thus overcomes the limitation of hard attention (i.e., being nondifferentiable) while being less computationally expensive. (ii) Forms of input features: attention mechanisms can be grouped based on their input requirements into item-wise and location-wise. Item-wise attention requires inputs that are known to the model explicitly or produced with a preprocess [43][44][45]. However, location-wise attention does not necessarily require known inputs; in this case, the model needs to deal with input items that are difficult to distinguish. Due to the characteristics and features of RS images and the targeted tasks, location-wise attention is commonly used for RS image processing [42,[46][47][48]. (iii) Input representations: there are single-input and multi-input attention models [49,50]. In addition, the general processing procedure of the inputs also varies between the developed models.
Most of the current attention networks work with a single input, and the model processes it in two independent sequences (i.e., a distinctive model). The co-attention model is a multi-input attention network that implements the attention mechanism on two different sources in parallel and finally merges them [50]. This makes it suitable for change detection from RS images [51].
A self-attention network computes attention only based on the model inputs, and thus decreases the dependence on external information [52][53][54]. This allows the model to perform better on images with complex backgrounds by focusing more on targeted areas [55]. The hierarchical attention mechanism computes weights from the original input and different levels/scales of the inputs [56]. This attention mechanism is also known as fine-grained attention for image classification [57]. (iv) Output representations: single-output is the commonly used output representation in attention mechanisms. It processes a single feature at a time and computes weight scores. There are also two other representations: multidimensional and multi-head attention mechanisms [21]. Multi-head attention processes the inputs linearly in multiple subsets and finally merges them to compute the final attention weights [58]; it is especially useful when employing the attention mechanism in conjunction with CNN methods [59][60][61]. Multidimensional attention, which is mostly employed for natural language processing, computes weights based on a matrix representation of the features instead of vectors [62,63].
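The core of the soft attention described above (all input elements contribute to a weighted-average context vector) can be expressed compactly. The following is a minimal NumPy sketch, not the formulation of any specific reviewed method; the variable names and shapes are illustrative.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Soft (deterministic) attention: every input element contributes to
    the final context vector, weighted by its similarity to the query."""
    scores = keys @ query                            # one score per input element
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
    context = weights @ values                       # weighted average of inputs
    # Hard (stochastic) attention would instead sample a single element:
    #   idx = rng.choice(len(weights), p=weights); context = values[idx]
    return context, weights
```

Because the softmax weights are differentiable in the query and keys, soft attention trains end-to-end with gradient descent, which is the limitation hard attention's sampling step avoids only at the cost of nondifferentiability.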
The above-explained attention mechanisms are the same in principle and were developed by researchers to adapt or improve the basic attention mechanism for their tasks. In addition, not all of them have been used for computer vision, and thus, RS image processing. In DL-based image processing, this mechanism is usually used to focus on specific features (feature layers) or a certain location or aspect of an image [64][65][66][67]. Accordingly, it can be classified into two major types: channel and spatial attention. Figure 2 illustrates simple channel and spatial attention types: (a) The channel attention network aims to boost the feature layers (channels) in the feature map that convey more important information and silence the other feature layers (channels); (b) the spatial attention network highlights regions of interest in the feature space and covers up the background regions. These two attention mechanisms can be used solely or combined within DL methods to provide attention to both important feature layers and the location of the region of interest. Papers in this review were classified according to these two types.
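The two attention types can be sketched in a few lines of NumPy. This is a hedged illustration, not any paper's exact module: `w1` and `w2` stand in for learned bottleneck weights, and the spatial branch averages the pooled maps where real modules usually learn a small convolution over them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(fmap, w1, w2):
    """Channel attention on a feature map of shape (C, H, W): global average
    pooling squeezes each channel to one value, a two-layer bottleneck
    (hypothetical weights w1, w2) yields one weight per channel, boosting the
    feature layers that carry more information and silencing the rest."""
    squeeze = fmap.mean(axis=(1, 2))          # (C,) one summary value per channel
    hidden = np.maximum(0.0, w1 @ squeeze)    # ReLU bottleneck
    weights = sigmoid(w2 @ hidden)            # (C,) per-channel weights in (0, 1)
    return fmap * weights[:, None, None]

def spatial_attention(fmap):
    """Spatial attention: channel-wise mean and max maps are combined into a
    single (H, W) mask that highlights regions of interest and covers up the
    background. A plain average replaces the usual learned convolution."""
    pooled = 0.5 * (fmap.mean(axis=0) + fmap.max(axis=0))  # (H, W)
    mask = sigmoid(pooled)
    return fmap * mask[None, :, :]
```

Since both branches only rescale the input by factors in (0, 1), they can be inserted into a network without changing feature-map shapes, which is why they combine so easily with existing DL architectures.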

Deep Neural Network Architectures with Attention for RS Image Processing
In this section, we describe and provide examples of the four different deep neural network architectures (i.e., CNN, GAN, RNN, and GNN) that are improved using the attention mechanism to address RS image processing. CNN is the main method that has been used for image processing in general, as well as for RS applications. Both spatial and channel attention are embedded in CNNs with different attention network designs. For CNNs, channel attention is typically implemented after each convolution, whereas spatial attention is mostly added to the end of the network [68][69][70][71]. However, in UNet-based networks, spatial attention is usually added to each layer of the decoding/upsampling section [72][73][74]. Figure 3 shows an example of using spatial and channel attention, in particular a co-attention network, in a Siamese model for building-based change detection [51]. The proposed co-attention network is based on an initial correlation process with a final attention module. For GAN networks, which are based on encoding and decoding modules, the process of adding attention networks is the same as for CNNs; attention can be used in the adversarial and/or discrimination networks depending on the targeted tasks [75] (Figure 4).

Figure 3. An example of adding an attention network (i.e., co-attention) to a CNN module (i.e., Siamese network) for building-based change detection [51]. CoA: co-attention module; At: attention network; CR: change residual module.
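The correlation idea behind such a co-attention module can be sketched as follows. This is a minimal illustration of cross-attending two feature maps of a Siamese pair, not the exact module of [51]; shapes and names are assumptions.

```python
import numpy as np

def co_attention(f1, f2):
    """Co-attention sketch for a Siamese pair: the features of each image
    date are re-expressed through their correlation with the other date.
    f1: (N, F) and f2: (M, F) are flattened feature maps of the two dates."""
    corr = f1 @ f2.T                                      # (N, M) affinity matrix
    a12 = np.exp(corr - corr.max(axis=1, keepdims=True))
    a12 = a12 / a12.sum(axis=1, keepdims=True)            # rows: attention over date 2
    a21 = np.exp(corr.T - corr.T.max(axis=1, keepdims=True))
    a21 = a21 / a21.sum(axis=1, keepdims=True)            # rows: attention over date 1
    return a12 @ f2, a21 @ f1                             # cross-attended features
```

Comparing each date's original features against its cross-attended counterpart is one way such a model can localize change: positions whose features are poorly explained by the other date stand out.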
RNN is the first deep learning network that was improved by the attention mechanism [20], for natural language processing tasks. RNNs are not as popular as CNNs for image processing due to the inherent characteristics of images. However, RNN has been frequently used in conjunction with CNN for RS image processing [34,[76][77][78]. This also allows the integration of the attention mechanism with RNN for RS applications. For example, Ref. [79] developed a bidirectional RNN module to provide channel attention and add the resulting weights to a CNN-based module, which is supported with a spatial attention network, for hyperspectral image classification (Figure 5).

Figure 4. An example of adding spatial and channel attention to a GAN module for building detection from aerial images [75]. A: max pooling layer; B: convolutional + batch normalization + rectified linear unit (ReLU) layers; C: upsampling layer; D: concatenation operation; SA: spatial attention mechanism; CA: channel attention mechanism; RS: reshape operation.
Figure 5. An example of adding attention networks (i.e., spatial and channel attention) to an RNN + CNN module for hyperspectral image classification [79]. PCA: principal component analysis.
GNN is another network architecture that has been employed in conjunction with CNN for RS image processing. Here, the attention mechanism is used to focus on the most important graph nodes of the network. A typical integration of GNN with CNN is to implement a GNN after a CNN-based image segmentation to produce the final RS image classification results [80,81]. Accordingly, the attention network adjusts the weight for each graph node through the graph convolutional layers (Figure 6) [82].
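Weighting graph nodes through attention can be sketched in the style of graph attention networks. This is a generic illustration under assumed shapes, not the layer of [82]; the scoring vector `a` stands in for a learned parameter.

```python
import numpy as np

def graph_attention_layer(h, adj, a):
    """Graph-attention sketch: each node aggregates neighbour features with
    weights derived from pairwise attention scores, so the most important
    graph nodes dominate the aggregation.
    h: (N, F) node features, adj: (N, N) 0/1 adjacency (with self-loops),
    a: (2F,) hypothetical learned scoring vector."""
    f = h.shape[1]
    # e_ij = LeakyReLU(a . [h_i || h_j]) splits into a left and a right part
    e = (h @ a[:f])[:, None] + (h @ a[f:])[None, :]   # (N, N) raw scores
    e = np.where(e > 0, e, 0.2 * e)                   # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                    # attend only along edges
    e = e - e.max(axis=1, keepdims=True)              # numerical stability
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    return alpha @ h                                  # (N, F) weighted aggregation
```

Masking the scores with the adjacency matrix before the softmax is what restricts attention to the graph structure, e.g., to neighbouring segments produced by a CNN-based segmentation.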

Methodology
We followed the guidelines provided by Kitchenham et al. [83] to systematically review the literature and report the results. Accordingly, we developed a review protocol at the start of the study and before conducting the review to reduce biases. As the first step of the developed protocol, a set of research questions was defined (Section 4.1) according to the objective of this review study (i.e., reviewing and investigating attention-based deep learning methods for remote-sensing image-processing applications). Thereafter, the search strategy, including search databases, strings, and a time period, was formulated to automatically find the relevant publications (Section 4.2). The final set of papers for the systematic review was selected by manually screening the papers according to the predefined exclusion criteria (Section 4.3). Then, a data extraction strategy (Section 4.4) and a form (Appendix A, Table A1) were developed to extract the required information from the papers. The extracted data and information were synthesized, and the associated results are presented and discussed to answer the research questions.

Research Questions
A total of five main research questions (RQs) were defined to address the objective of this study. The RQs were specifically selected to extract state-of-the-art and interesting aspects of the developed DL methods with attention mechanism applied to RS image processing, including the effect of such mechanisms in their performance. The review and further structured analysis were built on these RQs.

RQ1. What are the specific objectives in remote sensing image processing that were addressed with attention-based deep learning?

Search Strategy
Two main attributes are usually employed to define the search scope of a systematic literature review: publication date and platform. We executed the search with no limit on the publication date on the well-known and widely accepted platforms, i.e., ISI Web of Knowledge and Scopus. We formulated the following search string and executed it automatically on the search engines of the selected publication platforms to search in the title, abstract, and keywords of the papers.
Search string: (("attention mechanism" OR "attention guid*" OR "attention embed*" OR "attention contain*" OR "attention based" OR "with attention" OR "attention aid*" OR "attention net*" OR "attentive") AND ("remote sensing" OR "satellite image*" OR "UAV image*" OR "hyperspectral image*" OR "aerial image*" OR "SAR") AND ("CNN" OR "deep learning"))

The defined search query consists of three main parts separated by the term "AND". The first part aims to find the publications that used attention mechanisms (e.g., attentive). The second part aims to find the relevant publications concerning the remote sensing images used (e.g., satellite images), and the third part aims to find the papers that used deep learning methods (e.g., CNN).
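The three-part AND logic of the query can be mirrored in a short screening function. This is a simplified sketch, not the actual database query: the keyword lists are collapsed (the real query uses exact phrases and wildcards such as "attention guid*"), and the function name is illustrative.

```python
def matches_query(text):
    """Screening sketch: a record passes only if it matches all three AND
    parts of the search string (attention, remote sensing source, and deep
    learning), checked against lowercased title/abstract/keyword text."""
    t = text.lower()
    has_attention = any(k in t for k in ("attention", "attentive"))
    has_rs = any(k in t for k in ("remote sensing", "satellite image",
                                  "uav image", "hyperspectral image",
                                  "aerial image", "sar"))
    has_dl = any(k in t for k in ("cnn", "deep learning"))
    return has_attention and has_rs and has_dl
```

For example, a title such as "An attention-based CNN for satellite image classification" satisfies all three parts, whereas a deep learning paper with no attention term would be filtered out.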

Study Selection Criteria
After automated extraction of the publications from the selected platforms using the defined search query, we manually filtered the papers to select the final list of the most suitable ones. For this, we screened the publications, mainly by reading their abstract and introduction sections, based on a set of exclusion criteria (Table 1) that were defined according to the objectives of this review.

Data Extraction
To properly answer the defined research questions, we first needed to extract the necessary data and information from the retrieved papers. For this, a data extraction form was designed and created (Appendix A, Table A1). This form consists of a set of attributes to extract general information from the papers (e.g., publication year and publisher), as well as detailed information, including the study target of the papers, the developed DL methods, the attention mechanism type used, and the accuracy rates of the employed/developed DL methods with and without the attention mechanism. Here, we used only the papers that performed this comparison or that compared their At-DL results with state-of-the-art DL methods in which no attention mechanism was used. In addition, only the overall accuracy metric was used to compare the papers, since this was the only performance measurement used in most of the papers. The general data were extracted during the initial screening of the papers, while the more detailed data were extracted by carefully reading and reviewing the papers.

Data Synthesis
The data synthesis step answers the research questions by synthesizing the extracted data and presenting the results; thus, it is the most important step of a systematic literature review. In this step, the papers were grouped based on the extracted data to answer the corresponding research questions, and accordingly, the results were summarized and visualized. Detailed discussions of the presented results are provided to elicit and highlight the important points for each research question. Furthermore, the main findings, such as current research directions, achievements in the use of the attention mechanism to increase the performance of DL methods for RS image processing applications, open problems, and recommendations for future studies, are provided.

Results and Discussion
A final number of 176 papers were selected for the detailed review. The main statistics and an overview of the papers are provided in the following subsection. In addition, the detailed results are presented and corresponding discussions are provided for each research question in the next subsections.

Overview of the Reviewed Papers
At-DL methods entered RS image processing in 2018, although the attention mechanism was developed in 2014 [20]. However, only since 2020 have most studies (i.e., 141 papers) employed this technique for different RS image processing applications, which reveals a significant interest in the technique in recent years (Figure 7). In 2021 alone, 47 papers were published, noting that the searches of the online databases were conducted in March 2021.


Figure 7. Year-wise classification of the papers, grouped by the attention mechanism type used.

Table 2 shows the journal names with at least two papers; the rest, with only one paper each, are aggregated in the "other" category. The papers are published in 30 different journals, which shows the usefulness of At-DL for a wide range of RS image processing applications, from water management [84,85] to urban studies [86]. The most popular journal is "Remote Sensing" with 44 papers, and the second is "IEEE Transactions on Geoscience and Remote Sensing" with 33 papers (Table 2). Furthermore, 17 journals have only one paper (the "other" category in Table 2). These statistics show that most of the papers are published in technical RS journals rather than subject-specific journals. The papers are grouped with regard to their study target, similar to the classes used in [11]: image classification, image segmentation, image fusion, object detection, change detection, and other (Figure 8).
(i) Image classification: refers to labeling a group of pixels (objects or patches) in RS images using training samples (e.g., land cover and land use classification). This is one of the most frequently performed RS image processing tasks in various application domains, as the starting point of the process [87][88][89]. Image classification is also called scene classification [88] or land cover and land use classification [90] in the literature, depending on the aim and the data used in the studies. About half of the At-DL papers addressed image classification tasks for images acquired from different sensors, such as multispectral satellite [67,91,92], hyperspectral [71,93], and unmanned aerial vehicle (UAV) [34,94] images. The large number of freely available benchmark data sets and organized competitions in this regard attracts researchers to develop DL methods in this subject area. (ii) Object detection: refers to the detection of different objects in an image. It is the second most popular task addressed using At-DL, including general object/target detection from RS images [46,60,95] or detection of specific objects and features such as buildings [74,96], ships [97,98], landslides [99], clouds [53,100], airports [101], roads [72], and trees [102].
(iii) Image segmentation: also known as semantic segmentation, refers to labeling each pixel in the image, usually using end-to-end At-DL methods. Of the At-DL papers, 17 addressed image segmentation [103][104][105]. (iv) Image fusion: is mostly known as a fundamental preprocessing step in the RS field and aims to produce higher spectral and spatial resolutions. Two main image fusion tasks were addressed using At-DL in 13 papers. One is pan-sharpening, which aims to fuse a coarse-resolution multispectral image with a corresponding high-resolution panchromatic image to produce a high-resolution multispectral image [106][107][108]. The other is image super-resolution, which refers to enhancing the resolution of the original image using At-DL methods [106,107,109]. (v) Change detection: refers to detecting and quantifying the changes in multi-temporal RS images. This is one of the more challenging tasks and, with the increasing amount of multi-temporal RS imagery, has become more popular. At-DL was used in 7 papers to detect changes in general [110,111], in buildings [51], or in other objects [81,112]. (vi) Other tasks, such as image dehazing [113], digital elevation model (DEM) void filling [114], and SAR image despeckling [115], were addressed with At-DL in 9 papers.

RQ2. What Are the Deep Learning Algorithms That Are Improved with Attention Mechanism for Remote Sensing Image Processing?

Figure 9 shows the number of papers that employed the attention mechanism for each DL algorithm. Accordingly, the convolutional neural network (CNN) is the predominant DL method enhanced with an attention mechanism to address RS image processing, applied in 154 out of the 176 reviewed papers [69,[116][117][118][119][120]. This is an expected result, since CNN is the most frequently used DL method in general computer vision and image processing. Recurrent neural networks (RNN), such as long short-term memory (LSTM) methods, were the second most frequently used DL method supported by the attention mechanism for RS image processing, with 18 papers [121][122][123]; this algorithm is also the first DL method that was improved with the attention mechanism [20]. In addition, it was observed that most of the RNN methods were used in combination with CNN methods [76,78,124]. Generative adversarial networks (GAN) [53,125,126], graph neural networks (GNN) [80,82], and other DL methods, including capsule networks [72] and autoencoders [61], were used in 12, 5, and 4 papers, respectively.

RQ3. Which Types of Attention Mechanisms Were Used in Deep Learning Methods for Remote Sensing Image Processing?
At-DL methods can be classified based on the attention types used (i.e., channel and spatial attention networks), as explained in Section 2 (Figure 10). The combined use of channel and spatial attention mechanisms was the most frequent choice in the papers [59,127,128]. In addition, the channel type, which is mostly used in hyperspectral image processing [129–131], and the spatial type [47,132,133] were used on their own in 41 and 33 papers, respectively. The attention type can be selected depending on the aim of the study; however, because both the features/channels and the spatial locations of objects are important in RS images, the combined type was the predominant choice in the papers.
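The difference between the two attention types can be illustrated with a deliberately simplified, dependency-free sketch. Note the learned layers of real modules (e.g., the MLP in SE or the convolution in CBAM) are replaced here by plain averages and a sigmoid gate, so this is illustrative only, not any reviewed paper's method:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """Scale each channel of a (C, H, W) feature map by a gate derived
    from its global average (learned MLP omitted for brevity)."""
    gates = [sigmoid(sum(sum(row) for row in ch) / (len(ch) * len(ch[0])))
             for ch in fmap]
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(fmap, gates)]

def spatial_attention(fmap):
    """Scale each spatial position by a gate derived from the
    cross-channel average at that position (conv layer omitted)."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    gate = [[sigmoid(sum(fmap[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[fmap[k][i][j] * gate[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]

# Combined (CBAM-style) application: channel attention, then spatial.
fmap = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0: strong activations
        [[0.0, 0.0], [0.0, 0.0]]]   # channel 1: silent channel
out = spatial_attention(channel_attention(fmap))
```

The combined form reweights "what" (channels) first and "where" (pixels) second, which mirrors why papers dealing with both spectral bands and object locations favored the combined type.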

RQ4. What Are the Data Sets/Types Used in Attention-Based Deep Learning Methods for Remote Sensing Image Processing?
Multispectral satellite images are the most popular images processed with At-DL methods (81 papers) [91,92,134] (Figure 11). This is mostly due to the free availability of some multispectral satellite images and their wide range of applications. Aerial images [54,135,136], hyperspectral images [137–139], and SAR images [97,140,141] were also processed with At-DL methods in 55, 43, and 24 papers, respectively. However, UAV images were used in only three papers [34,94,142]. This is a surprisingly low number, given that the attention mechanism could significantly increase the performance of DL methods on very high-resolution UAV imagery.

Figure 9. The improved DL algorithms with attention mechanism in the papers.

Figure 11. The data sets used in the papers.
The processed RS images were also grouped based on their spatial resolution (Figure 12). High- and medium-resolution images were the main processed RS images, in 157 and 58 papers, respectively. Low-resolution images (with a spatial resolution over 30 m) were used in only four papers.

RQ5. What Are the Effects of the Attention Mechanism on the Performance of the Deep Learning Methods in Remote Sensing Image Processing?
We investigated the performance of the attention mechanism in DL methods for RS image processing in two ways: (i) by extracting the overall accuracies of the At-DL methods used for RS image processing tasks (Figure 13), and (ii) by comparing the overall accuracies of the results produced with and without the attention mechanism in the papers (Figure 14).

Figure 13 illustrates a box plot of the overall accuracies of the produced results in the papers for change detection, image classification, image segmentation, object detection, and other tasks. Image classification and change detection had the highest median accuracies (~97%). One reason is the availability of benchmark datasets for these applications, which encourages researchers to test their proposed methods on such datasets and tasks. Image classification is, moreover, one of the fundamental applications in RS image processing and serves as a basis for other fields, including agriculture and natural hazards; the already high accuracy levels are therefore a good sign for the use of At-DL. Change detection has become important in different fields with the increasing availability of multi-temporal RS images [143–145]. Although the results revealed a high performance of At-DL in change detection, seven papers are not enough to conclude that At-DL generally produces accuracy rates above 95%, and more work is needed here. Image segmentation and object detection had median accuracies of about 91%, roughly 5% less than the first two tasks. In addition, other tasks, such as digital elevation model (DEM) void filling, had median accuracies below 90%. Providing benchmark RS images and training samples for applications such as object detection would help to attract researchers and to develop more advanced methods. However, most of the At-DL methods used in image classification can be adopted for other tasks, including object detection.

Figure 14 shows a box plot of the effect of the attention mechanism on the overall accuracies of the produced results for change detection, image classification, image segmentation, object detection, and other tasks. Most of the papers reported an increase when using the attention mechanism within the DL methods. Only one paper stated that using the attention mechanism did not positively impact the performance of the DL method [146]. The median increase rate for all classes was less than 5%; nevertheless, this is a remarkable enhancement, given that the overall accuracy rates for most classes were already above 90%. The highest median increase (~5%) belonged to the object detection class. One reason is that object detection methods inherently need to localize objects, and the attention mechanism, in particular the spatial type, pursues the same aim by focusing on the spatial locations of the important features. Image classification, image segmentation, and change detection had almost the same increase rates (~3–4%), while the "other" class had the lowest increase (~1%).

Figure 14. The effect of the use of the attention mechanism within the DL algorithms in terms of accuracy rate for different tasks in the papers.
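The overall accuracy metric underlying these comparisons is simply the fraction of correctly classified samples, i.e., the trace of the confusion matrix over its grand total. A minimal sketch, with a hypothetical 3-class confusion matrix for illustration:

```python
def overall_accuracy(confusion):
    """Overall accuracy = correctly classified samples / all samples,
    computed as the confusion-matrix trace over its grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical confusion matrix (rows: reference, cols: predicted).
cm = [[50, 2, 1],
      [3, 45, 2],
      [0, 4, 43]]
oa = overall_accuracy(cm)  # (50 + 45 + 43) / 150
```

Because overall accuracy ignores per-class balance, the per-task comparisons in Figures 13 and 14 should be read as aggregate tendencies rather than class-level guarantees.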


Threats to Validity of This Review
Every systematic literature review may be biased due to limitations such as publication bias, data extraction, and classification. The main threats to the validity of our review are discussed as follows. Construct validity: This study aimed to examine the effect of the attention mechanism on deep learning algorithms for RS image processing through a review of the existing literature that used At-DL methods for RS image processing, and accordingly to provide insights and recommendations for future studies. We employed automated search queries applied to the ISI Web of Knowledge [147] and Scopus websites. Using these databases as the only sources of publications may therefore have led to missing other relevant publications, which are not included in this study. However, this study aimed to provide an overview of high-quality publications, and indexing in ISI and Scopus is an accepted and widely used way to identify such papers. In addition, there might be missing search terms that affect the final results. However, we tried to keep the search broad (the initial number of papers was 270) and revised the search query several times to reduce such impacts on our results.
Internal validity: In a systematic literature review, systematic errors may occur in the data extraction phase and lead to an incomplete relationship between the extracted data and findings. In the current study, we precisely defined the research questions to investigate and extract all the required data and necessary information from At-DL studies. Hence, the findings of this study are properly explained and linked to the extracted and presented results.
External validity: This study reviewed the publications that employed At-DL methods for RS image processing applications. However, not all existing DL methods have been improved with an attention mechanism or used for RS image processing, and not all possible RS image processing applications have been addressed with At-DL; such methods and applications are therefore not included or discussed in this study. In addition, because we only reviewed publications that used At-DL for RS image processing applications, we cannot make judgments about the use and the effect of At-DL in a broader scope or in other applications.
Conclusion validity: We conducted the review based on the accepted structure and protocol for systematic literature review studies [83]. In addition, the steps of the structured review process are comprehensively explained in Section 4 of the paper, and the search string used, the data extraction form (Appendix A), and the list of extracted papers (as supplementary materials) are provided in the paper. Therefore, the results of this study are reproducible using the given information.

Conclusions
This study reviewed the remote sensing (RS) literature that used attention mechanism-based deep learning (At-DL) methods for processing RS imagery. We investigated the advances in the use of At-DL methods, as well as the effect of the attention mechanism, considering its different types, on the performance of DL methods in RS image processing. Accordingly, the current research directions and challenges were presented, and insights and recommendations for future studies were provided. Using a systematic literature review, a strategy not yet common in RS review papers, enabled a comprehensive review, precise answers to the predefined research questions, and a direct contribution to the objective of this study. The results clearly demonstrate the positive impact of the attention mechanism on the performance of DL methods in RS image processing; it is therefore one of the powerful approaches that can be used to improve DL methods for such applications. In addition, the review results show an increasing trend in the use of At-DL methods in RS image processing. However, while image classification attracted most of the attention, other RS image processing tasks, such as object detection and change detection, still need more studies to fully understand the effect of the attention mechanism on the performance of DL methods. There are even important tasks that have not yet been addressed using this mechanism, including object-oriented image analysis. The results also revealed that CNN methods are the algorithms most frequently improved by the attention mechanism, which is largely due to their general usefulness and popularity across computer vision tasks. However, generative adversarial networks (GANs) combined with attention mechanisms, such as StarGAN [148] and AttentionGAN [149], have recently become state-of-the-art methods in different computer vision tasks.
Hence, they can be adopted for RS image processing applications in future studies. Moreover, we investigated the performance of the At-DL methods based on the overall accuracy metric, which is widely used for RS applications and provided in the papers. However, the accuracy of DL methods depends on the dataset used and the targeted task. In addition, the performance of At-DL methods should be studied using other important metrics (e.g., computational time).

Acknowledgments:
We thank the anonymous reviewers for their insights and constructive comments, which helped to improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Funding:
This research received no external funding.

Data Availability Statement:
The full list of the reviewed publications is provided in a Supplementary File.

Appendix A. Data extraction form
13. Overall accuracy (%): The overall accuracy of the produced results using the At-DL method.
14. Effect of attention mechanism (%): The increased rate of the overall accuracy when using the attention mechanism.
15. Additional notes: E.g., the opinions of the reviewer about the study.