1. Introduction
Remote sensing is a significant component of Earth observation since it can detect and identify scenes based on their physical characteristics. Remote-sensing detectors measure the radiation reflected and emitted by objects without direct contact. Polarimetric synthetic aperture radar (PolSAR), an efficient microwave detector, has attracted great attention [1,2,3]. PolSAR employs different polarimetric channels to obtain the polarimetric scattering characteristics of objects, which facilitates subsequent information extraction for geoscience applications. In particular, it provides more structural information than single-polarized SAR systems. Moreover, PolSAR can operate at any time and in any weather, so it has been applied successfully in environmental monitoring [4], resource management [5], urban planning [6], the military [7], and so on. Among these applications, classification is a critical and difficult task that entails categorizing polarimetric scattering points into predefined categories according to their scattering properties [8].
In recent years, copious methods for PolSAR image classification have been developed. Scattering-mechanism-based methods are frequently used; they employ scattering information and the imaging mechanism to increase classification accuracy [9,10]. These techniques extract different scattering characteristics from the coherency or covariance matrix of PolSAR images, such as Krogager decomposition [11], Huynen decomposition [12], Cameron decomposition [13], Freeman decomposition [14], H/alpha decomposition [15], Pauli decomposition [16], and so on. Methods based on scattering mechanisms are straightforward and efficient. Statistical-distribution-based methods have attracted considerable interest in the past few years. These approaches use different distributions to represent PolSAR data. Lee et al. [17] utilized the H/alpha decomposition and the complex Wishart classifier for unsupervised PolSAR classification. Liu et al. [18] used a Wishart deep belief network (W-DBN) together with local spatial information. Jiao and Liu [19] combined the Wishart distribution with a deep stacking network (W-DSN) for PolSAR classification. Xie et al. [20] proposed a PolSAR classification approach based on the complex Wishart distribution and a convolutional autoencoder. Recently, with the advancement of machine learning, classifiers such as support vector machines (SVMs) [21], decision trees (DTs) [22], and K-nearest neighbors (KNN) [23] have been used for PolSAR classification and achieve superior performance compared with target decomposition approaches. However, these approaches require manual feature extraction and use only pixel-based polarimetric features. Deep-learning-based algorithms have been developed in several fields, including natural language processing [24] and computer vision [25], because of their exceptional performance. The deep structure enables a model to learn discriminative, invariant, and high-dimensional features autonomously. Several deep-learning frameworks have been introduced into PolSAR classification; for example, deep transfer learning [26], deep reinforcement learning [27], the sparse autoencoder [28], the convolutional neural network (CNN) [29], and the long short-term memory (LSTM) network [30] are frequently used. Among these methods, CNN-based approaches are the most commonly employed and have achieved tremendous success.
Due to their advantages in local contextual perception and parameter-sharing feature transformation, CNN-based approaches have gained popularity. Nevertheless, within the local learning framework, these approaches often involve repetitive calculations [31] for PolSAR image classification. The framework consists of two parts: generating overlapping image patches and assigning labels to the corresponding central pixels. Patches generated from adjacent pixels overlap with each other, resulting in redundant computation. For this reason, approaches under the patch-based framework have difficulty running quickly. Moreover, the finite patch size confines the network to local features, so it is hard for CNN-based methods to build long-range dependencies. Additionally, traditional CNN-based methods usually adopt a fixed kernel size for feature extraction, which restricts their capability for contextual information extraction [32]. In general, a fixed kernel size cannot capture fine-grained and coarse-grained terrain structures simultaneously, which degrades PolSAR classification performance. Recent research indicates that varying spatial kernel sizes are helpful for classification. Unfortunately, it is not easy to choose the weights of the various kernel sizes.
In recent years, attention-based methods have also been applied to PolSAR image classification. The purpose of the attention mechanism is to allow a neural network to focus on specific parts of its input rather than considering the entire input equally. Existing attention-based methods for PolSAR image classification generally rely on two kinds of attention: channel-driven and spatial-driven. Dong et al. [33] proposed an attention-based polarimetric feature selection module for a CNN, which captures the relationships between input polarimetric features and ensures the validity of high-dimensional data classification. Hua et al. [34] introduced a feature selection method based on spatial attention to enhance the relationships between pixels' spatial information. Yang et al. [35] introduced a convolutional block attention module to achieve better classification performance and accelerate network convergence. Ren et al. [36] proposed a residual attention module to enhance discriminative features at multiple resolutions. Existing attention-based methods focus on channel and spatial features, but the multi-scale features obtained with different receptive fields are often ignored.
In light of the challenges mentioned above, we propose a hybrid attention-based encoder–decoder fully convolutional network (EDNet) called HA-EDNet for PolSAR classification. In HA-EDNet, EDNet serves as the patch-free backbone network, accepting input of arbitrary size without any pretreatment. Similar to the human visual system, the attention mechanism is widely used in computer vision because it suppresses irrelevant features and enhances important ones. Self-attention is one of the most popular attention mechanisms and has become the dominant paradigm in NLP [37]. In this paper, self-attention is designed to build long-range dependencies between pixels of a PolSAR image. Moreover, the attention-based selective kernel module (SK module) [38] replaces traditional convolution operations; it adjusts kernel sizes automatically for different terrain sizes. Compared with conventional CNN models, the HA-EDNet framework avoids repeated calculation and observes objects from multi-scale and long-distance perspectives.
The main contributions are summarized as follows:
- (1) An end-to-end encoder–decoder fully convolutional network called EDNet is proposed to classify PolSAR images. The approach follows a patch-free architecture and accepts arbitrary-size input images, and the output is the classification result for the whole image.
- (2) A self-attention module is embedded into EDNet for global information extraction, where long-distance dependencies are modeled. Moreover, the self-attention module makes the classification results more refined and discriminative.
- (3) To further boost performance, the SK module is used to extract multi-scale features, where different kernel sizes are fused by softmax attention. In this module, more discriminative features are extracted for better PolSAR classification.
- (4) Four widely known datasets are employed to test the effectiveness of the proposed approach. The experimental results show that the approach achieves better visual performance and classification accuracy than state-of-the-art methods.
The remainder of this paper is structured as follows. In Section 2, related work on fully convolutional networks, attention mechanisms, and self-attention is reviewed. In Section 3, we formulate the proposed method in detail. Section 4 presents experimental results and discussion on four widely used PolSAR images. Finally, Section 5 gives the conclusion and future work.
3. Proposed Method
In this section, HA-EDNet is discussed in depth. First, the representation of PolSAR data is shown. Second, the SK module is introduced. Third, the self-attention module is presented. Finally, the structure of the proposed network is described.
3.1. Representation of PolSAR Data
To identify the scattering characteristics of targets, the scattering matrix is employed. The scattering matrix represents the horizontal and vertical polarization states of the transmitted and received signals:

$$\mathbf{S} = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix}$$

Complex-valued scattering coefficients are denoted by $S_{HH}$, $S_{HV}$, $S_{VH}$, and $S_{VV}$ in this formula, where $S_{HV}$ represents horizontal transmitting and vertical receiving. The other coefficients are defined similarly.

The scattering features of PolSAR are usually represented by a statistical coherency matrix due to speckle noise. Under the reciprocity condition ($S_{HV} = S_{VH}$), every pixel's coherency matrix is expressed as a complex-valued matrix:

$$\mathbf{T} = \mathbf{k}\mathbf{k}^{H} = \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ T_{21} & T_{22} & T_{23} \\ T_{31} & T_{32} & T_{33} \end{bmatrix}$$

where $\mathbf{k} = \frac{1}{\sqrt{2}}\left[ S_{HH}+S_{VV},\ S_{HH}-S_{VV},\ 2S_{HV} \right]^{T}$ is the Pauli scattering vector, and the superscript $H$ denotes the conjugate transpose. Because the coherency matrix $\mathbf{T}$ is a Hermitian matrix, it is equivalent to its conjugate transpose. As a result, the polarimetric characteristics are represented by a 9-dimensional real vector, denoted as:

$$\mathbf{x} = \left[ T_{11},\ T_{22},\ T_{33},\ \mathrm{Re}(T_{12}),\ \mathrm{Im}(T_{12}),\ \mathrm{Re}(T_{13}),\ \mathrm{Im}(T_{13}),\ \mathrm{Re}(T_{23}),\ \mathrm{Im}(T_{23}) \right]$$

where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ represent the real and imaginary components of a complex value, respectively. For subsequent processing, every pixel of a PolSAR image is represented by this 9-dimensional real vector.
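The mapping above can be sketched in a few lines of NumPy; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def coherency_features(s_hh, s_hv, s_vv):
    """Map one pixel's scattering coefficients to the 9-dim real vector.

    Assumes reciprocity (S_HV == S_VH). Names are illustrative only.
    """
    # Pauli scattering vector k = (1/sqrt(2)) [S_HH+S_VV, S_HH-S_VV, 2*S_HV]^T
    k = np.array([s_hh + s_vv, s_hh - s_vv, 2.0 * s_hv]) / np.sqrt(2.0)
    T = np.outer(k, k.conj())        # coherency matrix T = k k^H (Hermitian)
    # 3 real diagonal entries plus real/imag parts of the upper triangle
    return np.array([T[0, 0].real, T[1, 1].real, T[2, 2].real,
                     T[0, 1].real, T[0, 1].imag,
                     T[0, 2].real, T[0, 2].imag,
                     T[1, 2].real, T[1, 2].imag])
```

For a surface-like scatterer with $S_{HH} = S_{VV}$ and $S_{HV} = 0$, only the first element of the vector is nonzero, as expected from the Pauli basis.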
3.2. Attention-Based Selective Kernel Module
In PolSAR classification, approaches based on CNNs have produced satisfactory results. However, these methods use a fixed kernel size for feature extraction, which limits their capability for multi-scale feature extraction. Therefore, it is necessary to automatically alter the kernel sizes of the network to improve the efficiency of PolSAR classification, which can be achieved by the SK module [44].

Figure 1 shows the illustration of the SK module, which consists of three operations: split, fusion, and selection. The module takes feature maps $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$ as input and produces output feature maps $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$. Therefore, the module can be presented as:

$$\mathbf{V} = F_{SK}(\mathbf{X}; \Theta)$$

where $\Theta$ denotes the parameters in the module, and $H$, $W$, and $C$ are the height, width, and channel of the feature maps. In the split operation, two transformations, $F_{3\times 3}$ and $F_{5\times 5}$, with kernel sizes of $3 \times 3$ and $5 \times 5$, respectively, are used for illustration. The two output feature maps $\widetilde{\mathbf{U}}$ and $\widehat{\mathbf{U}}$ can be formulated as:

$$\widetilde{\mathbf{U}} = \mathbf{W}_{3\times 3} * \mathbf{X} + \mathbf{b}_{3\times 3}, \qquad \widehat{\mathbf{U}} = \mathbf{W}_{5\times 5} * \mathbf{X} + \mathbf{b}_{5\times 5}$$

where $\mathbf{W}$ and $\mathbf{b}$ are the convolutional kernels and biases, respectively. Different kernel sizes are employed to extract multi-scale information. Moreover, batch normalization (BN) and an activation function are included in both transformations.
In the SK module, the fusion operation aims to allow neurons to learn multi-scale features by jointly and automatically adjusting the kernel sizes. Firstly, an element-wise summation combines the feature maps from the split operation. The output $\mathbf{U}$ is given by:

$$\mathbf{U} = \widetilde{\mathbf{U}} \oplus \widehat{\mathbf{U}}$$

where $\oplus$ represents the element-wise addition operation. Then, a squeeze-and-excitation block (SE block) is utilized to extract the global information via global average pooling [45]. The output $\mathbf{s} \in \mathbb{R}^{C}$ contains channel-wise statistics and is calculated by reducing $\mathbf{U}$ through its spatial dimensions using global average pooling (GAP):

$$s_{c} = F_{GAP}(\mathbf{U}_{c}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{U}_{c}(i, j)$$

where $s_{c}$ denotes the $c$th element of the output $\mathbf{s}$, and $\mathbf{U}_{c}$ is the $c$th channel of the fusion feature map $\mathbf{U}$. Then, a fully connected layer with a ReLU function is used to generate a compact feature that can guide precise and adaptive selection. The compact feature $\mathbf{z} \in \mathbb{R}^{d}$ can be formulated as:

$$\mathbf{z} = \mathrm{ReLU}(\mathbf{W}_{z}\mathbf{s})$$

where $\mathbf{W}_{z} \in \mathbb{R}^{d \times C}$ is a weight matrix, $d = C/r$, and $r$ denotes a reduction ratio. In this paper, the reduction ratio $r$ is fixed at 16.
In the selection operation, selective kernel attention across channels is utilized to adaptively select different scales of features. The compact feature $\mathbf{z}$ is used to compute the selective kernel attention vectors $\mathbf{a}$ and $\mathbf{b}$ through a fully connected layer and the softmax function:

$$a_{c} = \frac{e^{\mathbf{A}_{c}\mathbf{z}}}{e^{\mathbf{A}_{c}\mathbf{z}} + e^{\mathbf{B}_{c}\mathbf{z}}}, \qquad b_{c} = \frac{e^{\mathbf{B}_{c}\mathbf{z}}}{e^{\mathbf{A}_{c}\mathbf{z}} + e^{\mathbf{B}_{c}\mathbf{z}}}$$

where $a_{c}$ and $b_{c}$ denote the $c$th elements of the attention vectors $\mathbf{a}$ and $\mathbf{b}$. Here, $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{C \times d}$, and $\mathbf{A}_{c}$ and $\mathbf{B}_{c}$ are the $c$th rows of $\mathbf{A}$ and $\mathbf{B}$, respectively. Moreover, the elements $a_{c}$ and $b_{c}$ satisfy:

$$a_{c} + b_{c} = 1$$

Finally, the output feature map $\mathbf{V}$ of the SK module with two kernel sizes is calculated using the attention vectors as follows:

$$\mathbf{V}_{c} = a_{c} \cdot \widetilde{\mathbf{U}}_{c} + b_{c} \cdot \widehat{\mathbf{U}}_{c}$$
Each neuron can modify the size of its receptive field using the SK module depending on the multiple scales of the input features. In the module, softmax attention fuses the different kernel sizes with the information in the corresponding branches. Moreover, the module can capture target objects at multiple scales, which is important for classification.
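The fusion and selection steps can be sketched in NumPy as follows (the split convolutions are omitted, and all parameter names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def sk_select(U1, U2, W_z, W_a, W_b):
    """Fusion and selection steps of the SK module.

    U1, U2: branch feature maps of shape (C, H, W) from the 3x3 and 5x5 paths.
    W_z: (d, C) compact-feature weights; W_a, W_b: (C, d) attention weights.
    """
    U = U1 + U2                                # fusion: element-wise addition
    s = U.mean(axis=(1, 2))                    # global average pooling -> (C,)
    z = np.maximum(W_z @ s, 0.0)               # FC + ReLU compact feature (d,)
    logits = np.stack([W_a @ z, W_b @ z])      # (2, C): one logit per branch/channel
    att = np.exp(logits - logits.max(axis=0))  # softmax over the two branches,
    att /= att.sum(axis=0, keepdims=True)      # so a_c + b_c = 1 per channel
    a, b = att[0][:, None, None], att[1][:, None, None]
    return a * U1 + b * U2                     # channel-wise weighted selection

rng = np.random.default_rng(0)
C, H, W, d = 8, 4, 4, 2                        # toy sizes; here r = C/d = 4
U1, U2 = rng.standard_normal((2, C, H, W))
V = sk_select(U1, U2, rng.standard_normal((d, C)),
              rng.standard_normal((C, d)), rng.standard_normal((C, d)))
```

Because the two branch weights sum to one per channel, each output channel is a convex combination of the 3 × 3 and 5 × 5 branch responses.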
3.3. Self-Attention Module
In order to capture more contextual and spatial information in the learning network, we use a self-attention module to build long-range dependencies and extract complex land-cover areas effectively. The self-attention module is illustrated in Figure 2, where $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$ is the input feature map, and $H$, $W$, and $C$ denote the height, width, and channel, respectively. Then, three convolutions with $1 \times 1$ kernels are employed to transform the input feature map into three diverse embeddings:

$$\mathbf{Q} = \mathbf{W}_{Q} * \mathbf{X}, \qquad \mathbf{K} = \mathbf{W}_{K} * \mathbf{X}, \qquad \mathbf{V} = \mathbf{W}_{V} * \mathbf{X}$$

where $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{H \times W \times C'}$, and $C'$ indicates the channel of the transformed feature maps. Then, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are reshaped to $\mathbb{R}^{N \times C'}$, where $N = H \times W$. To obtain the spatial self-attention map $\mathbf{A} \in \mathbb{R}^{N \times N}$, matrix multiplication is applied between $\mathbf{Q}$ and the transpose of $\mathbf{K}$, followed by the softmax function:

$$\mathbf{A} = \mathrm{softmax}\left( \mathbf{Q}\mathbf{K}^{T} \right)$$

Then, the spatial self-attention map is multiplied by $\mathbf{V}$, and the result $\mathbf{O} = \mathbf{A}\mathbf{V}$ is reshaped to $\mathbb{R}^{H \times W \times C'}$. The final self-attention-enhanced feature map $\mathbf{Y}$ is formulated as:

$$\mathbf{Y} = F_{1\times 1}(\mathbf{O}) + \mathbf{X}$$

where $F_{1\times 1}$ is the nonlinear transformation implemented by a convolutional layer with a $1 \times 1$ kernel. It can be seen from Equation (16) that the attention feature map $\mathbf{Y}$ is the sum of the global feature map and the input feature map. The global feature map contains relationships across all positions in the PolSAR image. This property enables the network to build global spatial dependencies for pixels belonging to the same category. Moreover, the global information contained in the PolSAR image can significantly improve the robustness of deep neural networks when confronted with blur [46].
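Since a 1 × 1 convolution is a per-pixel matrix multiply, the module can be sketched with plain matrix products in NumPy; a minimal sketch with illustrative weight names, not the paper's implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, Wo):
    """Spatial self-attention over a (C, H, W) feature map."""
    C, H, W = X.shape
    F = X.reshape(C, H * W)                      # flatten spatial dims -> (C, N)
    Q, K, V = Wq @ F, Wk @ F, Wv @ F             # three 1x1-conv embeddings (C', N)
    logits = Q.T @ K                             # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax attention map
    O = V @ A.T                                  # aggregate values -> (C', N)
    return X + (Wo @ O).reshape(C, H, W)         # 1x1 output conv + residual

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3, 3))               # toy (C, H, W) input
Wq, Wk, Wv = (rng.standard_normal((2, 4)) for _ in range(3))
Wo = rng.standard_normal((4, 2))
Y = self_attention(X, Wq, Wk, Wv, Wo)
```

Each output position is a weighted sum over all N positions, which is what gives the module its global receptive field regardless of image size.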
3.4. PolSAR Classification with HA-EDNet
Patch-based CNN methods need to divide the input image into overlapping patches, which results in high computational complexity. This paper proposes a patch-free network architecture called HA-EDNet for full image classification. In patch-free networks, the explicit patching is replaced with the implicit receptive field of the model. The patch-free networks can avoid redundant computation on the overlapping areas and obtain a wider latent spatial context. The network accepts arbitrary-size images as input without pretreatment, and the output is the classification results of the whole image. The proposed model is described in the following.
As shown in Figure 3, HA-EDNet comprises two basic subnetworks: (1) the encoder subnetwork and (2) the decoder subnetwork. The coherency matrices of the whole PolSAR image are employed as input. The encoder subnetwork computes hierarchical convolutional feature maps of the input PolSAR image; it starts with a 3 × 3 convolutional layer, a BN layer, and a ReLU function, and the remaining part is composed of three SK modules and three downsampling layers. Here, 2 × 2 average pooling layers are utilized as the downsampling layers. In the top layer of the encoder subnetwork, the self-attention module is used to build long-range dependencies. The decoder subnetwork recovers the spatial dimension of the coarsest convolutional feature map and is composed of three 3 × 3 convolutional layers and three upsampling layers. Here, the upsampling layer is bilinear interpolation with a factor of 2. To effectively combine the spatial detail features in the encoder subnetwork with the semantic features in the decoder subnetwork, the feature maps of every SK module are added to the same-size output of the decoder subnetwork. Finally, the softmax function is utilized for classification.
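The wiring of the two subnetworks can be sketched structurally in NumPy. This is a bookkeeping sketch only: convolutions, SK modules, and self-attention are replaced by the identity, and nearest-neighbour upsampling stands in for bilinear interpolation (both assumptions for brevity):

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array (H, W assumed even)."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour stand-in for bilinear factor-2 upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def ednet_forward(x):
    """Encoder-decoder data flow: three downsamples, three upsamples,
    with encoder feature maps added back at matching sizes.
    H and W must be divisible by 8."""
    skips = []
    for _ in range(3):               # encoder: keep feature map, then pool
        skips.append(x)
        x = avg_pool2(x)
    for skip in reversed(skips):     # decoder: upsample, add encoder map
        x = upsample2(x) + skip
    return x
```

Because each upsampling exactly undoes one pooling, the output has the same spatial size as the input, which is what allows arbitrary-size whole-image classification.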
In the training process, the model is optimized with the cross-entropy loss function, which is written as:

$$L = -\sum_{i=1}^{K} y_{i} \log(p_{i})$$

where $z$ is the predicted result of the network. When $z$ belongs to the $i$th class, $y_{i} = 1$; otherwise, $y_{i} = 0$. $p_{i}$ is the output of the softmax function, which is the probability of belonging to the $i$th category.
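Averaged over the labeled pixels of the image, the loss can be sketched as follows; the mask argument is an assumption (ground-truth maps rarely label every pixel), and the names are illustrative:

```python
import numpy as np

def pixel_cross_entropy(probs, labels, mask):
    """Mean pixel-wise cross-entropy over labeled pixels.

    probs: softmax outputs of shape (K, H, W); labels: integer map (H, W);
    mask: boolean map of pixels that have ground truth.
    """
    K = probs.shape[0]
    p = probs.reshape(K, -1)[:, mask.ravel()]      # keep labeled pixels
    y = labels.ravel()[mask.ravel()]
    # pick each pixel's predicted probability for its true class
    return -np.mean(np.log(p[y, np.arange(y.size)] + 1e-12))
```

For a uniform two-class prediction the loss is ln 2 per pixel, the expected value for an uninformative classifier.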
6. Conclusions
A novel encoder–decoder method based on a hybrid attention mechanism is proposed in this paper to extract multi-scale features and long-range dependencies for PolSAR classification. Inspired by human cognitive attention, the network devotes more focus to the small but important parts of the data; this attention mechanism is incorporated into our model. The method utilizes a patch-free framework, the SK module, and a self-attention module. First, an encoder–decoder network is built for patch-free classification, allowing an entire PolSAR image as input without dividing it into overlapping patches. Then, the SK module is embedded into EDNet to capture multi-scale features by automatically adjusting the kernel size. Finally, self-attention is employed to extract long-range dependencies, which improves classification performance. In the experiments, four PolSAR datasets are employed to test the effectiveness of the HA-EDNet architecture. The experimental results show that the proposed approach achieves superior performance compared with several state-of-the-art approaches.
Other attention mechanisms will be introduced into the model for better feature extraction in future work. Moreover, a more effective patch-free model will also be investigated.