Attentively Learning Edge Distributions for Semantic Segmentation of Remote Sensing Imagery

: Semantic segmentation has been a fundamental task in interpreting remote sensing imagery (RSI) for various downstream applications. Due to the high intra-class variants and inter-class similarities, inﬂexibly transferring natural image-speciﬁc networks to RSI is inadvisable. To enhance the distinguishability of learnt representations, attention modules were developed and applied to RSI, resulting in satisfactory improvements. However, these designs capture contextual information by equally handling all the pixels regardless of whether they around edges. Therefore, blurry boundaries are generated, rising high uncertainties in classifying vast adjacent pixels. Hereby, we propose an edge distribution attention module (EDA) to highlight the edge distributions of leant feature maps in a self-attentive fashion. In this module, we ﬁrst formulate and model column-wise and row-wise edge attention maps based on covariance matrix analysis. Furthermore, a hybrid attention module (HAM) that emphasizes the edge distributions and position-wise dependencies is devised combing with non-local block. Consequently, a conceptually end-to-end neural network, termed as EDENet, is proposed to integrate HAM hierarchically for the detailed strengthening of multi-level representations. EDENet implicitly learns representative and discriminative features, providing available and reasonable cues for dense prediction. The experimental results evaluated on ISPRS Vaihingen, Potsdam and DeepGlobe datasets show the efﬁcacy and superiority to the state-of-the-art methods on overall accuracy (OA) and mean intersection over union (mIoU). In addition, the ablation study further validates the effects of EDA.


Introduction
Semantic segmentation, a fundamental task for interpreting remote sensing imagery (RSI), is currently essential in various fields, such as water resource management [1,2], land cover classification [3][4][5], urban planning [6,7] and precision agriculture [8,9] and so forth. This task strives to produce a raster map that refers to an input image by assigning a categorical label to every pixel [10]. The observed objects and terrain information are easily recognized and analyzed with the labeled raster map, contributing to structuralized and readable knowledge. However, the reliability and availability of the transformed knowledge are tremendously conditioned on the accuracy of semantic segmentation.
Conventional segmentation methods of remote sensing imagery are essentially implemented by statistically analyzing the prior distributions. For example, Arivazhagan et al. [11] extracted and combined wavelet statistical features and co-occurrence features, characterizing the textures at different scales. As a result, the experiments on monochrome images achieved comparable performance. For segmenting plant leaves, Gitelson A. and Merzlyak M. [12] developed a new spectral index, green normalized difference vegetation modules brought astounding improvements by learning contextual information, equally processing edges induce blur around boundaries, which indirectly causes massive misclassified pixels. Hence, it is necessary to incorporate the edge distributions to assist the network in enhancing edge delineation and recognizing various objects. Motivated by the attention mechanism and two-dimensional principal component analysis (2DPCA) diagram [42], we found that injecting edge distributions with a learnable way is available. Therefore, in this study, we firstly formulate and re-define the covariance matrix inspired by 2DPCA. To refine the representations, two perspectives of efforts are devoted. One is learning edge distributions modelled by the re-defined covariance matrix following the inherently spatial structure of encoded feature maps. The other is the employment of the non-local block, a typical self-attention module with high efficiency, to enhance the representations with local and global contextual information. Therefore, the hybrid strategy makes the segmentation network determine the dominating features and filter irrelevant noise. In summary, the contributions are as follows, (1) Inspired by the image covariance analysis of 2DPCA, the covariance matrix (CM) is re-defined with learnt feature maps in the network. Then, the edge distribution attention module (EDA) is devised based on the covariance matrix analysis, modeling the dependencies of edge distributions in a self-attentive way explicitly. Through the column-wise and row-wise edge attention maps, the vertical and horizontal relationships are both quantified and leveraged. Specifically, in EDA, the handcrafted feature is successfully combined with learnt ones. (2) A hybrid attention module (HAM) that emphasizes the edge distributions and position-wise dependencies is devised. Thereby, more complementary edge and contextual information are collected and injected. This module supports independent and flexible embedding by a parallel architecture. (3) A conceptually end-to-end neural network, named edge distribution-enhanced semantic segmentation neural network (EDENet), is proposed. EDENet hierarchically integrates HAM to generate representative and discriminative encoded features, providing available and reasonable cues for dense prediction. (4) Extensive experiments are conducted on three datasets, ISPRS Vaihingen [43] and Potsdam [44] and DeepGlobe [45] benchmarks. In addition, the results indicate that EDENet is superior to other state-of-the-art methods. In addition, the ablation study further tests the efficacy of EDA.
The remainders of this paper are organized as follows: Section 2 introduces the related works, including attention mechanism, 2DPCA and non-local block. Section 3 concretely presents the devised framework and pipeline of sub-modules. Section 4 quantitatively and qualitatively evaluates the proposed method on both aerial and satellite images. Finally, the conclusions are drawn in Section 5.

Attention Mechanism
Attention mechanism (AM) derives from human cognition process, selectively focusing on two main targets: (1) deciding whether the interdependence between the input elements should be considered; (2) quantifying how much attention/weight should be put to these elements. Since being successfully applyied to natural language processing (NLP) [46], sundry visual tasks have reaped many benefits, such as image classification, object detection, scene parsing and semantic segmentation. [47][48][49][50]. The foremost advantage is that AM can be adaptively generalized to all visual tasks by easily embedding to backbones.
Correspondingly, two tactics are adopted to build the attention modules in vision tasks: (1) devising an independent branch in a neural network to highlight the strong-correlated local regions or specific channels' feature maps, filtering the irrelevant information, such as [51,52]; (2) completely modeling the dependencies both in spatial and channel domains, Remote Sens. 2022, 14, 102 5 of 27 in which the attention matrix that explicitly quantifies the correlations is formed to reinforce the semantic knowledge, such as [31,32,53].
In addition, Yang et al. [54] presented a significant analysis and discussion on the one-step attention modules. They revealed that the suboptimal performance is obtained accompanying by incorporated noise in the irrelevant regions or pixels. In an attempt to eliminate the interferences, CCNet [53] repeatedly used attention modules that designed by Yang et al. to promote the feature maps.
Instead of stacking or listing attention modules, the correlation-based self-attention mechanism was proposed, which expresses the powerful capability to capture spatial and channel-wise dependencies simultaneously [31]. In addition, a medical image semantic segmentation network was deeply inspired by this idea [49]. Similarly, non-local neural networks [32], ACFNet [33] and OCRNet [34] were devised with desired improvements on many datasets. For capturing hierarchical features, Li et al. [55] proposed a pyramid attention network (PAN), combining attention module and spatial pyramid to extract precisely dense features. Meanwhile, Zhu et al. [56] have proved that the attention mechanism has great potential in promoting boundary awareness.
To determine the effects of edge information in a learnable style, self-attention heaves into sight. Provided that the feature maps' edge distributions are substantially available and undistorted, we can use covariance matrix analysis to quantify the prior edges in feature maps followed by learning attentive correlations, generating edge distribution attentive maps.

Revisiting 2DPCA
To unravel and learn the edge distribution, it is essential to explain how to represent the edge information in the learnt feature maps. As discussed above, our idea of building edge distribution attention module is inspired by the 2DPCA [42], which provides a simple yet efficient way to describe the prior distributions by covariance matrix analysis [57]. This part is presented to revisit the principal theory of 2DPCA for understanding how covariance matrix analysis regulates the images.
An arbitrary image can be characterized as a high-dimensional matrix, such as a natural RGB image is numerically a three-dimensional matrix. Practically, for the sake of understanding, the explanation starts with a single channel image. Let A denote an input image with single dimension and X ∈ R m×n , where m and n represent row and column respectively. Initially, the vector V p is designed to project X following the linear transformation, where V p is an n-dimensional unitary column vector, Y is projected vector with m dimensions. As can be seen, Y is defined as the projected feature vector of image X. Then, to measure the discriminatory power of V p , the total scatter of projected samples are used and characterized by the trace of covariance matrix. Therefore, where M cov represents the covariance matrix of projected feature vectors, tr(·) is a function to calculate the trace of matrix and J(·) denotes the total scatter. Regarding the physical significance, the purpose is to project all the samples by a particular projection direction. If all the pixels are projected, the total scatter goes maximized. Furthermore, tr(M cov ) is ascertained as follows, where E denotes the unitary matrix and (V p ) T means the transpose of V p . Consequently, the image covariance matrix is formed as where M cov I ∈ R n×n represents the image covariance matrix and is unequivocally nonnegative definite matrix. When extending to more images with the same spatial size, the average of all the training samples can be obtained. Here, the M cov I is re-inferred as where X is the average of all the training samples and N is the number of samples. Alternatively, Equation (2) is transformed to where V p is an n-dimensional unitary column vector. The optimal V p leads to maximized J(V p ). In other words, the eigenvector of M cov I conforms to the largest eigenvalue [53]. A single optimal projection axis is impossible. So, a set of M cov After finding the optimal projection vectors, these vectors are then delivered to extract the features. Given the image sample X, Thus, the feature matrix is produced with In general, 2DPCA creates a projection process to extract the feature matrix in virtue of covariance matrix analysis. This finding makes all the samples enlarged along with the principal directions and shrank along with the non-principal ones.

Non-Local Block
As a typical self-attention block, non-local block (NLB) captures position-wise correlations with comparative lower time and space occupation. Originally, non-local-a classical filtering algorithm-computes a weighted mean of all pixels, allowing distant pixels to contribute to the specific pixel along with patch appearance similarity. Following the idea of non-local means, the non-local block in neural networks is invented.
As detailed in Figure 1, the pipeline of the non-local block is presented. In the beginning, three convolutions are implemented to produce three multifarious feature maps. Then, they are reshaped to the given dimensionality. Next, the flattened features of the first and second branches are used to calculate similarity, generating positional self-attentive maps via matrix multiplication. At last, one more matrix multiplication is applied to the flattened features of the third branch and the self-attentive map to inject the position-wise dependencies to the raw features. Towards the end, a reshape operation followed by sophisticated features retain the same dimensionality as the input.
Formally, the NLB is described as where H W C p F   denotes the position-wise attention enhanced feature maps and is the attention maps that built upon the first two branches' feature maps. Intuitively, the flow path of NLB is clear and concise. On the one hand, the positionwise dependencies are modeled and injected by matrix multiplications. On the other hand, the pipeline shows that NLB is flexible. Because NLB only relies on the input feature maps. As a result, NLB presents its significance in embedding to various networks for different visual tasks. In the rest of this paper, NLB is termed as SAM in the proposed framework.

Overview
In contrast to ResUNet-a [58], SCAttNet [38] is inexpensive in time and space costs. Nonetheless, the results raise slightly. The finding indicates that the attention-based models are available for improving learnt representations of RSI. We conclude that these two studies ignore the prior distribution information of edge/contour, delineating object entities with blurs.
This part below illustrates and explains the overall framework of EDENet. Essentially, as presented in Figure 2, its shape looks like a variant of U-Net, based on encoderdecoder architecture. As for the encoder stage, in this study, many standard backbones are allowable, such as ResNet, MobileNet and DenseNet. In addition, the implementation details will be further discussed in the next section. Correspondingly, the decoder symmetrically recovers the feature maps for dense prediction. Instead of using unitary feature maps that output from the encoder, feature maps with multi-spatial size are concatenated for providing important contextual cues, including semantic, spatial and edge clues. To alleviate the structural loss, the HAMs are hierarchically embedded to boost the relevant feature maps. Noticeably, HAM hybridizes SAM and EDA parallelly, enriching the contextual dependencies and edge information simultaneously. The following is a brief description of HAM and EDA, including topological architecture and formalization. Formally, the NLB is described as where F p ∈ R H×W×C denotes the position-wise attention enhanced feature maps and A p ∈ R N×N is the attention maps that built upon the first two branches' feature maps. Intuitively, the flow path of NLB is clear and concise. On the one hand, the positionwise dependencies are modeled and injected by matrix multiplications. On the other hand, the pipeline shows that NLB is flexible. Because NLB only relies on the input feature maps. As a result, NLB presents its significance in embedding to various networks for different visual tasks. In the rest of this paper, NLB is termed as SAM in the proposed framework.

Overview
In contrast to ResUNet-a [58], SCAttNet [38] is inexpensive in time and space costs. Nonetheless, the results raise slightly. The finding indicates that the attention-based models are available for improving learnt representations of RSI. We conclude that these two studies ignore the prior distribution information of edge/contour, delineating object entities with blurs.
This part below illustrates and explains the overall framework of EDENet. Essentially, as presented in Figure 2, its shape looks like a variant of U-Net, based on encoder-decoder architecture. As for the encoder stage, in this study, many standard backbones are allowable, such as ResNet, MobileNet and DenseNet. In addition, the implementation details will be further discussed in the next section. Correspondingly, the decoder symmetrically recovers the feature maps for dense prediction. Instead of using unitary feature maps that output from the encoder, feature maps with multi-spatial size are concatenated for providing important contextual cues, including semantic, spatial and edge clues. To alleviate the structural loss, the HAMs are hierarchically embedded to boost the relevant feature maps. Noticeably, HAM hybridizes SAM and EDA parallelly, enriching the contextual dependencies and edge information simultaneously. The following is a brief description of HAM and EDA, including topological architecture and formalization.

Edge Distribution Attention Module
As previously stated, edges are of great importance for segmentation. Suppose that the more accurate boundary is delineated. Naturally, the localization of segments/objects is given. Existing networks, such as SegNet, U-Net and DeepLab V3+, have pointed out that both the accuracy of localization and recognition are equally helpful to the performance. However, these approaches rest on the convolutions' self-regulating ability, which is uncontrollable and variable. Especially for RSI, understanding and modeling the edge distributions impacts more on segmentation. As we all know, the objects of RSI are various, diverse and complex. The pixels lie in the central parts of objects always correctly labeled, while the pixels around edges are easily misclassified. The inconsistent, mixing and heterogeneous edge regions cause this, presenting massive error-prone pixels. Although the recently proposed ResUNet-a and SCAttNet are implemented to produce more consistent and smooth boundaries from enlarging the receptive field and paying attention to dependencies, the prior edge distributions have not been closely examined and leveraged.
The objective of designing EDA is to determine structural information in a self-attentive way, offering more comprehensive edge contexture. Visually, learning edges enable the objects with responsibly locating positions. Implicitly, the salient edges proffer discriminative contextual information, especially the object-background heterogeneity. This attribute helps the network conquer the visual ambiguities to the uttermost. With maximizing the certainty of classifying edge-around pixels, the learnt representations of objects get discriminative. Hereafter, the principles and technological flow are introduced.

Re-Defining Covariance Matrix for Feature Matrix
Before explaining the pipeline of EDA, it is necessary to re-define the covariance matrix (CM) for the feature matrix. As opposite to densely distributed central parts of entities, edges/contours reveal sparse characteristic. Originally, PCA takes advantages of CM to de-correlate the data to search an optimal basis, accounting for compactly representing the data. This process is also defined as Eigen value decomposition. Associated with PCA, the global data distribution can be modeled completely. However, feature maps always have hundreds of dimensions, causing high memory and time costs when extracting principal components. Surprisingly, Zhang et al. [55] raised a two-directional 2DPCA and re-

Edge Distribution Attention Module
As previously stated, edges are of great importance for segmentation. Suppose that the more accurate boundary is delineated. Naturally, the localization of segments/objects is given. Existing networks, such as SegNet, U-Net and DeepLab V3+, have pointed out that both the accuracy of localization and recognition are equally helpful to the performance. However, these approaches rest on the convolutions' self-regulating ability, which is uncontrollable and variable. Especially for RSI, understanding and modeling the edge distributions impacts more on segmentation. As we all know, the objects of RSI are various, diverse and complex. The pixels lie in the central parts of objects always correctly labeled, while the pixels around edges are easily misclassified. The inconsistent, mixing and heterogeneous edge regions cause this, presenting massive error-prone pixels. Although the recently proposed ResUNet-a and SCAttNet are implemented to produce more consistent and smooth boundaries from enlarging the receptive field and paying attention to dependencies, the prior edge distributions have not been closely examined and leveraged.
The objective of designing EDA is to determine structural information in a selfattentive way, offering more comprehensive edge contexture. Visually, learning edges enable the objects with responsibly locating positions. Implicitly, the salient edges proffer discriminative contextual information, especially the object-background heterogeneity. This attribute helps the network conquer the visual ambiguities to the uttermost. With maximizing the certainty of classifying edge-around pixels, the learnt representations of objects get discriminative. Hereafter, the principles and technological flow are introduced.

Re-Defining Covariance Matrix for Feature Matrix
Before explaining the pipeline of EDA, it is necessary to re-define the covariance matrix (CM) for the feature matrix. As opposite to densely distributed central parts of entities, edges/contours reveal sparse characteristic. Originally, PCA takes advantages of CM to de-correlate the data to search an optimal basis, accounting for compactly representing the data. This process is also defined as Eigen value decomposition. Associated with PCA, the global data distribution can be modeled completely. However, feature maps always have hundreds of dimensions, causing high memory and time costs when extracting principal components. Surprisingly, Zhang et al. [55] raised a two-directional 2DPCA and reported that calculating the row and column-directional principal representations is sufficient. To Remote Sens. 2022, 14, 102 9 of 27 sum up, CM here is re-defined as a projection matrix to regularize the feature matrix that contains the global distribution information. Formally, where F ∈ R H×W×n denotes the original feature maps, M C ∈ R a×a is the covariance matrix and F ∈ R H×W×n is the projection results by covariance matrix. Commonly, a single channel feature map with size of H × W can be flattened to a vector with length a = H × W. Accordingly, CM is nonnegative definite. Then, the Eigen Decomposition for CM is formulated as where β denotes the defined diagonal matrix of Eigen values, P consists of the orthogonal Eigen vectors. Hence, Essentially the same analysis to 2DPCA, an alternative explanation of this projection process is observed. The selected samples or anchors will be enlarged along with the principal directions and shrank along with the others.
Moreover, the CM can model the data dependencies across different components within the flattened vectors, in which the different physical meanings and units are acceptable.
In conclusion, of the two properties discussed above, we introduce the EDA on this basis of covariance matrix analysis. In EDA, a hand-engineered feature that describes edge distribution, also known as the Canny operation, is combined in a self-attentive way, highlighting and re-weighting edge information in learnt feature maps.

Edge Distribution Attention Module
Apart from the non-local block, the hand-engineered features with expertise and prior knowledge, such as boundary, shape and texture, are also of great significance. As we all know, the distinguishable edge information can provide a more refined localization of objects or regions. Previous studies, such as FCN-based networks, fail to preserve the edge information. Furthermore, the deeper network makes the edge smoothed out. Therefore, injecting the edge features to primarily learnt representations lends strong support to infer the pixel-wise labels.
In general, the EDA continues to use the parallel structure inspired by the non-local block. As presented in Figure 3, initially, the input feature maps F ∈ R H×W×C are convolved to three diverse representations. Filtered by Canny operator, which is explicitly applied as depth-wise convolution on each channel, three bran-new high-dimensional features are obtained. They are written as F c ∈ R H×W×C , F r ∈ R H×W×C and F n ∈ R H×W×C . The top branch aims to extract column-wise dependencies of edge, while the bottom one for rows.
For row-wise, the input feature map F r ∈ R H×W×C is split in channel dimension as As discussed in Sections 2.2 and 3.2.1, we define and formulate edge covariance matrix as follows: where F r = 1 C ∑ C i F i r is an average matrix and F i r ∈ R W×H . The Cov e r is re-defined row-wise edge covariance matrix. The subscript r denotes the row and superscript e means edge respectively. In addition, superscript T is the transpose of matrix. Therefore, the row-wise edge attention map is generated followed by a Softmax layer, where (i, j) represents position in Cov e r . Intuitively, the A e r (i, j) quantifies the correlations between i th row and j th row and A e r ∈ R H×H . Afterward, the depth-wise right matrix multiplication is applied to augment the feature map as (F n ) j ·A e r , where (F n ) j represents the feature map of j th channel.
where ( ) where subscript j denotes the th j channels' feature matrix and e F denotes the edge distribution enhanced feature maps.  As to column-wise edge attention pipeline, the similar calculations are implemented. First of all, the F c ∈ R H×W×C is split in channel dimension. In addition, the following matrix is defined, where F i c ∈ R W×H means the i th channel in F c ∈ R H×W×C and F c = 1 C ∑ C i F i c also represents the average values. Then, the column-wise correlations can be produced, where (i, j) represents position in Cov e c . Intuitively, the A e c (i, j) quantifies the correlations between i th column and j th column and A e c ∈ R W×W . Subsequently, a depthwise left matrix multiplication is realized. Finally, an EDA refines the input feature maps to where subscript j denotes the j th channels' feature matrix and F e denotes the edge distribution enhanced feature maps.

Hybrid Attention Module
In Sections 2.3 and 3.2.2, the pipeline of SAM and EDA were explained. The chapter that follows moves on to deliberate the fusion of these two modules.
Resort to the comprehensive analysis of residual connection; it is an efficient way to highlight the edge information in learnt representations. In this way, the pixels in the boundary areas have an auxiliary representation-tensor, making the error-prone pixels easily classified with high certainty and correctness. And the details are illustrated in Figure 4. learnable coefficients of weights.
Generally speaking, with the design of HAM, the edge-enhanced features are injected into position-attentive components. To further verify the importance of two kinds of features, the learnable coefficients are devised and they can be optimized along with training. Using HAM to model the dependencies of global edge distribution and position-wise dependencies over local and global areas, the learnt features can provide more reasonable and consistent semantic cues.

Datasets
Turning now to the experiments, the datasets and their properties are introduced firstly. Then, as listed in Table 1, three representative benchmarks are used to evaluate the performance. ISPRS Vaihingen and Potsdam datasets are acquired by airborne sensors with very high spatial resolution, while the DeepGlobe dataset contains satellite images from the Dig-italGlobe platform. The rest of this section presents the data description in detail.  Formally, let the input feature maps be F ∈ R H×W×C , where H is the height, W is the width and C is the number of channel. Feeding the original feature map to HAM, two parallel branches form two refined feature maps, they are F e and F p respectively. In addition, both the two refined feature maps have same dimensions with H × W × C. Thus, HAM generates the output feature maps, where F h ∈ R H×W×C is the output refined feature maps by HAM, µ and λ are the learnable coefficients of weights. Generally speaking, with the design of HAM, the edge-enhanced features are injected into position-attentive components. To further verify the importance of two kinds of features, the learnable coefficients are devised and they can be optimized along with training. Using HAM to model the dependencies of global edge distribution and position-wise dependencies over local and global areas, the learnt features can provide more reasonable and consistent semantic cues.

Datasets
Turning now to the experiments, the datasets and their properties are introduced firstly. Then, as listed in Table 1, three representative benchmarks are used to evaluate the performance. ISPRS Vaihingen and Potsdam datasets are acquired by airborne sensors with very high spatial resolution, while the DeepGlobe dataset contains satellite images from the DigitalGlobe platform. The rest of this section presents the data description in detail.

ISPRS Vaihingen dataset
The Vaihingen dataset [43] is acquired by airborne sensors that cover the Vaihingen region in Germany. Semantic labeling of the urban objects at the pixel level is challenging due to the high intra-class variance while the inter-class is low. Therefore, the semantic segmentation of very high-resolution aerial images drives extensive scholars to design advanced processing techniques. In the associated label images, six categories are annotated. They are impervious surfaces, building, low vegetation, tree, car and clutter/background.
As shown in Figure 5c,d, the raw image consists of three spectral bands: Red (R), Green (G) and Near Infrared (NIR). According to the public data, there are 16 images with a spatial size of around 2500 × 2500 available. The ground sample distance (GSD) is 9 cm. due to the high intra-class variance while the inter-class is low. Therefore, the semantic segmentation of very high-resolution aerial images drives extensive scholars to design advanced processing techniques. In the associated label images, six categories are annotated. They are impervious surfaces, building, low vegetation, tree, car and clutter/background. As shown in Figure 5c,d, the raw image consists of three spectral bands: Red (R), Green (G) and Near Infrared (NIR). According to the public data, there are 16 images with a spatial size of around 2500 × 2500 available. The ground sample distance (GSD) is 9 cm.

ISPRS Potsdam dataset
Another available semantic labeling benchmark of ISPRS is the Potsdam dataset [44], imaging on an airborne platform. The illustration of a random sample is presented in Figure  5a,b. The spatial size is 6000 × 6000 pixels with a 5 cm of GSD. The same annotations are labeled as the Vaihingen dataset. This dataset releases 24 images for academic objectives.

DeepGlobe dataset
DeepGlobe land cover classification dataset [45] contains high-resolution satellite images courtesy of DigitalGlobe. The available data consists of 803 images with the spatial size of 2448 × 2448. Correspondingly, the well-annotated ground truth images label seven categories. They are urban land, agriculture land, rangeland, forest land, water, barren land and unknown area. Figure 6 a,c are two samples and Figure 6 b,d are the associated label image.
Generally speaking, to extensively evaluate the performance, both aerial and satellite images are necessary to be tested. Considering the heterogeneous intra-class variants and

ISPRS Potsdam dataset
Another available semantic labeling benchmark of ISPRS is the Potsdam dataset [44], imaging on an airborne platform. The illustration of a random sample is presented in Figure 5a,b. The spatial size is 6000 × 6000 pixels with a 5 cm of GSD. The same annotations are labeled as the Vaihingen dataset. This dataset releases 24 images for academic objectives.

3.
DeepGlobe dataset DeepGlobe land cover classification dataset [45] contains high-resolution satellite images courtesy of DigitalGlobe. The available data consists of 803 images with the spatial size of 2448 × 2448. Correspondingly, the well-annotated ground truth images label seven categories. They are urban land, agriculture land, rangeland, forest land, water, barren land and unknown area. Figure 6a,c are two samples and Figure 6b,d are the associated label image.
Generally speaking, to extensively evaluate the performance, both aerial and satellite images are necessary to be tested. Considering the heterogeneous intra-class variants and inter-class similarities, the network is expected to learn the key information, unraveling the inherent correlations. So, ISPRS benchmarks and the DeepGlobe dataset are tested. inter-class similarities, the network is expected to learn the key information, unraveling the inherent correlations. So, ISPRS benchmarks and the DeepGlobe dataset are tested.

Hyper-Parameters and Implementation Details
Prior to the experiments, the hyper-parameter settings should be clearly defined and identified. As listed in Table 2, the hyper-parameters and implementation details are presented. Initially, the raw images and annotated ground truth are split to sub-patches with a spatial size of 256 × 256. Then, uniformly, the data partitioning subjects to a ratio of 8:1:1. In addition, the three parts of data is non-overlapping. The validation and test datasets are un-trained. In addition, the same data augmentations are applied.

Hyper-Parameters and Implementation Details
Prior to the experiments, the hyper-parameter settings should be clearly defined and identified. As listed in Table 2, the hyper-parameters and implementation details are presented. Initially, the raw images and annotated ground truth are split to sub-patches with a spatial size of 256 × 256. Then, uniformly, the data partitioning subjects to a ratio of 8:1:1. In addition, the three parts of data is non-overlapping. The validation and test datasets are un-trained. In addition, the same data augmentations are applied. Moreover, the backbone, ResNet 101, is commonly used due to the less transformation loss and feature distortions by residual connections. In our network, ResNet 101 corre-sponds to the standard encoder in Figure 2 and the decoder is symmetric. In addition, the layers and structures are referred to [59].
As listed in Table 3, several mainstream methods, including classical segmentation networks, attention-based methods and RSI-specific networks, are compared to evaluate performance comprehensively. Some of them are initially designed for natural images, yet we re-implement these methods on three RSI datasets successfully and yield a not bad result. In contrast, no previous study has investigated the statistical edge and incorporated it in an end-to-end learnable style, causing performance bottlenecks. ResUNet-a RSI-specific networks A deep learning framework for semantic segmentation of remotely sensed data [58] SCAttNet Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images [38] EDENet Ours /

Numerical Metrics
In experiments, two widely used numerical metrics, OA (Overall Accuracy) and mIoU (mean intersection over union) are calculated to quantify the performance.
where TP denotes the number of true positives, FP denotes the number of false positives, FN denotes the number of false negatives and TN denotes the number of true negatives.

Comparison with State-of-the-Art
We reasoned that the extraction and injection of edge distributions attentively performs important effects on boosting the segmentation accuracy. In our work, we sought to establish a methodology for leveraging prior edge information of learnt features. Inspired by two directional covariance matrix analysis, EDA is devised and incorporated into EDENet. To evaluate the performance, the experiments are carried on three different benchmarks. The rest of this section will further compare and discuss the collected results. Table 4 reports results on the Vaihingen test set and highlights the best performance in bold. Apart from overall accuracy and mIoU, the class-wise accuracy and IoU are also collected in OA/IoU form for individual categories. Generally speaking, it is apparent that the highest OA and mIoU values are obtained by EDENet, demonstrating exceptionally good performance in accuracy. The baseline models, such as SegNet, U-Net and DeepLab V3+, are susceptible to the interference of ubiquitously subsistent intra-class variants and inter-class similarities. Among them, SegNet has an OA of 79.10%, dramatically lower than EDENet with more than ten percentages. DeepLab V3+ enlarges the receptive field to capture more localcontextual information, boosting the segmentation results by increasing about 2.5% in OA and 6% in mIoU than SegNet. However, the results are imprecise enough.

Results on Vaihingen Dataset
Resorting to the powerful capability in learning long-range dependencies of attention mechanisms CBAM, DANet, NLNet and OCRNet eventually corroborate the impacts of these dependencies. By the similarity analysis, also known as the generation of attention map, position-wise representations are refined with more contextual information that amplifies the margin distance of categorical representations. As a result, the geo-details are enhanced, reducing the uncertainty in identification. Compared to classical models, attention-based methods have experienced a remarkable overall improvement. The early proposed CBAM and DANet employ channel-wise attention and spatial-wise attention parallelly to enrich the contextual information, arising OA and mIoU at about 1% than DeepLab V3+. It is worth noting that CBAM and DANet have fewer parameters and time costs than DeepLab V3+. Moreover, NLNet proposed a positional self-attention mechanism, capturing channel and spatial correlations simultaneously. Therefore, the OA and mIoU further increase to more than 83% and 71%. Specifically, OCRNet even reaches more than 72% in mIoU and 85% in OA by quantifying and injecting pixel-object correlations in addition to pixel-pixel correlations.
Initially, these natural image-targeted networks are not inadequate for RSI. It is suggested that two critical properties of RSI lead to insufficient applications. One is that RSI is always acquired from a high-altitude angle. The other is the wide observation range and the covered complex and diverse visual objects by imaging sensors. For example, the visual illustration of cars is with different colors, shapes, textures, even sheltered by trees, buildings, or shadows. Striven to alleviate the interference, it is necessary to enhance the distinguishability of learnt representations. Conventionally, ResUNet-a integrates various strategies to promote the probability of correct classification, including multi-task inference, multi-residual connections, atrous convolutions, pyramid scene parsing pooling and optimized Dice loss function. Even though the accuracy, which achieves 87.25% in OA and 75.43% in mIoU, is acceptable, the size of the trained model parameter is large and the time and memory occupation are criticized. In contrast, SCAttNet is a lightweight model that uses a dual attention mechanism to optimize the learnt feature maps.
Thus far, the previous works have argued that the attention mechanism is also applicable in RSI semantic segmentation. Nevertheless, they equally handle the pixels, whether they are edge-pixel or not. As discussed in Section 1, the edge pixels are error-prone, conjointly impact the around pixels' classification to an extent. Starting from the image covariance matrix analysis, we find that edge distributions can be learned and utilized. Then, the column-wise and row-wise edge attention maps are generated and injected into learnt representations to highlight the edge pixels. Consequently, the OA and mIoU are strikingly boosted. Compared to ResUNet-a, which has the best performance on RSI yet, EDENet increases OA and mIoU more than 2% and 1% by a few matrix manipulations. More concretely, EDENet surprisingly enhances the segmentation performance on all categories, especially in distinguishing objects with confusing features. As for cars, which are sensitive to the boundary when recognizing, the growth of accuracy is excellent, with almost 5% to ResUNet-a. Unquestionably, the exact contour enables the easily-confused pixels in short-distance and long-distance to be correctly classified with high definiteness.
The visual inspection is presented in Figure 7. We randomly select two samples and predict the pixel-wise label. Intuitively, deep neural networks are capable of applying to RSI semantic segmentation tasks. Thus, the objects are labeled corresponding to ground truth. Nonetheless, the blurry boundaries are ubiquitous by existing methods, leading to unsatisfactory results. As for easily-confused low vegetation and trees, the delineation of boundaries plays a pivotal role in locating the position of objects. With an accurate location, the segments tend to be more complete and consistent.

Results on Potsdam Dataset
Different from the Vaihingen dataset, Potsdam has more data for training and validation. Meanwhile, the ground objects exhibit entirely heterogeneous visual characteristics, as previously illustrated in Table 5.
Coincidentally, the overall accuracy performs similar trends to Vaihingen. In addition, the OA and mIoU are relatively higher than Vaihingen by EDENet. Furthermore, all the category-wise accuracy of EDENet is also the highest to other models. This fact sug- To sum up, attributing to the distinct boundaries by injecting edge distributions attentively, overall accuracy and visualizations are significantly improved compared to other SOTA methods on the ISPRS Vaihingen benchmark.

Results on Potsdam Dataset
Different from the Vaihingen dataset, Potsdam has more data for training and validation. Meanwhile, the ground objects exhibit entirely heterogeneous visual characteristics, as previously illustrated in Table 5. Coincidentally, the overall accuracy performs similar trends to Vaihingen. In addition, the OA and mIoU are relatively higher than Vaihingen by EDENet. Furthermore, all the category-wise accuracy of EDENet is also the highest to other models. This fact suggests that EDENet manifests decent generalizability. Moreover, owing to the abundant wellannotated data for training, the robustness of the network is enhanced.
In general, attention-based methods are feasible for capturing various contextual information, facilitating the learning ability of geo-objects. Therefore, the accuracy of attention-based methods is holistically superior to the classical ones. Above all, OCRNet takes advantage of pixel-object contextual relationships besides pixel-wise relationships, revealing the significance of geo-objects' representations. In this way, OCRNet reaches more than 85% of overall accuracy and 72% of mIoU, dramatically improving. Similarly, SCAttNet designed dual attention modules to answer the complex of RSI objects, resulting in almost the same level of performance to OCRNet. By combining multiple tricks, ResUNeta boosts the overall accuracy of 2% than OCRNet.
However, the existing methods are still far from EDENet, which produces a very competitive result. Specifically, the impervious surfaces and buildings are barely misclassified. Moreover, the easy-confused low vegetation and trees are also duly partitioned by the prior knowledge of edge. In addition, the classification of cars depends on locating the position, which is closely related to the edge information. Eventually, cars have the most considerable growth.
For qualitative evaluation, two samples of the Potsdam test set are predicted and illustrated in Figure 8. Sharpen edges make the interior pixels of an object more accessible to be classified correctly. For instance, the low vegetation and trees always appear concurrence a neighboring. However, from the visual perspective in RSI, they are easily-confused by the high inter-class consistency. Once the boundaries are misregistered, a deluge of pixels is forced to be wrongly predicted. Although attention-based and RSI-specific approaches have leveraged contextual information to enhance the distinguishability between different categories' pixels, the blurry edges still trouble the semantic segmentation performance. As we can see, cars, marked as yellow, are challenging to outline under deficient boundaries wholly. Figure 8l shows the results produced by EDENet, by which the edges are retained to the uttermost refers to ground truth. In fact, EDENet segments RSI with high-fidelity and high consistency. Comparing classical and attention-based ones has convinced the achievability and availability of attention mechanisms in RSI feature optimization. CBAM, DANet and NLNet extract global context by attention modules to enrich the inference cues and a fair increase of more than 1% than DeepLab V3+ is produced. With more complementary contextual information, OCRNet further boosts OA by about 1%. Likewise, SCAttNet sequentially embeds channel and spatial attention modules to refine the learnt features adaptively. As a result, this lightweight network achieves similar performance to OCRNet. Empirically, ResUNet-a analyzes the essence of RSI when distinct confusable pixels and employs multiple manipulations. Therefore, the OA and mIoU are over 80% and 58% by ResUNet-a, which is the first place ever before. Unfortunately, these models ignore the impacts of localization by edge delineation, triggering rough boundaries. EDENet learns the edge distributions and injects them into feature optimization, provoking a significant improvement of accuracy.
In the visual inspections from Figure 9, the top sample covers five classes of ground. Inherently, the water areas surrounding rangeland and agricultural land are not visually recognizable. In addition, the classical networks work poorly on distinct these objects. In this context, consistent boundaries are critical in actuating the network to separate the different pixels assuredly. EDENet keeps the highest consistency with ground truth by highlighting the edges in learnt representations compared to the SOTA methods. Similar to the bottom sample, only a tiny part of the edges is out of position, leading to some Overall, EDENet expresses strong power in applying to RSI with different spatial resolution and various ground features according to the quantitative and qualitative evaluations. The exceptionally preferable results are obtained by learning and injecting edge distributions of feature maps, which indirectly help the network adjust error-prone pixels.

Results on DeepGlobe Dataset
As discussed above, the DeepGlobe dataset collects high-resolution sub-meter satellite imagery. Due to the variety of land cover types and the density of annotations, this dataset is more challenging than existing counterparts like ISPRS benchmarks. In this regard, the grade of difficulty in semantic segmentation is leveled up majorly.
As reported in Table 6, the quantitative results are collected, where the bold number indicates the best. Typically, the OA and mIoU have appreciably decrease compared to ISPRS benchmarks, ascribing to the indistinguishable spatial and spectral features by satellite sensors. Still and all, EDENet realizes the accurate classification of each category without exception, contributing to the highest OA and mIoU of more than 83% and 60%, respectively. To our knowledge, the differential accuracy across all categories is accounted for the imbalanced distribution, among which the agricultural land has the most pixels. Comparing classical and attention-based ones has convinced the achievability and availability of attention mechanisms in RSI feature optimization. CBAM, DANet and NLNet extract global context by attention modules to enrich the inference cues and a fair increase of more than 1% than DeepLab V3+ is produced. With more complementary contextual information, OCRNet further boosts OA by about 1%. Likewise, SCAttNet sequentially embeds channel and spatial attention modules to refine the learnt features adaptively. As a result, this lightweight network achieves similar performance to OCRNet. Empirically, ResUNet-a analyzes the essence of RSI when distinct confusable pixels and employs multiple manipulations. Therefore, the OA and mIoU are over 80% and 58% by ResUNet-a, which is the first place ever before. Unfortunately, these models ignore the impacts of localization by edge delineation, triggering rough boundaries. EDENet learns the edge distributions and injects them into feature optimization, provoking a significant improvement of accuracy.
In the visual inspections from Figure 9, the top sample covers five classes of ground. Inherently, the water areas surrounding rangeland and agricultural land are not visually recognizable. In addition, the classical networks work poorly on distinct these objects. In this context, consistent boundaries are critical in actuating the network to separate the different pixels assuredly. EDENet keeps the highest consistency with ground truth by highlighting the edges in learnt representations compared to the SOTA methods. Similar to the bottom sample, only a tiny part of the edges is out of position, leading to some blurry boundary and misclassification bias. However, the global accuracy and consistency are retained.
To sum up, EDENet can adaptively learn and highlights the edges without depending on the spatial resolution or the visual separability. Although the overall accuracy is decreased, EDENet outperforms SOTA methods remarkably. In addition, the achievements further demonstrate the strong generalizability of EDENet on multi-sensors imagery.

Ablation Study of EDA
The previous studies using CNNs indicate that learning distinguishable representations is essential to enhance segmentation performance. Furthermore, attention mechanisms are employed as an efficient way to refine representations. Nevertheless, they did not learn the edge information, which is pivotal to localize the objects and calibrate the pixels around the edge. The devised EDA bridges this gap by attentively learning prior edge distribution in feature maps at arbitrary scales. In addition, we heliacally embed this module to standard encoder-decoder architecture, generating excellent results.
To comprehensively evaluate EDA, the ablation study is implemented under the same hyper-parameters and runtime environment. Practically, a version that removes the EDA from HAM in Figure 4 is constructed and named as non-EDA. Now, as presented in Table 7, the OA and mIoU are collected to analyze the effects. Generally, EDA elicits about a 5% increase on Vaihingen and Potsdam datasets in OA and 4% on DeepGlobe. As for mIoU, the relative improvements are deserved. The effects of EDA are dramatically illustrated. Moreover, we monitor the loss and mIoU during the training process to support the evaluation, which can be seen in appendices.
Overall, EDA essentially guarantees unbiased edges, resulting in visually well-segmented objects and obtaining desirable quantitative accuracy.

Discussions
Prior studies have noted the importance of contextual information in enhancing the distinguishability of learnt representations. Attention mechanisms paved an effective way for capturing the context by several matrix manipulations. Thus, the accuracy has been boosted. However, the edges should be emphasized due to their essential locations. Once the edges are failed to be delineated, the blurs will lead to misconceived and omissive pixels. Therefore, our study seeks to produce a new attention module, which can be flexibly embedded into the end-to-end segmentation network and a learnable way to extract and inject edge distributions of learnt feature maps.
Inspired by the attention mechanism and covariance matrix analysis in 2DPCA, we propose EDENet, which hierarchically embeds hybrid attention modules that learns convolved features and highlights edge distributions simultaneously. As a result, the experiments on aerial and satellite images have illustrated the excellence of EDENet. Both numerical evaluations and visual inspections have supported this finding. Moreover, unlike SOTA RSI-specific methods, the OA and mIoU are improved significantly without the complex design of network architecture.

Conclusions
The semantic segmentation of RSI plays a pivotal role for various downstream applications. In addition, boosting the accuracy of segmentation has been a hot topic in the field. The existing approaches have produced competitive results by attention-based deep convolutional neural networks. However, the deficiency of edge information leads to blurry boundaries, even raising high uncertainties of long-distant pixels during recognition.
In this study, we have investigated an end-to-end trainable semantic segmentation neural network of RSI. Essentially, we first formulate and model the edge distributions of encoded feature maps inspired by covariance matrix analysis. Then, the designed EDA learns the column-wise and row-wise edge attention maps in a self-attentive fashion. As a result, the edge knowledge is successfully modeled and injected into learnt representations, facilitating representativeness and distinguishability. In addition to leverage edge distributions, HAM employs non-local block as another parallel branch to capture the position-wise dependencies. As a result, the complementary contextual and edge information are learned to enhance the discriminative capability of the network. In experiments, three diverse datasets from multiple sensors and different imaging platforms are examined. The results indicate the efficacy and superiority of the proposed model. With the ablation study, we further demonstrate the effects of EDA.
Nevertheless, there are still several challenging issues to be addressed. First of all, the multi-modal data are necessarily fused to improve the semantic segmentation performance, such as DSM information and SAR data. Moreover, the transferable models are of great concern to adaptively cope with the increasingly diverse imaging sensors. Furthermore, the basic convolution units also have the potential to be optimized in convergence rate while producing a global-optimal solution, as the previous work has validated [60]. In addition, the semantic segmentation tasks, image fusion [61], image denoising [62] and image restoration [63] of remote sensing images also rely on the feature extraction; the extension of the proposed module should be promising and challenging.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Ablation Study on the Vaihingen Dataset
As outlined in the previous discussions, the test aims to evaluate the efficacy and superiority of EDA. Under the constant environment and hyper-parameter settings, we implement the no-EDA version on the Vaihingen dataset and record the loss and mIoU during the training process.
As illustrated in Figure A1, the training loss is drawn. Embedding EDA makes the network convergence with a lower training loss (cross-entropy loss). Furthermore, EDENet drops the loss from 0.1850 to 0.0409 compared to no-EDA version. Correspondingly, the mIoU given in Figure A2 reveals the tremendous changes. EDENet reaches up to 95.51% on mIoU, while no-EDA version merely has 84.03%.
On the whole, we found evidence to suggest that edges may be closely related to contribute to calibrate the error-prone pixels around boundaries. From this perspective, injecting and underlining the edge distributions of leant representations plays a crucial role in sharpening the segments. implement the no-EDA version on the Vaihingen dataset and record the loss and mIoU during the training process. As illustrated in Figure A1, the training loss is drawn. Embedding EDA makes the network convergence with a lower training loss (cross-entropy loss). Furthermore, EDENet drops the loss from 0.1850 to 0.0409 compared to no-EDA version. Correspondingly, the mIoU given in Figure A2 reveals the tremendous changes. EDENet reaches up to 95.51% on mIoU, while no-EDA version merely has 84.03%.
On the whole, we found evidence to suggest that edges may be closely related to contribute to calibrate the error-prone pixels around boundaries. From this perspective, injecting and underlining the edge distributions of leant representations plays a crucial role in sharpening the segments.

Appendix B Ablation Study on the Potsdam Dataset
For the Potsdam dataset, the loss and mIoU during the training process are also compared. As shown in Figures A3 and A4, the curves are presented. The variation of spatial resolution and objects' visual sensitivity is incapable of degrading the performance of EDENet. EDA decreases a great deal of training loss. Numerically, the loss is dropped from 0.2199 to 0.0503 over 77%. Turn to mIoU, EDA boosts the result from 82.89% to 94.81%.
In general, the recorded results on the Potsdam dataset follow similar patterns to the Vaihingen dataset.

Appendix B. Ablation Study on the Potsdam Dataset
For the Potsdam dataset, the loss and mIoU during the training process are also compared. As shown in Figures A3 and A4, the curves are presented. The variation of spatial resolution and objects' visual sensitivity is incapable of degrading the performance of EDENet. EDA decreases a great deal of training loss. Numerically, the loss is dropped from 0.2199 to 0.0503 over 77%. Turn to mIoU, EDA boosts the result from 82.89% to 94.81%.
In general, the recorded results on the Potsdam dataset follow similar patterns to the Vaihingen dataset.

Appendix C Ablation Study on the DeepGlobe Dataset
DeepGlobe dataset consists of a deluge of satellite images showing diverse and complex land cover types. Figures A5 and A6 plot the training loss and mIoU changing status. There is a lower loss along with the training phase of EDENet. It eventually drops the loss

Appendix C Ablation Study on the DeepGlobe Dataset
DeepGlobe dataset consists of a deluge of satellite images showing diverse and complex land cover types. Figures A5 and A6 plot the training loss and mIoU changing status. There is a lower loss along with the training phase of EDENet. It eventually drops the loss