Hybridizing Cross-Level Contextual and Attentive Representations for Remote Sensing Imagery Semantic Segmentation

Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in intelligent interpretation. Since deep convolutional neural networks (DCNNs) have shown considerable ability in learning implicit representations from data, numerous works in recent years have transferred DCNN-based models to remote sensing data analysis. However, the wide observation areas, complex and diverse objects, and varying illumination and imaging angles make pixels easily confused, leading to undesirable results. Therefore, a remote sensing imagery semantic segmentation neural network, named HCANet, is proposed to generate representative and discriminative representations for dense prediction. HCANet hybridizes cross-level contextual and attentive representations to emphasize the distinguishability of learned features. First, a cross-level contextual representation module (CCRM) is devised to exploit and harness superpixel contextual information. Moreover, a hybrid representation enhancement module (HREM) is designed to fuse cross-level contextual and self-attentive representations flexibly. Furthermore, the decoder incorporates the DUpsampling operation to boost efficiency losslessly. Extensive experiments are conducted on the Vaihingen and Potsdam benchmarks. The results indicate that HCANet achieves excellent performance in overall accuracy and mean intersection over union, and the ablation study further verifies the superiority of CCRM.


Introduction
Remote sensing imagery (RSI) semantic segmentation has been a fundamental task in interpreting and parsing the observation areas and objects [1]. It strives to assign pixel-level categorical labels for the image. In recent years, there has been a growing interest in performing semantic segmentation for multi-source remote sensing data since it is of great significance to various applications, such as urban planning [2,3], water resource management [4,5], precision agriculture [6,7], road extraction [8,9] and so forth.
Afterward, DCNNs (deep convolutional neural networks) achieved satisfying performance by learning feature representations and classifier parameters simultaneously, which is deemed to learn more targeted features than conventional methods, and FCN-style architectures extended them to dense prediction. However, despite the progress in leveraging contextual information, there is still very little scientific understanding of modeling the correlations between representations that come from different levels. We suppose that this is the main reason that makes the results far from optimal. A superpixel denotes an area generated by grouping pixels in remote sensing data, providing a more natural representation of image data than pixels. Therefore, in addition to independent pixel-level and superpixel-level representations, the cross-level correlations are beneficial for optimizing these two-level features, increasing the probability of correctly classifying the pixels.
Motivated by the observation that the class label assigned to one pixel is the category of the superpixel the pixel belongs to, we augment the representation of a pixel by exploiting the representation of the superpixel region of the corresponding class. Hence, HCANet (hybridizing cross-level contextual and attentive representations neural network) is proposed for remote sensing imagery semantic segmentation. The main contributions are as follows: (1) A cross-level contextual representation module (CCRM) is devised to exploit and harness superpixel contextual information. After learning superpixels under the supervision of the ground truth, the superpixel regions are generated and represented by aggregating the pixels lying in the corresponding regions. Moreover, the cross-level contextual representation is produced by quantifying and formulating the correlations between pixels and superpixels. Finally, the correlations are injected to augment the representativeness and distinguishability of the features.
(2) A hybrid representation enhancement module (HREM) is designed to fuse cross-level contextual and self-attentive representations flexibly. As discussed above, self-attention modules can facilitate pixel-wise representations effectively. Thus, a sub-branch that introduces a non-local block to refine the encoded feature maps is implemented. Afterward, this module adopts a concatenation operation followed by a 1 × 1 convolution layer to realize the injection of both optimized representations before expansion.
(3) Integrating the above-designed modules, HCANet is constructed on the encoder-decoder architecture for densely predicting pixel-level labels. Furthermore, to boost the decoder's efficiency losslessly, we implement DUpsampling (data-dependent upsampling) [26] to recover feature maps with a one-step up-sampling instead of multiple rounds of up-sampling.
(4) Extensive experiments are conducted on the ISPRS 2D Semantic Labeling Challenge datasets for Vaihingen [27] and Potsdam [28]. The performance is evaluated by numerical indices and visual inspection, and the necessary ablation studies are conducted to further verify the proposed method.
This article is organized as follows. Section 2 briefly introduces the related works on the topic of remote sensing imagery semantic segmentation and attention mechanism. Section 3 presents the network architecture and embedding modules. Section 4 designs the experiments and discusses the results. Finally, Section 5 draws the conclusions and presents the future directions.

Semantic Segmentation of RSI
Semantic segmentation of RSI profits from the advancements of deep learning methods [29,30]. For example, Mou et al. [31] devised two network modules to discover relationships between any two positions in RS imagery. ScasNet (self-cascaded convolutional neural network) [32] utilized a self-cascade network to improve the labeling coherence with sequential global-to-local context aggregation. SDNF (superpixel-enhanced deep neural forest) [20] fused DCNNs with decision forests to form a training network with a specific architecture. Marmanis et al. [33] captured edge details for fine-tuning the semantic boundaries of objects. Similarly, Edge-FCN [34] fuses prior edge knowledge and learnt representations, guiding the network to obtain better segmentation performance. Additionally, a remarkable neural network named ResUNet-a provided a framework for semantic segmentation of monotemporal very high-resolution aerial images.
Based on U-Net, ResUNet-a innovatively integrates residual connections and an optimized Dice loss function to realize accurate inference.
It is essential to refine the representations with sufficient contextual information, including edges, surrounding pixels and homogeneous regions. However, the methods mentioned above inject these clues in a handcrafted, non-learnable fashion.

Attention Mechanism
The attention mechanism is a strategy of allocating biased computational resources to highlight the informative parts. This mechanism allows the inputs to interact with each other ("self") and determine what they should pay more attention to [35]. In other words, the mechanism stacks a few parallelizable matrix calculations to form attentive maps, which provides an efficient tool to capture short- and long-range dependencies. For example, the SE (squeeze-and-excitation) block was proposed [36] to generate channel-wise attention; it emphasizes the scene-relevant feature maps while suppressing spatially-irrelevant information. The convolutional block attention module (CBAM) integrated spatial and channel attention modules to enrich the representations with informative regions [37]. Moreover, the non-local block was devised for several visual tasks, and the results indicate its superiority in accuracy and efficiency [38].
In addition to natural image processing, many attention-based works have been advocated for RSI. In [39], a channel attention block was embedded for recalibrating the channel-wise feature maps. Cui et al. [40] exploited the attention mechanism to match caption nouns with objects in RSI. Then, global attention upsampling [41] was introduced to provide global guidance from high-level features to low-level ones. Specifically, both positional and channel-wise relations are captured and integrated in serial and parallel manners, producing reasonable cues for inference [42]. Moreover, SCAttNet [24] was proposed to learn an attention map that adaptively aggregates contextual information for every point in RS imagery. Along with the analysis of local context, LANet [25] was proposed to bridge the gap between high- and low-level features; its representations are refined by patch attention modules, and the performance on the ISPRS 2D benchmarks reaches the SOTA (state of the art) over several attention-based methods. In the same way, CCANet (class-constraint coarse-to-fine attentional deep network for sub-decimeter aerial image semantic segmentation) [43] enforced class-information constraints to obtain exact long-range correlational context, and the results on two RS datasets verify its fine-grained segmentation performance. Homoplastically, a cascaded relation attention module was designed to ascertain the relationships among channels and positions [44,45]. Sun et al. [46] designed a boundary-aware semi-supervised semantic segmentation network, generating satisfactory segments with limited annotated data.
Although numerous methods incorporating attention mechanisms have been studied and achieved competitive performance, the relational context remains insufficient. Previous approaches form superpixel regions in an unsupervised manner, producing uncertain and unreliable context and weakening the classification ability.

The Proposed Method
In this section, the details of the proposed method are presented and discussed. Before the analysis of the overall framework, the directly-related preliminaries are introduced. Then, the proposed HCANet and the embedded CCRM and HREM are illustrated and formally described.
As discussed in Section 1, in CCRM, we first model and formulate the correlations between pixel and corresponding superpixels to enhance the distinguishability of learnt representations. Then, HREM concatenates the superpixel contextual representations and self-attentive representations, generating the final representations for decoding.
The rest of this section first presents the relevant preliminaries used in the construction of HCANet. Specifically, the details of CCRM are presented after introducing the framework, including theoretical analysis, structure description and formalization.

Non-Local Block
As a typical design of self-attention, the non-local block (NL) [38] calculates the spatial-wise and channel-wise correlations simultaneously by several matrix multiplications, and it is flexible enough to be embedded into various frameworks. The topological architecture is illustrated in Figure 1 and the formal description of the NL is presented as follows. Let the input feature be X ∈ R^{H×W×C}, where C, H and W indicate the channel, height and width, respectively. Three 1 × 1 convolutions are used as transformation functions to produce three diverse embeddings:

φ = W_φ X, θ = W_θ X, γ = W_γ X,

where φ, θ, γ ∈ R^{H×W×Ĉ} and Ĉ is the reshaped feature's channel number. Then, the feature maps are flattened to Ĉ × N, where N = H × W, and the similarity matrix is produced by

V = θ^T φ,

where V ∈ R^{N×N}. After normalization, the similarity matrix is transformed to

V̂ = f(V),

where f denotes the normalization function; in this paper, the Softmax function is opted as f. For every position in γ, the output of the attention layer is formed as

O = V̂ γ^T,

where O ∈ R^{N×Ĉ}. In general, the final output is given by

Y = W_O Cat(O, X),

where W_O is the weight matrix implemented by a 1 × 1 convolution, Y is the refined feature map of X and Cat(·) denotes the concatenation process. The non-local block helps the network capture long-range dependencies, and this simple yet efficient design is of great significance for semantic segmentation performance. However, the non-local block only considers pixel-wise dependencies across space and channels, ignoring the pixel-superpixel correlation, which is pivotal for characterizing pixels.
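The attention arithmetic above can be sketched in a few lines. The following illustrative NumPy version operates on flattened (N, C) features; the matrices W_phi, W_theta, W_gamma and W_o are stand-ins for the block's 1 × 1 convolutions, and the final concatenation with X is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable Softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(X, W_phi, W_theta, W_gamma, W_o):
    """Sketch of the non-local attention math on a flattened feature map.

    X: (N, C) pixel features, N = H * W.
    W_phi / W_theta / W_gamma: (C, C_hat) stand-ins for the 1x1 convolutions.
    W_o: (C_hat, C) output transform; the Cat(O, X) step is omitted here.
    """
    phi, theta, gamma = X @ W_phi, X @ W_theta, X @ W_gamma  # embeddings
    V = theta @ phi.T                  # (N, N) similarity matrix
    V_hat = softmax(V, axis=-1)        # row-wise Softmax normalization f
    O = V_hat @ gamma                  # (N, C_hat) attended features
    return O @ W_o                     # project back to C channels
```

Each output position is thus a Softmax-weighted mixture of all N positions, which is exactly how the block captures long-range dependencies in one matrix product.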

Superpixel Context
Superpixels in RSI are grouped pixels, which can be explicitly seen as image regions. These regions help the network learn a more realistic representation than individual pixels. Figure 2 illustrates the superpixel context: the red box indicates the pixel to be represented, and the partial superpixel region is marked by the light cyan line. Modeling the relationships between the pixel and the superpixel regions significantly improves the distinguishability and separability of the encoded representations. For example, we abstractly deem the red box as one pixel lying on a black car object in Figure 2. Baseline networks, such as SegNet, U-Net and DeepLab V3+, extract features from a regular local area with a size of 3 × 3, 5 × 5 or others (determined by configurations). If the pixel were located in the red car area, the learnt representations would be very different. This intra-class inconsistency always leads to misclassification. If we can find a way to represent the similarity between the annotated pixel and the two cars of different colors, the intra-class inconsistency would be alleviated automatically by the network.
Resorting to previous studies on similarity matrix analysis, it is achievable to capture the superpixel context and model the pixel-superpixel correlations in a learnable way using an attentive fashion. Hence, motivated by this target, the CCRM is investigated and formulated.

DUpsampling
As illustrated in Figure 3, the pipeline of DUpsampling [26] is presented. With a ratio of 2, a specific pixel in the lower-resolution feature map is analogously inferred to a 2 × 2 region in DUpsampling. Let F_D ∈ R^{H̃×W̃×C̃} denote the features to be up-sampled and G indicate the ground truth. As a prerequisite, G can be compressed without loss. Commonly, G is encoded as G ∈ {0, 1}^{H×W×C} in a one-hot fashion, and the ratio between F_D and G is H/H̃ = W/W̃ = r. In the decoder stage, F_D is up-sampled to the same spatial size as G. With bilinear up-sampling, the loss calculation is formed as

L(F_D, G) = Loss(Softmax(B(F_D)), G),

where B(·) is the bilinear up-sampling used in former architectures such as SegNet, FCN, U-Net and so forth. With the incorporation of DUpsampling, the loss function becomes

L(F_D, G) = Loss(Softmax(D(F_D)), G),

where D(·) represents DUpsampling. It is supposed that the ground truth label G is not i.i.d., so G can be compressed without information loss. To determine the transformation matrix W, Tian et al. [26] proposed a learnable way. Firstly, G is compressed to the target spatial resolution G̃ ∈ R^{H̃×W̃×C̃}, which keeps the same size as F_D; the criterion for W is then minimizing the reconstruction error between G and the labels recovered from G̃. Since the spatial correlation of the label is learned, the recovered full-size predictions are more reliable.
As indicated in [26], DUpsampling reduces the computation time and memory footprint of the semantic segmentation method by a factor of about 3. It also allows better feature aggregation to be exploited in the decoder by enlarging the design space of feature aggregation. Moreover, Li et al. [47] have proved its efficiency and efficacy for the RSI semantic segmentation task.
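Under the assumption that DUpsampling amounts to a learned per-pixel linear projection whose output is rearranged into an r × r high-resolution patch (the matrix W below is a hypothetical stand-in for the learned transformation of [26]), a minimal sketch is:

```python
import numpy as np

def dupsample(F, W, r):
    """Sketch of one-step DUpsampling (assumed formulation).

    F: (h, w, c) low-resolution features.
    W: (c, r*r*k) projection matrix, k = output channels per pixel;
       in the paper W is learned to invert the lossless label compression.
    r: upsampling ratio.
    """
    h, w, c = F.shape
    k = W.shape[1] // (r * r)
    out = F.reshape(h * w, c) @ W            # project each low-res position
    out = out.reshape(h, w, r, r, k)         # split into r x r sub-blocks
    out = out.transpose(0, 2, 1, 3, 4)       # interleave block rows/columns
    return out.reshape(h * r, w * r, k)      # one-step full-resolution map
```

The rearrangement is the same sub-pixel shuffle used by pixel-shuffle layers; the difference claimed in [26] is that W is data-dependent rather than fixed bilinear weights.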

The Framework of HCANet
As previously discussed, contextual information is of great importance in feature optimization. The atrous convolution-based networks, such as DeepLab V3+ and ResUNet-a, cost too much time and space to enrich the contextual information by enlarging the local receptive fields, and the local information is still limited and insufficient. Alternatively, NLNet [38], CBAM [37], DANet [21], OCNet [23] and SCAttNet [24] employ an attention mechanism to capture long-range, position-wise dependencies. Unfortunately, focusing only on pixel-wise dependencies is not enough to learn the implicit correlations, where the intra-class inconsistency and inter-class similarity always deteriorate the segmentation performance.
In an attempt to optimize the learnt representations by extracting and leveraging richer contextual information, HCANet is devised. The overall framework is presented in Figure 4. In general, HCANet is based on an encoder-decoder architecture, and two modules are designed to enhance the representations. One is CCRM, which exploits the superpixel context, captures the correlations between each pixel and the corresponding superpixels, and injects these correlations into the pixel-level representations to produce superpixel enhanced representations; specifically, we first model and formulate the cross-level correlations between pixels and superpixels in RSI. The other is HREM, which concatenates the self-attentive and superpixel enhanced representations to generate refined representations for decoding. Furthermore, to alleviate the loss of up-sampling in the decoder stage, DUpsampling is incorporated. Finally, the Softmax classifier is employed to predict the pixels densely. The rest of this section explains CCRM, HREM and DUpsampling in detail.

Superpixel Region Generation and Representation
First of all, the superpixel regions are generated from coarse segmentations, which derive from the intermediate feature maps of the backbone. During training, the coarse regions are improved under the supervision of the ground-truth segmentation with a cross-entropy loss.
Let an input image be denoted as F and let the superpixel regions [SP_1, SP_2, . . . , SP_K] correspond to the K categories. Then, the representation of a superpixel is obtained by aggregating all the pixels' representations in the corresponding superpixel region. Formally,

F_k = Σ_i α_{ki} X_i,

where F_k denotes the kth superpixel's representation, X_i is the representation of the corresponding pixel p_i and α_{ki} computes the degree to which pixel p_i belongs to the kth superpixel. In practical implementation, a spatial Softmax is applied over superpixel region SP_k to normalize α_{ki}.
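A minimal sketch of this aggregation, with the soft memberships α obtained by a spatial Softmax over hypothetical coarse-segmentation logits:

```python
import numpy as np

def superpixel_representations(X, logits):
    """Aggregate pixel features into K superpixel representations.

    X:      (N, C) pixel features, N = H * W.
    logits: (K, N) coarse per-class scores for each pixel (a stand-in for
            the backbone's coarse segmentation).
    Returns (K, C): F_k = sum_i alpha_ki * X_i, with alpha a spatial
    Softmax over the N pixels so each region's weights sum to 1.
    """
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)   # (K, N) soft memberships
    return alpha @ X                           # (K, C) superpixel features
```

With uniform logits every α row is 1/N, so each superpixel representation degenerates to the mean pixel feature, which is a useful sanity check.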
In the experiments, the superpixel regions are produced from coarse segments, which are obtained from the encoded feature maps of the backbone without any extra computation.

Cross-Level Contextual Representation
As depicted in Figure 5, the pipeline of CCRM, the cross-level contextual representation module, is concretely introduced to explain the formation of the superpixel enhanced representations. After computing the superpixels' representations, the relationship between each pixel and each superpixel region is calculated as

w_{ik} = g(X_i, F_k),

where g(X_i, F_k) serves as the relation function, formed as

g(X_i, F_k) = θ(X_i)^T ϕ(F_k),

where θ(·) and ϕ(·) are transformation functions implemented by a 1 × 1 convolution followed by BN and ReLU layers. Therefore, the superpixel contextual representation F_sp(i) of pixel p_i is

F_sp(i) = ρ( Σ_{k=1}^{K} w_{ik} δ(F_k) ),

where δ(·) and ρ(·) are transformation functions implemented by a 1 × 1 convolution followed by BN and ReLU layers and K is the maximum number of categories. Overall, the calculation of the cross-level correlations and the superpixel context is inspired by the non-local fashion [38]. Finally, CCRM produces the superpixel enhanced representations by aggregating two vital parts of context: (1) the original pixel-level representations and (2) the superpixel contextual representations. Formally,

F_sr(i) = Cat(X_i, F_sp(i)),

where F_sr(i) denotes the superpixel enhanced representation of pixel p_i, X_i is the original pixel-level representation, F_sp(i) refers to the superpixel contextual representation of p_i and Cat(·) denotes the concatenation process.
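The CCRM computations can be sketched as follows; the shared ReLU `transform` is a simplifying stand-in for the paper's 1 × 1 convolution + BN + ReLU functions θ, ϕ, δ and ρ, and the 1 × 1 fusion convolution after concatenation is omitted:

```python
import numpy as np

def ccrm(X, F, transform=lambda z: np.maximum(z, 0.0)):
    """Sketch of the cross-level contextual representation (assumed shapes).

    X: (N, C) pixel features; F: (K, C) superpixel features.
    w[i, k] relates pixel i to superpixel k (normalized over the K regions);
    F_sp injects the superpixel context and the result is concatenated
    with the original pixel features to form F_sr.
    """
    theta, phi = transform(X), transform(F)        # stand-ins for theta, phi
    w = theta @ phi.T                              # (N, K) relations w_ik
    w = np.exp(w - w.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)           # normalize over K regions
    F_sp = transform(w @ transform(F))             # (N, C) superpixel context
    return np.concatenate([X, F_sp], axis=1)       # (N, 2C) enhanced F_sr
```

The structure mirrors the non-local block, except the keys and values come from the K superpixel representations instead of all N pixels, which is what makes the correlation cross-level.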

Hybrid Representation Enhancement Module
The fusion of superpixel enhanced representations and self-attentive pixel-level representations is of great significance to comprehensively leverage contextual information.
Non-local blocks, a kind of self-attention module, have been examined and evaluated for feature optimization [38]. This block helps the network capture and retain positional dependencies, yielding the desired enhancement of representations.
With the intention of refining the learnt representations, the self-attentive representations are beneficial alongside the superpixel context. Thereby, another concatenation operation is embedded, which facilitates the representations further. Formally,

F_r(i) = Cat(F_sp(i), F_n(i)),

where F_r(i) denotes the refined representation of pixel p_i, F_sp(i) is the superpixel contextual representation of pixel p_i, F_n(i) refers to the self-attentive representation of pixel p_i produced by the non-local block and Cat(·) denotes the concatenation operation. Therefore, the output representations of HREM hybridize the original pixel-level, superpixel contextual and self-attentive representations. For the sake of understanding, the 1 × 1 convolution after concatenation is hidden in the figure. Finally, F_r is fed forward to be up-sampled and classified. HREM uses a simple yet effective concatenation operation to fuse the two refined representations: one derives from the self-attention module with the contextual information of position-wise dependencies, and the other comes from CCRM by injecting the pixel-superpixel correlations. Therefore, the comprehensive hybrid representations provide reasonable and distinguishable cues for dense prediction.
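A minimal sketch of the HREM fusion, where the matrix W is a hypothetical stand-in for the hidden 1 × 1 convolution after concatenation:

```python
import numpy as np

def hrem_fuse(F_sp, F_n, W):
    """Sketch of HREM: concatenate the two refined representations and
    apply the hidden 1x1 convolution (here a plain matrix, an assumption).

    F_sp: (N, C) superpixel contextual features from CCRM.
    F_n:  (N, C) self-attentive features from the non-local branch.
    W:    (2C, C) fusion weights standing in for the 1x1 convolution.
    """
    F_cat = np.concatenate([F_sp, F_n], axis=1)  # (N, 2C) hybrid features
    return F_cat @ W                             # (N, C) refined output F_r
```

Because concatenation followed by a linear map equals two independent linear maps summed, the fusion is equivalent to a learned weighted blend of the two branches.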

Datasets
To evaluate the performance of the proposed method, extensive experiments are conducted on two ISPRS benchmarks. The data properties are presented in Table 1.

ISPRS Vaihingen Dataset
The public 2D semantic labeling Vaihingen dataset is released by the International Society for Photogrammetry and Remote Sensing [27]. It contains high-resolution true orthophoto tiles and corresponding digital surface models as well as labeled ground truth. Each tile consists of three spectral bands: red (R), green (G) and near infrared (NIR). The spatial size is around 2500 × 2000 pixels with a GSD (ground sample distance) of 9 cm. The 16 available images are partitioned randomly, with 11 images for training and 5 for validation and test. The labeled ground truth comprises 6 categories: impervious surfaces, building, low vegetation, tree, car and clutter/background.

ISPRS Potsdam Dataset
The 2D semantic labeling Potsdam dataset [28] is composed of 38 high-resolution images of size 6000 × 6000 pixels, with a spatial resolution of 5 cm. Similarly, 6 categories are labeled. The 24 available images are divided into a training set and a validation set; the test set is the same as the validation set.

Implementation Details
Considering the practical data properties, the matched digital surface models are not involved in the experiments. The same sub-patch size and data augmentations are applied to the raw data of both datasets.
In the experiments, the settings of the hyper-parameters are listed in Table 2. Essentially, the backbone is ResNet-101, which is indicated by the black dotted box in Figure 4. All the models are implemented on the PyTorch framework, version 1.4.1, with an NVIDIA Tesla V100-32GB graphics card under a Linux OS.

Evaluation Metrics
Two numerical metrics, OA (overall accuracy) and mIoU (mean intersection over union), are chosen to quantitatively evaluate the performance. Formally,

OA = (TP + TN) / (TP + FP + FN + TN), (17)

mIoU = (1/K) Σ_{k=1}^{K} TP_k / (TP_k + FP_k + FN_k), (18)

where TP denotes the number of true positives, FP the number of false positives, FN the number of false negatives and TN the number of true negatives, and mIoU averages the per-class IoU over the K categories.
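These metrics can be computed directly from a confusion matrix; a minimal NumPy sketch (assuming rows index the ground-truth classes and columns the predictions):

```python
import numpy as np

def oa_miou(conf):
    """OA and mIoU from a KxK confusion matrix.

    conf[i, j] counts pixels with ground-truth class i predicted as j.
    """
    tp = np.diag(conf).astype(float)    # correctly classified per class
    fp = conf.sum(axis=0) - tp          # predicted as k but labeled otherwise
    fn = conf.sum(axis=1) - tp          # labeled k but predicted otherwise
    oa = tp.sum() / conf.sum()          # overall accuracy
    iou = tp / (tp + fp + fn)           # per-class intersection over union
    return oa, iou.mean()               # OA and mean IoU over K classes
```

For a 2-class matrix [[3, 1], [1, 5]], for example, OA is 8/10 and the per-class IoUs are 3/5 and 5/7 before averaging.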

Comparison to State-of-the-Art Methods
In this part, several mainstream methods are compared to analyze the performance. The comparative methods include typical encoder-decoder, attention-based and SOTA RSI semantic segmentation networks.

Results on Vaihingen Test Set
As outlined in the introduction, we reasoned that complementary context helps refine the learned representations, which is essential to provide sufficient cues for dense predictions. Table 3 reports the results on the Vaihingen test set. It is evident that the results of HCANet are in exceptionally good agreement with expectation, and HCANet shows high consistency with the ground truth. Statistically, the OA and mIoU of HCANet, 83.83% and 75.46%, respectively, are the highest. Meanwhile, the category-wise accuracy and IoU also verify the competitive performance of HCANet. Compared to SegNet and U-Net, the OA and mIoU are dramatically increased by about 9% and 12%. It should be noted that SegNet and U-Net were initially devised for natural images and biomedical images, for which the distinguishability of representations is undemanding. Moreover, DeepLab V3+ utilizes atrous convolution to enlarge the receptive field, which paves an innovative way to capture more contextual information. Likewise, attempting to generate smooth object boundaries, ResUNet-a adopts a complicated multi-task inference strategy along with multiple tricks that carry a large number of parameters, resulting in a higher OA and mIoU; specifically, the building and clutter classes are delineated with the highest accuracies of 97% and 81%. Although satisfactory results are yielded, DeepLab V3+ and ResUNet-a are criticized for their prohibitive computation and GPU memory occupation. Striving to further capture and leverage contextual information, attention-based methods have been developed to improve the segmentation performance.
Unfortunately, CBAM, DANet and NLNet are slightly problematic in enhancing the separability of representations extracted from remote sensing images. This is because of the disparity between remote sensing and natural imagery: the former covers a variety of geomorphological details and is acquired under diverse conditions, so the demanding geo-distinguishability of learned representations is challenging. These three models were originally devised for natural images; facing remote sensing imagery, their robustness is deficient. NLNet, an advanced self-attention framework, only reaches about 82% in OA and 71% in mIoU. Furthermore, OCNet takes advantage of context to facilitate the performance, which is feasible both on natural and remote sensing images, and its OA and mIoU reach more than 83% and 71%. Inspired by the self-attention mechanism, SCAttNet is proposed for remote sensing image semantic segmentation; its results are acceptable with a small amount of extra matrix multiplication. Figure 6 presents the visual comparison on random samples from the Vaihingen test set. Compared to other methods, HCANet displays progressive performance by learning integrated and complete representations, and the confusable boundaries and pixels are considerably better classified. For example, HCANet can efficiently segment each car without adhesion and delineate a complete shape, whereas the other methods exhibit adhesion and incompleteness of cars.
In summary, HCANet comprehensively resorts to the pixel-superpixel correlations and pixel-wise attentive maps, revealing a way to refine learned representations. Undeniably, the OA and mIoU are as good as can be expected by hybridizing cross-level contextual and attentive representations of remote sensing images.

Results on Potsdam Test Set
As presented in Table 4, the results on the Potsdam test set are collected. It is evident from the results that, overall, there is a marked increase compared to other models, and the behavior is almost consistent with the Vaihingen test set. Benefiting from the sufficient data size for training, the OA and mIoU exhibit a further increase of more than 1%. Nevertheless, the accuracy of impervious surfaces, building and clutter is marginally lower than DANet, ResUNet-a and SCAttNet, respectively. It could be concluded that uncertainties and fluctuations are the dominant reasons; furthermore, the gaps are narrow and negligible. After all, the OA and mIoU overwhelmingly indicate the superiority of HCANet. The generic attention-based methods are of limited significance here, which is determined by their impractical application to remote sensing image semantic segmentation, a task that goes beyond an essential computer vision task. Then, ResUNet-a and SCAttNet incorporate the properties of the remote sensing domain, lending support to accurately labeling pixels. It is therefore desirable to understand the easily-confused pixels as well as possible. Figure 7 compares the segmentation performance on random samples of the test set. Although the attention-based methods, SCAttNet and ResUNet-a, have a certain degree of competitiveness, HCANet stands out due to its refined representations with high distinguishability, separating heterogeneous pixels readily. For example, the delineation of low vegetation and trees is relatively rough for the other methods, while HCANet exhibits acceptable performance. Overall, HCANet captures and fuses the superpixel context, pixel-superpixel relationships and self-attentive maps, providing compelling cues for pixel-wise predictions. Hence, the numerical results show that the predictions agree well with the ground truth.

Ablation Study on CCRM
As reported in Section 1, refining learned representations with contextual information is profitable for producing sufficient cues for pixel-level predictions. Among the context, the pixel-superpixel correlation is an essential element, and HCANet captures and incorporates this kind of context in a self-attentive way. To comprehensively verify its efficiency and superiority, an ablation study on CCRM is designed and implemented, where the non-CCRM version of HCANet is denoted as nc-HCANet. Table 5 lists the OA/mIoU of the two models on the test datasets. Compared to nc-HCANet, the OA increases by about 5% and the mIoU by about 4% correspondingly; the incorporation of superpixel enhanced representations lends support to boosting the performance. In addition to the test results, the training loss (cross-entropy) per epoch of the two models is collected and illustrated in Figures 8 and 9. Accordingly, the training loss of HCANet is much lower than that of nc-HCANet both for the Vaihingen (Figure 8) and Potsdam (Figure 9) training sets. After 500 epochs of training, the CCRM module drops the loss to 0.0349 while nc-HCANet still has 0.185 for the Vaihingen dataset. As to the Potsdam dataset, CCRM helps the network decrease the training loss from 0.1973 to 0.0271. As a result, the reduction in loss means the inconsistency of the probability distributions tends to be more acceptable. Apart from the accuracy comparison, the training time per epoch and the inference time for a test image with a spatial size of 256 × 256 are collected to emphasize the efficiency. As evident from Table 6, the training time per epoch is averaged over 500 epochs. Concerning the significant increase in accuracy, the slight rise in time cost is acceptable. The inference time is obtained by equally dividing the time cost of predicting the whole test set. For a single 256 × 256 image, the inference time grows by only 0.4 ms with the incorporation of CCRM.
In summary, the improvements are intuitive and remarkable, owing to the utilization of more comprehensive contextual information, especially the pixel-superpixel correlation. Meanwhile, both training time and inference time are entirely negligible.

Conclusions
Striving to enhance the distinguishability of representations learned by semantic segmentation neural networks, HCANet is devised and implemented. First of all, inspired by the self-attention mechanism, the cross-level contextual representation module (CCRM) is designed to model the pixel-superpixel dependencies, which are injected to produce superpixel enhanced representations. Moreover, the hybrid representation enhancement module (HREM) concatenates the self-attentive representations produced by the non-local block with the superpixel enhanced representations generated by CCRM. Furthermore, DUpsampling is embedded into the decoder stage to recover feature maps to the original spatial resolution losslessly.
The extensive experimental results provide straightforward evidence that complementary and complete contextual information enables high accuracy in pixel-wise semantic labeling, and that both short-range and long-range dependencies should be emphasized. Future work will focus on cross-spatial-resolution feature fusion in depth at an inexpensive time and space cost. In addition to capturing the pixel-superpixel correlation of encoded feature maps, the shallow encoders' output feature maps should be further exploited.

Institutional Review Board Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.