CMANet: Cross-Modality Attention Network for Indoor-Scene Semantic Segmentation

Indoor-scene semantic segmentation is of great significance to indoor navigation, high-precision map creation, route planning, etc. However, incorporating RGB and HHA images for indoor-scene semantic segmentation is a promising yet challenging task, due to the diversity of textures and structures and the disparity between the two modalities in physical significance. In this paper, we propose a Cross-Modality Attention Network (CMANet) that facilitates the extraction of both RGB and HHA features and enhances cross-modality feature integration. CMANet is constructed under the encoder–decoder architecture. The encoder consists of two parallel branches that successively extract latent modality features from RGB and HHA images, respectively. In particular, a novel self-attention mechanism-based Cross-Modality Refine Gate (CMRG) is presented, which bridges the two branches. More importantly, the CMRG achieves cross-modality feature fusion and produces refined aggregated features; it serves as the most crucial part of CMANet. The decoder is a multi-stage up-sampled backbone composed of different residual blocks at each up-sampling stage. Furthermore, bi-directional multi-step propagation and pyramid supervision are applied to assist the learning process. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments are conducted on the NYUDv2 and SUN RGB-D datasets. Experimental results demonstrate that our method outperforms existing methods on indoor semantic-segmentation tasks.


Introduction
Semantic segmentation is one of the essential techniques in scene understanding. It aims to categorize each pixel and assists in the identification and segmentation of scene elements. Early approaches relied on handcrafted features and classical machine-learning algorithms [1][2][3]; deep learning now dominates current research [4][5][6][7]. For semantic-segmentation tasks, it is well recognized that indoor scenes exhibit distinct characteristics compared with outdoor scenarios, and services relying on indoor-scene semantic segmentation (indoor navigation, intelligent furniture, etc.) are in strong demand. However, illumination variations, overlaps among objects, and imbalanced representations of object categories in indoor scenes often make it impossible to distinguish numerous objects using RGB images alone [10]. Facing these challenges, some works apply notable image features, e.g., edge information [8,9], to assist the segmentation.
Adding the depth information to traditional RGB information with low-cost RGB-D sensors is a conventional way to achieve better performance of indoor-scene semantic segmentation. Nevertheless, the depth images contain only range measurement information, which makes them challenging for feature extraction. As a result, it is natural to employ the three-channel HHA images (the three channels represent the horizontal disparity, the height above ground, and the angle the pixel's local surface normal makes with the inferred gravity direction) [11], which are coded from one-channel depth images. The application of HHA images is demonstrated to be more efficient and robust than one-channel depth images [4,12,13]. The comparison of RGB, depth, and HHA images is shown in Figure 1. In order to improve feature embedding, it is necessary to consider the essence of data modality, i.e., the quality, attribute, or circumstance of each type. The RGB images describe lightness and saturation, which mainly represent appearance information, whereas the HHA images mainly represent geometric information [14]. As a result, RGB and HHA can complement one another, with HHA discriminating instances and contexts that share similar colors and textures [15], while RGB can assist with indistinguishable structures. As illustrated in Figure 2, the cushion on the sofa has a similar texture to the sofa, so it can be distinguished by the HHA images, and the pictures on the wall have a similar structure as the wall, so they can be distinguished by the RGB images. It is evident that the combination of both modality features from RGB and HHA images can effectively enhance the efficiency of feature embedding. On the basis of existing deep-learning methods for RGB-D semantic segmentation, two open problems are still widely discussed: how to fuse multi-modality (RGB and depth) features adeptly and how to improve the robustness w.r.t. imperfect data.
The first problem is caused by the substantial variations between RGB and depth modalities [12], which lead to inappropriate feature fusion and inferior performance. Some works utilize the depth image as an extra channel [4,16,17], whereas some works extract features independently and fuse them via CNN-based architecture [14,18,19]. Despite this, these methods can only correlate RGB and depth information to a limited extent. Additionally, measurement noise, view angle, and occlusion boundary may affect the RGB sensors (e.g., resulting in overexposure) and depth sensors (e.g., resulting in data loss) during the data collection period, causing the second problem. Therefore, some works [20,21] apply pre-processing to RGB and depth images for denoising and filling in lost depth. However, pre-processing is not stable enough for feature extraction.
In this paper, we propose a novel Cross-Modality Attention Network (CMANet) for indoor-scene semantic segmentation. CMANet is designed under the encoder-decoder architecture. The encoder aims to build multi-step interaction between two modalities and extracts multi-level features from RGB and HHA images. Meanwhile, the decoder enhances the efficiency of feature representation and restores the feature maps to the corresponding resolution step-by-step. It is worth mentioning that the cross-modality ability ensures that CMANet can utilize the essence of different modalities effectively and fuse the RGB and HHA features adeptly. To be specific, the encoder has two parallel branches to extract RGB and HHA features. The parallel design scheme minimizes the unfavorable effects of mutual influence between RGB and HHA features. More importantly, to achieve appropriate feature fusion and filter out the noise of images, we propose the Cross-Modality Refine Gate (CMRG) based on the attention mechanism, weighting the crucial features in the first stage and aggregating them in the second stage. Moreover, the CMRG utilizes the features from different modalities, instead of relying on one modality, which increases the model robustness and achieves great performance. Additionally, the output of the CMRG is propagated in a multi-step bi-directional operation into both branches to enhance the encoding of features. The decoder is an up-sampled ResNet, which gradually restores spatial resolution and integrates the encoding stage information. Additionally, we conduct pyramid supervision to improve the final semantic performance.
The main contributions of this paper are as follows: • We propose a novel RGB-HHA semantic-segmentation network: CMANet. On one hand, CMANet can effectively fuse and extract the cross-modality features; on the other hand, it improves the robustness and efficiency of feature embedding; • We design the CMRG based on the self-attention mechanism, which not only filters the noise and highlights advantages in the feature maps, but also facilitates the representation and aggregation of cross-modality information; • We conduct extensive experiments on the challenging NYUDv2 and SUN RGB-D datasets, on which CMANet evidences its robustness and effectiveness on indoor-scene semantic segmentation.
The rest of the paper is organized as follows. Section 2 briefly reviews the related works. Section 3 gives a detailed description of CMANet, which includes the network architecture, the processing modules, the training and optimization strategies, etc. Section 4 provides the experiment settings, evaluation methods, and the results, along with comparing the proposed method with existing methods and analyzing the strengths and limitations of our approach. Section 5 closes the paper with a brief conclusion and future research considerations.

Related Works
In this section, we briefly review the literature relevant to our work. The attention mechanism part focuses on the utilization of channel attention and spatial attention, and the RGB-D semantic-segmentation part elaborates the existing methods and essential concerns of cross-modality fusion.

Attention Mechanism
The attention mechanism is derived from the behavior of humans, which is to ignore irrelevant information and pay attention to what is essential. This strategy is commonly applied in the deep-learning field and performs well on Natural Language Processing (NLP) [22][23][24] and Computer Vision (CV) [15,25,26] tasks. Considering the attention mechanism in the CV field, we divide it into two categories based on enhancement type: channel attention (what to focus on) and spatial attention (where to focus on).
In deep learning, the feature maps that propagate in different channels usually represent different features [27], such as texture, boundary, and shape. As a result, channel attention assigns different weights to different channels, determining what features are important (what to focus on), and is widely used [28][29][30][31][32][33]. SENet [31] provides the first step to improving the representation ability of feature maps through channel attention by using the Squeeze-and-Excitation (SE) block. For better modeling capability, GSoP-Net [32] generates attention maps not only by utilizing first-order statistics (i.e., the global pooling descriptor) but by extracting high-order statistics as well. SKNet [33] combines different receptive field features with multiple branches to adapt the weights according to the input feature maps. These methods open up a novel aspect to feature extraction where the valuable information from existing channels is emphasized rather than continuously increasing the number of parameters. Thus, we adopt the channel-attention mechanism in the CMRG to refine the cross-modality feature maps so as to reduce the noise and improve feature representation.
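As a concrete illustration of "what to focus on", the following NumPy sketch implements an SE-style channel-attention gate in the spirit of the SE block [31]; all names, shapes, and the ReLU bottleneck are our own illustrative choices, not code from any of the cited works.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feat, w0, w1):
    """SE-style channel attention: squeeze (global average pooling),
    excitation (bottleneck MLP + sigmoid), then channel-wise rescaling.
    feat: (C, H, W); w0: (C//r, C); w1: (C, C//r)."""
    desc = feat.mean(axis=(1, 2))                    # squeeze -> (C,)
    gate = sigmoid(w1 @ np.maximum(w0 @ desc, 0.0))  # excitation -> (C,)
    return feat * gate[:, None, None]                # broadcast over H, W

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.normal(size=(C, H, W))
w0 = rng.normal(size=(C // r, C))
w1 = rng.normal(size=(C, C // r))
out = se_channel_attention(feat, w0, w1)
```

Since the gate values lie in (0, 1), each channel is attenuated rather than amplified; the learned weights decide which channels are emphasized.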
Regarding one feature map, the region attribute can be represented by pixel values in different locations. The spatial attention can adeptly select the interrelated and meaningful regions across the entire map, then reinforce them by allocating higher weights (where to focus on), which has significance for image processing [34][35][36][37][38]. AGNet [36] applies Attention Gates (AGs) for medical image segmentation. The AGs restrain irrelevant regions and highlight significant features via implicit learning. PSANet [37] relates all the positions on the feature map to each other via a self-adaptively learned attention mask, thus alleviating the limitations of convolutional filters with small kernel size. Additionally, transformers have proven to be very effective with NLP and CV tasks [22,39]; therefore, the Vision Transformer (ViT) [38] processes images by cropping each image into 16 × 16 small samples, which is similar to splitting a sentence into several words. The spatial attention mainly concerns the crucial regions in one feature map, which means it can enhance the relationship among pixels. Inspired by these works, we consider the fact that RGB and HHA have different geometric properties and propose a cross-modality spatial-attention mechanism that can enhance region characteristics.
In order to better refine the feature maps, CBAM [40] combines the channel and spatial-attention modules, which will be discussed later. Motivated by this, we employ the sequential deployment of the channel and spatial modules for cross-modality features. In this way, the acquisition of what is essential comes not only from latent learning, but also from different physical characteristics from different modalities.

RGB-D Semantic Segmentation
For RGB-D semantic segmentation, category labels are assigned to each pixel, and performance is further enhanced via cross-modality feature fusion. Based on the existing successful deep-learning RGB semantic-segmentation structures [4][5][6][7], several works on RGB-D semantic segmentation are presented and demonstrate their reliability and practicality.
Since the RGB and depth modalities must be fully utilized in RGB-D semantic segmentation, the cross-modality fusion is crucial. The fusion can be achieved via elementwise summation, concatenation, or a combination of both, and adapted by latent learning [4,13,14,18,19,[41][42][43]. For learning common and specific parts from cross-modality features, ref. [14] designs the transformation network between the encoder and decoder, and extracts corresponding features via Multiple Kernel Maximum Mean Discrepancy (MK-MMD). RedNet [18] integrates ResNet [44] and encoder-decoder architecture. In addition, it adds skip connections to optimize decoding and applies pyramid supervision to avoid overfitting. Learning from the structure of RefineNet [45], RDF [13] fuses cross-modality features by applying a Multi-Modal feature Fusion (MMF) network, which is composed of several residual convolutional layers and element-wise summation. To avoid noisy and chaotic information affecting the effectiveness of the network, RAFNet [43] employs a three-stem branch encoder to process RGB, depth, and fusion features, respectively. Meanwhile, RAFNet utilizes the channel-attention model for refinement. The RGB-D semantic-segmentation methods mostly inherit the former deep-learning methods used for image processing, some of which insert the attention mechanism to enhance the features adeptly by filtering noisy and chaotic information. However, the relationship between RGB and depth features has received little interest.
Thus, we propose several processing modules for cross-modality features based on residual convolution and attention mechanism. In this structure, the residual convolution contributes to the latent learning and adjusts adaptively for feature maps, while the attention mechanism contributes to information refinement.

The Cross-Modality Attention Network
An effective cross-modality network aims to attenuate the image noise and combine the benefits of both RGB and HHA features. Regarding this, a novel attention-based mechanism is introduced in our proposed model (CMANet), and it results in the so-called CMRG design. Additionally, bi-directional multi-step propagation and pyramid-supervision training strategies facilitate the performance. In this section, we will present detailed descriptions of the proposed method in terms of the overall framework, the structure of the Cross-Modality Refine Gate (CMRG), the processing modules, the configuration of the encoder and decoder, and the pyramid-supervision strategy.

Network Architecture
In this subsection, we detail the structure of our CMANet, which includes the overall framework, the processing modules for cross-modality features, and the configuration of the encoder and the decoder.

Overall Framework
Influenced by SegNet [5], our proposed structure is based on encoder-decoder architecture. In the encoder part, latent modality features are extracted from RGB and HHA images, which serve as the input of the decoder. Following that, the decoder gradually reconstructs the high-dimensional features to the original spatial resolution and integrates the encoding stage information through skip connections to produce the results of semantic segmentation. In this structure, the encoder extracts the high-level features while the decoder restores the spatial information, which alleviates the problem of chaos in the semantic assignment in pixel-wise classification.
The architecture of CMANet is presented in Figure 3. The encoder has two CNN branches w.r.t. RGB and HHA. Each branch successively extracts latent modality features from RGB and HHA images. Here, ResNet [44] serves as the backbone for both branches.
In order to enhance the extraction and fusion of cross-modality characteristics, we present the Cross-Modality Refine Gate (CMRG), which is designed based on the self-attention mechanism. The CMRG module receives pairs of encoding feature maps from the RGB and HHA branches (e.g., the outputs of RGB-Layer1 and HHA-Layer1) and produces aggregated features. Regarding the cross-modality fusion, there are several CMRG modules, which correspond to distinct encoding stages. As a result, we can extract more valuable features with the availability of an encoder capable of fusing and boosting features via the separate implementation of encoding and fusion with CMRGs.

After encoding, CMRG5 refines the outputs of RGB-Layer5 and HHA-Layer5 to produce the final pair of feature maps. However, this final pair constitutes the outputs of the encoder, which have different representational capabilities because they are derived from separate modalities; hence, they cannot be joined by element-wise addition. Consequently, we employ the Context module to aggregate and refine the pair of high-level feature maps from the two modalities.
The decoder component is an up-sampled ResNet backbone consisting of five residual blocks, each of which comprises several Up-sampled Residual Units (URUs). In this structure, the decoder recovers the feature maps to the original spatial resolution stage-by-stage via transposed convolution and combines the features from each encoding stage via Agent modules as skip connections.
The low-level features in the decoder have higher resolution with more position and detail information, whereas the high-level features contain rich semantic and category information. To improve the use of multi-level features and alleviate the gradient vanishing problem, we generate multi-scale semantic maps from the five stages of decoding features for pyramid supervision.

Processing Modules
Here, we outline the processing modules that aid in the propagation of features.

Residual Units
Residual learning can effectively prevent the degradation of the model and alleviate the gradient vanishing issue during back-propagation. The structure of the residual units is shown in Figure 4. The Residual Convolutional Unit (RCU) and the Chained Residual Pooling (CRP) are utilized in the Agent and Context modules; both are sub-components of RefineNet [45]. The Downsample Residual Unit (DRU) [18] and Upsample Residual Unit (URU) [44] are applied in the encoder and decoder, respectively. It is worth mentioning that, while the Chained Residual Pooling (CRP) displayed in Figure 4b has two blocks, we only utilize one block in our subsequent modules, since one is sufficient for refinement.
Agent Module
The skip connections between the encoder and decoder are utilized to replenish the detail lost through downsampling. Hence, the Agent modules provide intermediate addition of the multi-stage feature maps from the encoder to the corresponding decoder layers. The structure of the Agent module is illustrated in Figure 5a. After receiving two feature maps from the RGB and HHA branches, the Agent module first utilizes a 1 × 1 convolution layer to mitigate the explosion of parameters by reducing the number of dimensions. Then, each feature map goes through an RCU to adapt to the element-wise sum fusion. The combined features are fed to a one-block CRP, where a pooling operation spreads large activation values and an additional convolution layer learns the significance of the pooled features. Finally, before passing to the decoder, a Convolutional Block Attention Module (CBAM) is employed to filter and enhance the features. Note that, in order to improve propagation, we couple the CBAM with residual learning via a shortcut connection.

Context Module
The Context module is applied to fuse the final outputs of the two CNN branches (RGB and HHA). As illustrated in Figure 5b, the Context module has similar components as the Agent module, but has additional RCUs and a 3 × 3 convolution layer. The first 1 × 1 convolution layer reduces the dimension from 2048 to 512. Then, two feature maps are fused by element-wise summation and finally output to the decoder after refinement.

Encoder and Decoder Configuration
The encoder and decoder use different types and numbers of residual units. The encoder, with a ResNet-50 backbone, utilizes the Downsample Residual Unit (DRU) illustrated in Figure 4c, whose 1 × 1 convolution layer has a stride of 2. In the decoder, the residual units upsample the feature maps, as illustrated in Figure 4d, where the 2 × 2 convolution layer and the second 3 × 3 convolution layer both have a stride of 1/2. The encoder and decoder configuration is shown in Table 1, where Input denotes the number of input feature channels, Output denotes the number of output feature channels, and Units denotes the number of residual units in each layer.

Cross-Modality Refine Gate
For RGB and HHA data, the former mainly record appearance information (e.g., color, texture) that can emphasize the visual boundary, whereas the latter primarily capture shape information (e.g., structure, spatial) that can highlight the geometric boundary. Thus, it is challenging to fully utilize RGB and HHA images via fusion and enhancement of cross-modality features. We propose the Cross-Modality Refine Gate (CMRG) based on the attention mechanism to aggregate features from multiple modalities.

Convolutional Block Attention Module
As discussed in Section 2, the attention mechanism has been extensively used in the CV field for determining where the focus should be placed and for deciding what is valuable. In particular, the Convolutional Block Attention Module (CBAM) combines channel- and spatial-attention mechanisms during propagation, and achieves outstanding performance in feature extraction [40]. Before modifying the CBAM, we first discuss its overall structure and sub-module details.
The structures of the CBAM and its sub-modules are shown in Figure 6. The CBAM refines the input feature maps by sequentially applying one channel-attention module and one spatial-attention module, as illustrated in Figure 6a. Given input feature maps F, the CBAM first infers a 1D channel-attention map M_c using the channel-attention module and refines the feature maps via channel-wise multiplication; then, it infers a 2D spatial-attention map M_s using the spatial-attention module and performs spatial-wise multiplication on the channel-refined feature maps to yield the output. The CBAM process can be formulated as follows:

F′ = M_c(F) ⊗ F,
F_out = M_s(F′) ⊗ F′,

where F ∈ R^{C×H×W} represents the input feature maps, F′ represents the channel-refined feature maps, F_out represents the final output, and ⊗ denotes element-wise multiplication.
In the multiplication procedure, the attention values are broadcast as follows: channel-attention values are broadcast along the spatial dimension (channel-wise multiplication), while spatial-attention values are broadcast along the channel dimension (spatial-wise multiplication).

Figure 6b,c describe the detailed structures of the channel-attention module and the spatial-attention module. As shown in Figure 6b, the channel-attention module first aggregates the spatial information of the input features F into two descriptors, average-pooled features F^c_avg and max-pooled features F^c_max, via average-pooling and max-pooling operations, respectively. Then, both descriptors are supplied to a shared network, a multi-layer perceptron (MLP) with one hidden layer. A reduction ratio r is set in the shared MLP in order to decrease the parameter overhead. After propagation through the shared MLP, the descriptors are fused by element-wise summation and modified by a sigmoid function to produce the channel-attention map M_c. Finally, the channel-refined features F′ are generated by multiplying the attention map with the input features. The channel-attention map is computed as follows:

M_c(F) = σ(MLP(F^c_avg) + MLP(F^c_max)) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max))),

where σ denotes the sigmoid function, W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} represent the MLP weights, and r is the reduction ratio. As specified, the MLP weights are shared for both inputs. M_c ∈ R^{C×1×1} represents the channel-attention map, F^c_avg ∈ R^{C×1×1} represents the average-pooled features, and F^c_max ∈ R^{C×1×1} represents the max-pooled features.

The spatial-attention module is presented in Figure 6c. Similar to the aforementioned channel-attention module, the spatial-attention module first aggregates the channel information of the refined features into two maps, F^s_avg and F^s_max, via average-pooling and max-pooling along the channel axis. The two maps are then merged by concatenation.
After that, the concatenated maps are convolved by a standard convolution layer to generate the spatial-attention map M_s; the final refined features are again produced by multiplication. The spatial-attention map is computed as follows:

M_s(F′) = σ(f^{7×7}([F^s_avg; F^s_max])),

where σ denotes the sigmoid function, f^{7×7} represents a 7 × 7 convolutional layer, and [;] refers to concatenation. M_s ∈ R^{1×H×W} represents the spatial-attention map, and F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W} represent the pooled maps.
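The two-step CBAM refinement above can be sketched in NumPy as follows; the naive convolution helper, toy shapes, and random weights are illustrative assumptions, not the reference implementation of [40].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive 'same'-padded convolution. x: (Cin, H, W); kernel: (Cin, k, k)."""
    _, k, _ = kernel.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros(x.shape[1:])
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def cbam(feat, w0, w1, k_spatial):
    # Channel attention: shared MLP over avg- and max-pooled descriptors.
    mlp = lambda d: w1 @ np.maximum(w0 @ d, 0.0)
    m_c = sigmoid(mlp(feat.mean(axis=(1, 2))) + mlp(feat.max(axis=(1, 2))))
    f_ref = feat * m_c[:, None, None]                  # channel-refined F'
    # Spatial attention: pool along channels, concatenate, 7x7 convolution.
    pooled = np.stack([f_ref.mean(axis=0), f_ref.max(axis=0)])  # (2, H, W)
    m_s = sigmoid(conv2d_same(pooled, k_spatial))               # (H, W)
    return f_ref * m_s[None, :, :]                     # final output F_out

rng = np.random.default_rng(0)
C, H, W, r = 8, 10, 10, 2
feat = rng.normal(size=(C, H, W))
w0, w1 = rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r))
k_spatial = rng.normal(size=(2, 7, 7)) * 0.1
out = cbam(feat, w0, w1, k_spatial)
```

Because both attention maps pass through a sigmoid, the refinement is purely multiplicative gating: no value in the output exceeds its counterpart in the input in magnitude.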

Structure of Cross-Modality Refine Gate
Although the CBAM exhibits good performance w.r.t. the extraction and representation of features through the channel- and spatial-attention mechanisms, it is inadequate for cross-modality features. Inspired by the CBAM, where the channel- and spatial-attention modules are sequentially employed (and the attention mechanisms are applied to each modality separately), we design a unique attention module, the Cross-Modality Refine Gate (CMRG), to deal with cross-modality features. The structure of the CMRG is illustrated in Figure 7. Differing from the CBAM, the CMRG takes multi-modality features as input, instead of receiving only a single set of feature maps. The self-attention mechanism in the CMRG utilizes cross-modality information to produce the attention maps, which rely on only one modality in the CBAM. As shown in Figure 7, the CMRG consists of two parts: the channel-attention module and the spatial-attention module. The input of the CMRG is a pair of feature maps, F_RGB and F_HHA, which are derived from the RGB and HHA branches, respectively. Firstly, the CMRG utilizes the channel-attention module to infer two 1D channel-attention maps, M^c_RGB and M^c_HHA, to refine F_RGB and F_HHA via channel-wise multiplication. By sharing descriptors between the modalities and inferring an exclusive attention map for each modality, the cross-modality channel-attention operation primarily filters out noisy and chaotic information and improves the effective characteristics of the original feature maps from the channel aspect. Then, the CMRG infers two 2D spatial-attention maps, M^s_RGB and M^s_HHA, to enhance the channel-refined feature maps F^cr_RGB and F^cr_HHA via spatial-wise multiplication. The cross-modality spatial attention primarily strengthens the association among pixels in the feature maps in order to put more attention on areas with similar characteristics, even when some of them are far from each other.
Finally, the output F_out is generated by adding the spatial-refined feature maps together. This process can be formulated as follows:

F^cr_RGB = M^c_RGB ⊗ F_RGB,   F^cr_HHA = M^c_HHA ⊗ F_HHA,
F^sr_RGB = M^s_RGB ⊗ F^cr_RGB,   F^sr_HHA = M^s_HHA ⊗ F^cr_HHA,
F_out = F^sr_RGB + F^sr_HHA,

where F_RGB ∈ R^{C×H×W} and F_HHA ∈ R^{C×H×W} represent the input feature maps, M^c_RGB ∈ R^{C×1×1} and M^c_HHA ∈ R^{C×1×1} represent the channel-attention maps, M^s_RGB ∈ R^{1×H×W} and M^s_HHA ∈ R^{1×H×W} represent the spatial-attention maps, F^cr_RGB ∈ R^{C×H×W} and F^cr_HHA ∈ R^{C×H×W} represent the channel-refined feature maps, F^sr_RGB and F^sr_HHA represent the spatial-refined feature maps, and ⊗ denotes element-wise multiplication. In the multiplication procedure, the attention values are broadcast in the same manner as in the CBAM.
As shown in Figure 7, the channel-attention module first aggregates each set of feature maps into two 1D descriptors (four in total) via average-pooling and max-pooling, among which F^c_RGB_avg and F^c_RGB_max are generated from the RGB feature maps, whereas F^c_HHA_avg and F^c_HHA_max are generated from the HHA feature maps. Then, the descriptors from the two modalities are concatenated to produce two cross-modality channel descriptors: F^c_avg and F^c_max. Both cross-modality descriptors are fed into two independent MLPs with one hidden layer: MLP_RGB and MLP_HHA. After each modality-specific MLP is applied to both descriptors, element-wise summation followed by a sigmoid function generates the modality-specific channel-attention maps M^c_RGB and M^c_HHA. The channel-refinement procedure in the CMRG can be formulated as follows:

F^c_avg = [F^c_RGB_avg; F^c_HHA_avg],   F^c_max = [F^c_RGB_max; F^c_HHA_max],
M^c_RGB = σ(MLP_RGB(F^c_avg) + MLP_RGB(F^c_max)),
M^c_HHA = σ(MLP_HHA(F^c_avg) + MLP_HHA(F^c_max)),

where σ denotes the sigmoid function and [;] denotes concatenation. It is worth mentioning that, differing from the CBAM, both MLPs have weights W_0 ∈ R^{2C/r×2C} and W_1 ∈ R^{C×2C/r}, where r is the reduction ratio. Furthermore, F^c_RGB_avg ∈ R^{C×1×1} and F^c_RGB_max ∈ R^{C×1×1} represent the RGB descriptors, F^c_HHA_avg ∈ R^{C×1×1} and F^c_HHA_max ∈ R^{C×1×1} represent the HHA descriptors, and F^c_avg ∈ R^{2C×1×1} and F^c_max ∈ R^{2C×1×1} represent the cross-modality descriptors.
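A NumPy sketch of this cross-modality channel attention follows; the hidden width 2C/r, the ReLU nonlinearity, and all toy shapes are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_mlp(w0, w1):
    """One-hidden-layer MLP closure; w0: (2C//r, 2C), w1: (C, 2C//r)."""
    return lambda d: w1 @ np.maximum(w0 @ d, 0.0)

def cross_modality_channel_attention(f_rgb, f_hha, mlp_rgb, mlp_hha):
    # Four 1D descriptors via global average- and max-pooling per modality.
    rgb_avg, rgb_max = f_rgb.mean(axis=(1, 2)), f_rgb.max(axis=(1, 2))
    hha_avg, hha_max = f_hha.mean(axis=(1, 2)), f_hha.max(axis=(1, 2))
    # Cross-modality descriptors of length 2C.
    c_avg = np.concatenate([rgb_avg, hha_avg])
    c_max = np.concatenate([rgb_max, hha_max])
    # Each modality-specific MLP sees BOTH modalities' statistics.
    m_rgb = sigmoid(mlp_rgb(c_avg) + mlp_rgb(c_max))   # (C,)
    m_hha = sigmoid(mlp_hha(c_avg) + mlp_hha(c_max))   # (C,)
    return f_rgb * m_rgb[:, None, None], f_hha * m_hha[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 6, 5, 5, 2
f_rgb, f_hha = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
mlp_rgb = make_mlp(rng.normal(size=(2 * C // r, 2 * C)),
                   rng.normal(size=(C, 2 * C // r)))
mlp_hha = make_mlp(rng.normal(size=(2 * C // r, 2 * C)),
                   rng.normal(size=(C, 2 * C // r)))
f_cr_rgb, f_cr_hha = cross_modality_channel_attention(f_rgb, f_hha,
                                                      mlp_rgb, mlp_hha)
```

The key difference from the CBAM is visible in the signature: the gate for each modality is computed from the concatenated statistics of both modalities.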
Following the optimization of the features by the channel-attention module, the pair of feature maps F^cr_RGB and F^cr_HHA are fed into the spatial-attention module. Similar to the implementation of channel attention, the spatial-attention module initially aggregates the cross-modality features via average-pooling and max-pooling along the channel axis, inferring four 2D maps, among which F^s_RGB_avg and F^s_RGB_max are generated from the refined RGB feature maps, while F^s_HHA_avg and F^s_HHA_max are generated from the refined HHA feature maps. Then, concatenation is applied to combine all four maps. The cross-modality map is then convolved by two independent standard convolution layers with a kernel size of 7 × 7 to generate the modality-specific spatial-attention maps M^s_RGB and M^s_HHA. The spatial-refinement procedure in the CMRG can be formulated as follows:

M^s_RGB = σ(f^{7×7}_RGB([F^s_RGB_avg; F^s_RGB_max; F^s_HHA_avg; F^s_HHA_max])),
M^s_HHA = σ(f^{7×7}_HHA([F^s_RGB_avg; F^s_RGB_max; F^s_HHA_avg; F^s_HHA_max])),

where σ denotes the sigmoid function, f^{7×7}_RGB and f^{7×7}_HHA represent the two 7 × 7 convolutional layers, and [;] refers to concatenation. F^s_RGB_avg ∈ R^{1×H×W} and F^s_RGB_max ∈ R^{1×H×W} represent the RGB maps, and F^s_HHA_avg ∈ R^{1×H×W} and F^s_HHA_max ∈ R^{1×H×W} represent the HHA maps.
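The cross-modality spatial-attention step can be sketched the same way; the naive convolution helper and toy shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive 'same'-padded convolution. x: (Cin, H, W); kernel: (Cin, k, k)."""
    _, k, _ = kernel.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros(x.shape[1:])
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def cross_modality_spatial_attention(f_cr_rgb, f_cr_hha, k_rgb, k_hha):
    # Four 2D maps: avg- and max-pooling along the channel axis per modality.
    maps = np.stack([f_cr_rgb.mean(axis=0), f_cr_rgb.max(axis=0),
                     f_cr_hha.mean(axis=0), f_cr_hha.max(axis=0)])  # (4, H, W)
    # Two independent 7x7 convolutions over the SAME concatenated maps
    # produce modality-specific spatial-attention maps.
    m_rgb = sigmoid(conv2d_same(maps, k_rgb))
    m_hha = sigmoid(conv2d_same(maps, k_hha))
    return f_cr_rgb * m_rgb[None], f_cr_hha * m_hha[None]

rng = np.random.default_rng(1)
C, H, W = 6, 9, 9
f_cr_rgb = rng.normal(size=(C, H, W))
f_cr_hha = rng.normal(size=(C, H, W))
k_rgb = rng.normal(size=(4, 7, 7)) * 0.1
k_hha = rng.normal(size=(4, 7, 7)) * 0.1
f_sr_rgb, f_sr_hha = cross_modality_spatial_attention(f_cr_rgb, f_cr_hha,
                                                      k_rgb, k_hha)
```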
To refine the features by taking advantage of their different physical significance, the CMRG utilizes combined information from RGB and HHA to generate attention maps, instead of relying solely on a single modality. Given the limited sensing capabilities of the depth camera, this strategy improves the robustness and effectiveness of the backbone, especially in the HHA branch. Specifically, for the purpose of generating fine attention maps, the descriptors (maps) are derived using both average-pooling and max-pooling, with average-pooling descriptors (maps) representing global information and max-pooling descriptors (maps) representing prominent information, thereby improving the quality of the attention maps. In addition, the CMRG employs channel- and spatial-attention mechanisms to improve the representation and aggregation of cross-modality information. In this structure, the channel-attention module is primarily responsible for capturing 'what' is important, whereas the spatial-attention module is primarily responsible for determining 'where' should be prioritized.

Bi-Directional Multi-Step Propagation
It should be noticed that the CMRG can properly fuse the features from both branches. Moreover, Bi-directional Multi-step Propagation (BMP) is employed to reduce model complexity and improve propagation efficiency.
BMP propagates the refined results to the next layer in the encoder for more accurate and efficient encoding of the RGB and HHA features; instead of adding elements directly, the fused result is scaled to half its original values before the addition. The procedure of the BMP can be formulated as follows:

F̂_RGB = F_RGB + (1/2) · REF,
F̂_HHA = F_HHA + (1/2) · REF,   (12)

where REF denotes the output of the CMRG, F_RGB and F_HHA denote the current encoding features of the two branches, and F̂_RGB and F̂_HHA are propagated to the next encoding layer.
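Under our reading of the BMP description (the exact form of Eq. (12) may differ; the halved-residual interpretation and all shapes below are our assumptions), the propagation reduces to a few array operations:

```python
import numpy as np

rng = np.random.default_rng(0)
f_rgb = rng.normal(size=(4, 8, 8))  # current RGB-branch features (toy shapes)
f_hha = rng.normal(size=(4, 8, 8))  # current HHA-branch features
ref = rng.normal(size=(4, 8, 8))    # REF: refined output of the CMRG

# BMP: the fused result is halved before being added back into each branch,
# rather than being added element-wise at full magnitude.
f_rgb_next = f_rgb + 0.5 * ref
f_hha_next = f_hha + 0.5 * ref
```

Halving the shared residual keeps either branch from being dominated by the fused signal while still injecting cross-modality information into both.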

Pyramid Supervision
The pyramid-supervision training strategy mitigates the gradient vanishing issue by incorporating supervised learning over multiple levels; furthermore, it utilizes the features in different scales to improve the final semantic performance.
As illustrated in Figure 3, in addition to the final output of the decoder, four intermediate side outputs are implemented, derived from the features of the four up-layers for pyramid supervision. Each output is generated by convolving the corresponding feature maps with a 1 × 1 convolution layer. Unlike the final output, which has the original spatial resolution, the four side outputs have spatial resolutions of 1/2, 1/4, 1/8, and 1/16 of the height and width of the final output. The loss function of pyramid supervision is formulated as follows:

Loss_total = Σ_{n=0}^{4} Loss(O_n),
Loss(O_n) = −(1/N) Σ_i log( exp(s_i[g_i]) / Σ_{c=1}^{N_c} exp(s_i[c]) )

where Loss(O_n) is the loss function for the final output or the side outputs, g_i ∈ R denotes the class index at pixel i of the ground-truth semantic map, s_i ∈ R^{N_c} denotes the vector at pixel i of the output score map, and N_c denotes the number of classes in the dataset.
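As a concrete reading of this training objective, the per-output cross-entropy and the summed pyramid loss can be sketched in NumPy as follows. The shapes and the plain unweighted sum over the five outputs are assumptions of this sketch; in practice each side output is supervised by a ground-truth map resized to its own resolution:

```python
import numpy as np

def ce_loss(scores, labels):
    # scores: (Nc, H, W) raw score map; labels: (H, W) integer class indices
    s = scores - scores.max(axis=0, keepdims=True)               # numerical stability
    log_prob = s - np.log(np.exp(s).sum(axis=0, keepdims=True))  # log-softmax over classes
    h, w = labels.shape
    # pick the log-probability of the ground-truth class at every pixel
    picked = log_prob[labels, np.arange(h)[:, None], np.arange(w)]
    return -picked.mean()

def pyramid_loss(outputs, labels_per_scale):
    # sum the cross-entropy of the final output and the four side outputs
    return sum(ce_loss(o, g) for o, g in zip(outputs, labels_per_scale))
```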

Experiments
In order to verify the effectiveness of our proposed method, we conduct evaluation experiments on the public RGB-D datasets NYUDv2 [20] and SUN RGB-D [21]. To evaluate the results, we compare the semantic-segmentation performance of various methods on three metrics: pixel accuracy, mean pixel accuracy, and mean intersection over union [4]. In addition, ablation experiments are performed on NYUDv2 with ResNet-50 [44] as the backbone, and the corresponding analysis and discussion are provided.
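For reference, all three metrics can be derived from a single confusion matrix. The following is a standard NumPy accounting sketch, not the authors' evaluation code:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    # pred, gt: flat integer arrays of predicted / ground-truth class indices
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)            # confusion matrix: rows = ground truth
    tp = np.diag(conf).astype(float)          # correctly classified pixels per class
    gt_pixels = conf.sum(axis=1)              # ground-truth pixels per class
    union = gt_pixels + conf.sum(axis=0) - tp # |pred ∪ gt| per class
    pixel_acc = tp.sum() / conf.sum()
    valid = gt_pixels > 0                     # ignore classes absent from ground truth
    mean_acc = (tp[valid] / gt_pixels[valid]).mean()
    mean_iou = (tp[valid] / union[valid]).mean()
    return pixel_acc, mean_acc, mean_iou
```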

Datasets
In this section, we introduce the public datasets utilized in our experiments.

NYUDv2 [20]: The NYUDv2 dataset consists of 1449 indoor RGB-D images with dense pixel-wise annotation. Following the official instructions [20], we split them into 795 training images and 654 testing images. The 40-category setting is adopted as in [46].

SUN RGB-D [21]: The SUN RGB-D dataset contains 10,335 indoor RGB-D images with 37 categories, including images from NYUDv2 [20], Berkeley B3DO [47], SUN3D [48], and newly captured RGB-D images. We divide the dataset into 5285 training images and 5050 testing images according to the official setting.

Implementation Details
We implement our experiments with Python 3.8 on the Ubuntu 18 operating system with the PyTorch [49] framework. All models are trained on one Nvidia RTX A5000 graphics card with a batch size of 7. We use ResNet-50 pre-trained on ImageNet [50] as the backbone of the two branches in the encoder. We adopt the SGD optimizer with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.001, and it decays by a factor of 0.8 every 100 epochs; a warm-up strategy is employed for the first 15 epochs. The network is trained for 800 epochs on both the NYUDv2 and SUN RGB-D datasets. We set the reduction ratio to 8 in all attention modules.
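The learning-rate schedule above (0.001 initial rate, decayed by 0.8 every 100 epochs, 15-epoch warm-up) can be sketched as follows; the linear shape of the warm-up ramp is an assumption of this sketch, as the text does not specify it:

```python
def learning_rate(epoch, base_lr=0.001, warmup_epochs=15, decay=0.8, step=100):
    # linear warm-up over the first 15 epochs (assumed ramp shape),
    # then step decay: lr = base_lr * 0.8 ** (epoch // 100)
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * decay ** (epoch // step)
```

In PyTorch this schedule could equivalently be expressed with `torch.optim.lr_scheduler.LambdaLR` wrapping the same function.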

Experimental Results and Comparisons
We compare our CMANet with existing semantic-segmentation methods on the NYUDv2 and SUN RGB-D datasets. The results of the three aforementioned metrics on the two datasets are displayed in Tables 2 and 3, whereas the class-wise IoU details on NYUDv2 are displayed in Table 4.

As shown in Table 2, our CMANet outperforms most of the state-of-the-art methods in semantic segmentation on the NYUDv2 dataset. On the three evaluation metrics, CMANet achieves 74.2% pixel accuracy, 60.6% mean accuracy, and 47.6% mean IoU. On the most important metric, mean IoU, CMANet achieves a 2.7% improvement over RefineNet-101 [45], 1.7% over LSD-GF [51], and 0.1% over RAFNet [43]. Notably, we only utilize ResNet-50 as our backbone, suggesting that CMANet is capable of better performance with a more powerful backbone. However, the performance of CMANet is lower than that of RDFNet on mean IoU by 0.1%. Despite this slight deficiency in segmentation performance, CMANet shows an improvement in memory and computational complexity according to the model-efficiency analysis. Furthermore, additional experiments are conducted to compare the utilization of HHA images and depth images; the results show that using HHA images slightly improves the performance of CMANet, with a 0.3% increase in mean IoU.

As shown in Table 4, we also compare the category-wise results on class IoU. CMANet performs better than RefineNet-101 and LSD-GF over 28 and 22 of the 40 classes, respectively. These results demonstrate the robustness and effectiveness of CMANet in indoor-scene semantic segmentation.

Due to the limited data scale of the NYUDv2 dataset, we also compare the semantic-segmentation performance of CMANet with other existing methods on the large-scale SUN RGB-D dataset, following the same training and testing strategy as on NYUDv2.
The comparison results are displayed in Table 3, where CMANet achieves the best performance among all methods on all three evaluation metrics. The semantic-segmentation performance of CMANet on the SUN RGB-D dataset further verifies its validity.

Ablation Study
In order to investigate the functionality of the proposed network and its processing modules, extensive ablation experiments are performed on the NYUDv2 dataset. Each experiment is conducted with the same hyper-parameter settings during training and testing periods.
The ablation study w.r.t. CMRGs is performed to verify the functionality of CMRGs at different encoding stages. As displayed in Table 5, each of the first four defective models removes certain CMRGs, while the fifth contains them all, i.e., it is the original CMANet. Interestingly, the second defective model (with G3, G4, and G5 removed) outperforms the first one (with G1 and G2 removed). Furthermore, as the CMRGs are gradually stacked from lower to higher stages, the performance of the defective models improves, with the best performance obtained when the CMRG is used in all stages. These facts make it clear that cross-modality fusion, i.e., the utilization of CMRGs, plays a vital role in the performance improvement, especially in the earlier stages. This can be attributed to the fact that low-level features are rough and compatible in a CNN-based model. The original CMANet performs the best, proving that the multi-stage cross-modality fusion is effective and that the CMRGs are mutually reinforcing.

Additionally, we conduct an ablation study on CMRGs, skip connections, and pyramid supervision to evaluate the effect of these strategies; the results are displayed in Table 6. The first defective model is the baseline without any of the strategies; the second, third, and fourth defective models each remove one corresponding strategy; the fifth is the original model. Comparing them on mean accuracy and mean IoU shows that the order of influence, from most significant to least, is CMRGs, skip connections, and pyramid supervision. According to the results, CMRGs effectively improve the semantic-segmentation performance, while skip connections and pyramid supervision provide slight improvements.

Model Efficiency Analysis
Complexity in terms of time and space is an important metric for evaluating model efficiency. Accordingly, we compare our CMANet with [13,53] to verify the model's efficiency. According to the results displayed in Table 7, our method achieves a 23.8% reduction in parameters and a 19.4% reduction in FLOPs compared to RDFNet [13] while maintaining almost the same performance. Meanwhile, CMANet outperforms 3DGNN [53] with a 20.5% reduction in inference time and a 4.5% improvement in mean IoU. However, CMANet has more parameters than 3DGNN and a lower mean IoU than RDFNet-50. Consequently, CMANet achieves a balance between model complexity and accuracy, which will be further improved in our future research.
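For context on how such parameter and FLOPs figures are typically obtained, the per-layer counts for a convolution can be computed as below. This is a generic accounting sketch, not the profiling tool used in the paper:

```python
def conv_params(c_in, c_out, k, bias=True):
    # weights: c_out filters of size c_in x k x k, plus optional biases
    return c_out * c_in * k * k + (c_out if bias else 0)

def conv_macs(c_in, c_out, k, h_out, w_out):
    # multiply-accumulate operations: one (k x k x c_in) dot product
    # per output channel per output pixel
    return c_out * c_in * k * k * h_out * w_out
```

For example, the first 7 × 7 convolution of a ResNet backbone (3 input channels, 64 output channels) has 9472 parameters including biases; summing such counts over all layers yields the totals reported in efficiency tables.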

Visualization
Evaluating semantic-segmentation performance calls for not only quantitative but also qualitative analysis, with particular attention to the interpretability of the results. Therefore, we conduct visualizations of the CMANet results.
Semantic-Segmentation Qualitative Visual Results

In Figure 8, we visualize some typical examples of semantic segmentation with our baseline, the defective models, and our proposed CMANet. In Figure 8a, the bedroom has few objects on the bed, whereas in Figure 8b there are disorganized items on the bed. Figure 8c represents a hallway scene with a complex structure. Figure 8d shows an obvious lighting imbalance in a bedroom. In Figure 8e, the table is cluttered with small and numerous objects. Figure 8f presents a scene with not only strong lighting but also many overlapping objects of similar texture. Compared with the other methods, CMANet improves the semantic-segmentation results in terms of both fine details and reduced misclassification.

Pyramid Supervision Visualization
We visualize the pyramid supervision by generating semantic maps from the final output and the four side outputs; the performance on some randomly sampled examples is illustrated in Figure 9. The fourth to eighth columns show the final output and the four side outputs, whose spatial resolutions are 1/2, 1/4, 1/8, and 1/16 of the height and width of the final output, respectively. Moving from right to left across the outputs, the refinement of pixel classification gradually improves. Meanwhile, the results demonstrate that the outputs with lower spatial resolutions perform more remarkably on large-object segmentation (walls, TVs, etc.) as well as edge extraction, owing to their larger receptive fields. In this way, the supervision on low-spatial-resolution outputs assists the higher-resolution ones through the recognition of boundaries and large objects, while the supervision on high-spatial-resolution outputs refines the semantic information. As a result, pyramid supervision enhances the final result via multi-scale semantic analysis.
Channel-Attention Visualization

To verify the effectiveness of the channel-attention mechanism, which enhances advantageous features and filters out drawbacks, we visualize the RGB and HHA channel-refined features. In Figure 10, we randomly select two highly weighted feature maps from the channel-refined features. As shown in Figure 10a, the RGB feature maps focus on the salient texture regions (e.g., the wall hangings, the curtains), while the HHA feature maps are mainly concerned with the salient structure regions (e.g., the office chair, the bookshelf), as illustrated in Figure 10b. The results demonstrate that the CMRG enhances feature extraction by attending to essential features.

Cross-Modality Refine Gate Visualization
To understand the refinement performed by the CMRG, we visualize the CMRG-refined features. As illustrated in Figure 11, we visualize some typical feature maps from two randomly sampled examples. In the first row of Figure 11, the CMRG refinement enhances the regions that share the same labels, such as the sofa, the floor, and the wall, while in the second row, the refinement emphasizes the table and the objects on it. As a result, the CMRG effectively builds connections among regions with similar characteristics, even when they are spatially dispersed.

Conclusions
In this paper, we proposed CMANet, a novel method for indoor-scene semantic segmentation that utilizes HHA and RGB images to enhance the robustness of segmentation in indoor scenes. According to our experiments, CMANet not only facilitates the learning process by enhancing the representation, robustness, and discrimination of the feature embedding, but also takes advantage of cross-modality features. CMANet employs an encoder–decoder architecture: the encoder has a two-parallel-branch backbone that extracts and aggregates modality-specific features from RGB and HHA, while the decoder generates multi-scale semantic maps that improve the final segmentation results. Specifically, we designed the CMRG, the most crucial component of CMANet. The CMRG employs a sequence of cross-modality channel- and spatial-attention modules: the channel-attention module is responsible for capturing 'what' is important, whereas the spatial-attention module is responsible for determining 'where' should be prioritized. The CMRG filters noisy information and integrates the features from the two modalities (RGB and HHA), effectively enhancing the representations of both by selecting key features and establishing connections between relevant regions. Additionally, we employ a bi-directional multi-step propagation strategy to assist propagation. The results of the ablation study and the visualizations demonstrate the significance of each proposed component. The experiments on the NYUDv2 and SUN RGB-D datasets verify the robustness and effectiveness of CMANet, and the results illustrate that the network outperforms existing indoor-scene semantic-segmentation methods and achieves new state-of-the-art performance. In our future research, we will focus on increasing the efficiency of our network by reducing its time and space complexity.
Moreover, we will consider applying semi-supervised or weakly-supervised learning strategies for indoor-scene semantic segmentation due to the limited dataset scale and inaccurate data labeling.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.