Efficient Semantic Segmentation Using Multi-Path Decoder

Abstract: Benefiting from the boom of deep learning, state-of-the-art semantic segmentation models have achieved great progress. However, they are huge in terms of parameters and floating-point operations, which makes it hard to apply them in real-time applications. In this paper, we propose a novel deep neural network architecture, named MPDNet, for fast and efficient semantic segmentation under resource constraints. First, we use a lightweight classification model pretrained on ImageNet as the encoder. Second, we use a cost-effective upsampling datapath to restore prediction resolution and convert features for classification into features for segmentation. Finally, we propose a multi-path decoder to extract different types of features, which are not ideal to process inside a single convolutional neural network. Our model outperforms other models aiming at real-time semantic segmentation on Cityscapes: MPDNet achieves 76.7% mean IoU on the Cityscapes test set with only 118.84 GFLOPs and runs at 37.6 Hz on 768 × 1536 images on a standard GPU.


Introduction
The purpose of semantic segmentation is to predict the category label of each pixel in an image, which has long been a fundamental problem in computer vision. In recent years, with the deepening of research, the performance of semantic segmentation models has greatly improved. This has also promoted the development of many practical applications such as autonomous driving [1], medical image analysis and virtual reality [2]. Following the fully convolutional network (FCN) [3], various architectures and mechanisms were introduced to capture contextual information and generate high-resolution representations. However, semantic segmentation models not only need to be accurate but also must be efficient in order to be applied in real-time scenarios.
To achieve high accuracy, state-of-the-art semantic segmentation models modify the downsampling layers in their backbones, so the feature maps output by the backbone are usually 1/8 of the original image size. Such models need a large amount of time and GPU memory during training and inference, which makes it difficult to apply them to scenes that require real-time segmentation. To make semantic segmentation more widely usable, many real-time models [4][5][6][7][8][9] have been proposed, but the results of models that are not pretrained on ImageNet are not satisfactory. Therefore, we choose a lightweight classification model pretrained on ImageNet as our backbone in order to realize real-time inference. Unlike the classification task, which only needs to extract semantic information, semantic segmentation requires models to extract semantic, shape and location information for objects and stuff. Meanwhile, previous models process color, shape, location and texture information together inside a single convolutional neural network. Because these are very different kinds of information, this approach may not be optimal. Therefore, we propose a multi-path decoder: different types of features are extracted by different branches of the decoder, and the feature maps output by the branches are then fused to generate segmentation results.
In recent years, some studies have used edge detection as auxiliary information to strengthen the prediction of pixels on the boundaries of things and stuff. In [10], Towaki et al. proposed to process shape information in a separate branch, resulting in a two-stream convolutional neural network (CNN) architecture for semantic image segmentation. [11] treated edges as another semantic category to make the network aware of the boundary layout. [12] extended the CNN-based class-agnostic edge detector proposed in [13] and allows each edge pixel to be associated with multiple categories. Although a shape branch helps to improve the accuracy of a semantic segmentation model, we think it is unnecessary to run an edge detection branch during inference. Similar to [10], we link shape information to a separate processing branch, namely the shape stream, but the edge detection task and the semantic segmentation task in our model are independent: we do not use the edge branch during inference, which reduces the computational complexity of our model.
In this paper, we present an effective lightweight architecture for semantic segmentation based on ResNet features and a multi-path decoder. Although the results of our approach cannot match the state-of-the-art models, it is smaller and faster than they are. To the best of our knowledge, our model surpasses all other models aiming at real-time inference. By experimenting on two public datasets, Cityscapes and ADE20K, we demonstrate the effectiveness of our approach. Our MPDNet achieves 76.7% mean IoU (Intersection over Union) on the Cityscapes test set and 43.14% on the ADE20K validation set with single-scale input.
The paper is structured as follows. In Section 2, we introduce two kinds of semantic segmentation models. In Section 3, we demonstrate our approach in detail: we give an overview of the whole architecture, elaborate on the core module of our model and describe the computation process of the framework. In Section 4, ablation results and comparisons with other semantic segmentation methods are presented to prove the effectiveness of our approach. Finally, our conclusions are given in Section 5.

Related Works
Image classification networks are used as backbones in the semantic segmentation task to extract rich features. However, to yield accurate segmentation, semantic segmentation models need semantic and spatial information that a backbone trained for classification cannot provide. Meanwhile, to segment objects at different scales and classify ambiguous pixels, multi-scale features are needed. To handle these problems, multiple modifications are made for pixel-level prediction. Most modifications focus on how to obtain multi-scale context information and can be summarized into two kinds.
The first kind consists of modifications made for feature maps of the same size but different receptive fields. Generally, models use feature maps with different receptive field sizes to represent different context information and then combine these feature maps to yield new feature maps that encode multi-scale context information. Many schemes are designed for combining feature maps with different receptive fields. Typically, the last convolutional feature map of the backbone is fed into a module that concatenates feature maps with different receptive fields, such as SPP (Spatial Pyramid Pooling) [14] or an upgraded SPP, and the output of this module is then fed into the pixel-wise classifier [15]. DeepLab [16] uses atrous convolutions with different dilation rates to produce feature maps with different receptive fields. Combined with atrous convolution, DeepLab develops SPP into the Atrous Spatial Pyramid Pooling (ASPP) module. By using different neurons to represent sub-regions of different sizes, the pyramid scene parsing network (PSPNet) develops SPP into the Pyramid Pooling module.
The second kind consists of modifications made for feature maps of different sizes. After many strided convolution and pooling operations, the spatial size of the feature maps output by the backbone is very small compared to the original image, and the receptive fields of the neurons in the last layer are larger than those of the neurons in the shallower layers. Conversely, the neurons in the shallower layers have smaller receptive fields, so the shallower layers encode less semantic information but more spatial information. Models of this kind use high-resolution feature maps from the shallower layers to compensate for the low resolution of high-level features and yield accurate segmentation results. To encode multi-scale context information, collecting feature maps from each layer comes naturally: a "U-shape" architecture is built, and feature maps from deep layers to shallow layers are gradually fused [17][18][19][20].
Besides, many works combine these two kinds of modifications. Ref. [21] uses a "U-shape" architecture equipped with an atrous spatial pyramid pooling (ASPP) module. UPerNet [22] combines the Feature Pyramid Network (FPN) with the pyramid pooling module (PPM) from PSPNet. Our proposed MPDNet also combines these two kinds of modifications.

Method
The framework of our model, termed MPDNet, is demonstrated in Figure 1. All semantic segmentation methods face a trade-off between accuracy and speed. The backbone used in our framework is ResNet, which has four stages, Res1, Res2, Res3 and Res4. The spatial resolutions of the feature maps output by the four stages are 1/4, 1/8, 1/16 and 1/32 of the original image; we denote these feature maps as R1, R2, R3 and R4, respectively. Many models use feature maps with different receptive field sizes to represent different context information and then combine these feature maps to encode multi-scale context information; the DeepLab series is among the best of them. Their proposed ASPP module contains one 1 × 1 convolutional layer and three 3 × 3 convolutional layers with dilation rates of 6, 12 and 18, respectively. The feature maps output by the encoder of DeepLab are 1/16 of the original image during training and 1/8 during inference. To save training and inference time, we do not follow their method, so the height and width of the feature maps output by our encoder are 1/32 of the original image. Instead, we use high-resolution feature maps from the shallower layers to compensate for the low resolution of the high-level features and yield accurate segmentation results. To encode multi-scale context information, we concatenate the feature maps output by each block of the backbone.
At the same time, we find that under these circumstances the ASPP module does not need to be as large as the original version in DeepLab to achieve good results. In DeepLab (except DeepLabv3+), all the information must be contained in the bottom feature maps; our model does not have this constraint. Before feeding the bottom feature maps into each atrous convolution branch, we pass them through a 1 × 1 convolutional layer that reduces the number of channels to 512. This makes the number of FLOPs used by the modified ASPP in our model at least 77% lower than that of the ASPP in DeepLab during inference. Moreover, when we use four ASPPs with 64 output channels instead of one with 256 output channels, the segmentation result is better and the model is smaller.
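A hedged sketch of this slimmed-down ASPP follows: a 1 × 1 reduction to 512 channels, then a 1 × 1 branch plus three atrous 3 × 3 branches with rates 6, 12 and 18, each producing 64 output channels. The final fusion convolution and the placement of batch normalization are our assumptions where the paper does not specify them.

```python
import torch
from torch import nn


class SlimASPP(nn.Module):
    """ASPP preceded by a 1x1 channel-reduction layer. The input is first
    reduced to `mid_ch` channels (512 in the paper); each branch then
    outputs `out_ch` channels (64 in the paper's four-ASPP setting)."""

    def __init__(self, in_ch=2048, mid_ch=512, out_ch=64, rates=(6, 12, 18)):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        branches = [nn.Conv2d(mid_ch, out_ch, 1, bias=False)]  # 1x1 branch
        branches += [nn.Conv2d(mid_ch, out_ch, 3, padding=r,
                               dilation=r, bias=False)
                     for r in rates]  # atrous 3x3 branches
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(branches), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.reduce(x)
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

Four such modules, one per decoder path, replace DeepLab's single wide ASPP.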
For semantic segmentation aiming at real-time inference, a powerful yet efficient upsampling datapath is very important. The most efficient upsampling datapath uses a 1 × 1 convolutional layer as the lateral connection, fuses the enlarged feature maps with the feature maps delivered by the lateral connection, and then sends the largest feature maps to a classifier; this is efficient but weak. Many previous "U-shape" models use several 3 × 3 convolutions as the lateral connection, upsample these feature maps to a quarter of the original image, concatenate them and reduce the channel dimension to the number of object categories. We find that the concatenation operation improves the segmentation results considerably, but it operates on the largest feature maps, which requires many FLOPs. Therefore, we use a new lateral connection method.
To extract different types of features, we use four identical ASPP modules and four identical decoder paths, one for each ASPP. In each path, we denote the feature maps output by the ASPP module as LC4; LC4 encodes the highest level of semantic information. The number of channels of R3 is then reduced to 64 by a lateral connection module, which contains a 64 × C × 1 × 1 convolutional layer, a 64 × 64 × 3 × 3 convolutional layer with 64 groups and a 19 × 64 × 1 × 1 convolutional layer. By adding the upsampled LC4 and the reduced R3, new feature maps are generated, denoted as LC3. In the same way, we gradually enlarge the feature maps from bottom to top and use lateral connections to fuse the features encoded by the shallower layers of ResNet. These feature maps are denoted as LC1, LC2, LC3 and LC4, respectively; their sizes decrease from LC1 to LC4 by a ratio of 2. To fuse them, we enlarge all of them to the size of LC1 by bilinear interpolation and concatenate the resized feature maps, after which a convolutional layer reduces the channel dimension of the concatenated feature maps to the number of object categories. We then concatenate the feature maps output by the four paths, and another convolutional layer reduces the channel dimension of the result to the number of object categories. To supervise each path of the decoder, we calculate a loss for the feature maps output by each path; four classifiers are applied after the four paths of the main decoder. The above process is summarized in Algorithm 1, where LC_ij represents the j-th level feature maps in the i-th branch of the main decoder, C_i represents the concatenation of the four levels of feature maps in the i-th branch, S_i represents the prediction of the i-th branch, and S represents the final prediction of our model.
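One decoder path of this top-down scheme can be sketched in PyTorch as follows. This is an assumption-laden sketch: for simplicity it uses a uniform 64-channel lateral module (1 × 1 reduction, depthwise 3 × 3 with groups equal to channels, 1 × 1 projection) and applies the 19-class projection only after concatenation, whereas the paper's lateral module ends in a 19 × 64 × 1 × 1 layer.

```python
import torch
from torch import nn
import torch.nn.functional as F


def lateral(in_ch, out_ch=64):
    """Lateral connection sketch: 1x1 reduction, depthwise 3x3
    (groups == channels), then a 1x1 projection."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch, bias=False),
        nn.Conv2d(out_ch, out_ch, 1, bias=False))


class DecoderPath(nn.Module):
    """One path of the multi-path decoder: fuse R3..R1 top-down with the
    ASPP output LC4, then concatenate the upsampled LC1..LC4 maps and
    project to the number of classes (19 for Cityscapes)."""

    def __init__(self, enc_channels=(256, 512, 1024), ch=64, num_classes=19):
        super().__init__()
        self.laterals = nn.ModuleList([lateral(c, ch) for c in enc_channels])
        self.classify = nn.Conv2d(ch * 4, num_classes, 1)

    def forward(self, feats, lc4):
        # feats = [R1, R2, R3]; lc4 is the ASPP output at 1/32 resolution.
        lcs = [lc4]
        x = lc4
        for f, lat in zip(feats[::-1], list(self.laterals)[::-1]):
            # LC_j = Upsample(LC_{j+1}) + Lateral(R_j)
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False) + lat(f)
            lcs.append(x)
        top = lcs[-1].shape[-2:]  # LC1 resolution (1/4 of the input)
        fused = torch.cat([F.interpolate(l, size=top, mode="bilinear",
                                         align_corners=False) for l in lcs],
                          dim=1)
        return self.classify(fused)  # S_i, the prediction of this path
```

Running four such paths and concatenating their outputs before a final 1 × 1 classifier reproduces the structure summarized in Algorithm 1.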

Algorithm 1: Computation process of the main decoder.
Input: The image to be segmented, Image.
Output: The segmentation result, S; the main loss, L_m; and the loss for each decoder path, L_d.
 1: R_1, R_2, R_3, R_4 = Backbone(Image)
 2: for i = 1 to 4 do
 3:   LC_i4 = ASPP_i(R_4)
 4:   for j = 3 to 1 do
 5:     LC_ij = Upsample(LC_i(j+1)) + Lateral_ij(R_j)
 6:   end for
 7:   C_i = Concat(Upsample(LC_i1), Upsample(LC_i2), Upsample(LC_i3), Upsample(LC_i4))
 8:   S_i = Conv(C_i)
 9:   L_d_i = NLLLoss(S_i, Label)
10: end for
11: S = Conv(Concat(S_i)), where i ∈ [1, 4]
12: L_m = NLLLoss(S, Label)
13: L_d = Σ_i L_d_i

Considering the importance of edge information in semantic segmentation, we use another decoder to learn edge information in addition to the main decoder. Like the deep supervision proposed in PSPNet, what the edge decoder learns serves as a supervision signal for our model. In the edge decoder, we also use an ASPP module and gradually recover edge information by combining information encoded by the shallow layers. The steps for generating the edge predictions in the edge decoder are the same as in the main decoder; the entire process is summarized in Algorithm 2. In total, our model computes three kinds of losses: the main loss L_m, the supervision loss for each decoder path L_d and the edge supervision loss L_e. The integrated loss function L_total is thus formulated as

L_total = L_m + α · L_d + β · L_e,

where α and β are weights for balancing the losses.
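The combined objective can be sketched in PyTorch as below. The NLLLoss terms follow the algorithm descriptions above, but the `ignore_index=255` convention (Cityscapes' ignore label) and the default values of `alpha` and `beta` are our assumptions; the paper does not state them here.

```python
import torch
import torch.nn.functional as F


def total_loss(main_logits, branch_logits, edge_logits,
               label, edge_label, alpha=1.0, beta=1.0):
    """L_total = L_m + alpha * L_d + beta * L_e.

    main_logits: final prediction S; branch_logits: list of per-path
    predictions S_i; edge_logits: edge decoder prediction."""
    # L_m: main loss on the fused prediction
    l_m = F.nll_loss(F.log_softmax(main_logits, dim=1), label,
                     ignore_index=255)
    # L_d: sum of NLL losses over the four decoder paths
    l_d = sum(F.nll_loss(F.log_softmax(s, dim=1), label, ignore_index=255)
              for s in branch_logits)
    # L_e: supervision loss for the edge decoder
    l_e = F.nll_loss(F.log_softmax(edge_logits, dim=1), edge_label,
                     ignore_index=255)
    return l_m + alpha * l_d + beta * l_e
```

Since the edge decoder only contributes through L_e, it can simply be dropped at inference time, as discussed in the introduction.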

Results
All extra non-classifier convolutional layers are followed by batch normalization [23], and ReLU (rectified linear unit) is applied after batch normalization [24]. As in many works, we use the "poly" learning rate policy, in which the learning rate at the current iteration equals the initial learning rate multiplied by (1 − iter/max_iter)^power with power = 0.9. We set the initial learning rate to 0.01 for Cityscapes and 0.02 for ADE20K. Momentum and weight decay are set to 0.9 and 0.0001, respectively. For data augmentation, we adopt random resizing between 0.4 and 1.5 and random cropping, together with mean subtraction and horizontal flipping. We set the batch size to 16 during training.
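The "poly" schedule is a one-liner; a minimal sketch (the function name is ours):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - cur_iter / max_iter) ** power.

    Decays smoothly from base_lr at iteration 0 to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

With the paper's settings (base_lr = 0.01 for Cityscapes, 40k iterations), this is typically applied to the optimizer's learning rate before each iteration.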
The standard metrics used to evaluate semantic segmentation include pixel accuracy (P.A.), the proportion of correctly classified pixels, and mean IoU, the intersection over union between the prediction and the ground truth averaged over all object categories. Following common practice, we use mean IoU to evaluate the model on the Cityscapes dataset, and both mean IoU and pixel accuracy on the ADE20K dataset.
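Both metrics can be computed from a class confusion matrix; a minimal, dependency-free sketch (function names are ours):

```python
def confusion_matrix(preds, labels, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from
    flattened prediction/label sequences (rows: ground truth,
    columns: prediction)."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, labels):
        m[t][p] += 1
    return m


def pixel_accuracy(m):
    """P.A.: correctly classified pixels over all pixels."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total


def mean_iou(m):
    """Mean IoU: per-class TP / (TP + FP + FN), averaged over classes."""
    n = len(m)
    ious = []
    for c in range(n):
        tp = m[c][c]
        fp = sum(m[r][c] for r in range(n)) - tp
        fn = sum(m[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from prediction and label
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

In practice the matrix is accumulated over every image of the evaluation split before the final averages are taken.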

Cityscapes
The Cityscapes segmentation dataset contains 19 foreground object categories and one background class. It contains 5000 high-quality, finely annotated images collected from 50 cities in different seasons, divided into 2975, 500 and 1525 images for training, validation and testing, respectively. In addition, the dataset includes 20,000 coarsely annotated images; unlike the state-of-the-art models, we do not use the coarse data in our experiments. The final results of our model are obtained by evaluating on 768 × 1536 images; accordingly, we set the crop size to 512 during training. The number of training iterations is set to 40k. Table 1 shows the results when we set the number of decoder paths to different values; we find that the model works best with four paths. Table 2 shows the detailed results with different settings, and Table 3 shows the per-class results on the Cityscapes validation set. The lateral connection used in the baseline is a conventional 256-channel 3 × 3 convolutional layer. When we use ResNet-50 as the encoder and set the crop size to 768, we evaluate our model on 1024 × 2048 images and achieve 78.06% mIoU under the single-scale test; ResNet-101 and the multi-scale test scheme promote the result to 79.99% mIoU. However, this is contrary to our original intention of real-time prediction, so in order to pursue efficiency, we set the crop size to 512 and evaluate our model on 768 × 1536 images. When using ResNet-50 as the encoder, the baseline yields 74.45% mean IoU, while MPDNet yields 76.84%, an improvement of 2.39%. In our experiments, we use 64 dilated convolutions in each branch of the ASPP module in the edge decoder, because we find that the model works best when the size of the edge decoder is a quarter of that of the main decoder; if the edge decoder is larger, the edge detection task has a negative effect on the semantic segmentation task.
Experimental results with single-scale testing are listed in Tables 4 and 5. Table 6 shows the comparison to the state of the art on the validation set of ADE20K, where "-" indicates that the value is not reported in the related work. Several qualitative examples are shown in Figure 3.

Conclusions
In this work, we present a novel network, MPDNet, for fast and accurate semantic segmentation. The network has an encoder-decoder structure to encode rich contextual information and recover object boundaries. We use a powerful lateral connection to fuse the semantic information in deep layers with the spatial information in early layers. These connections allow the feature maps in deep layers to encode abstract contextual information without being responsible for low-level details and small objects. Compared with various dilated architectures, this design considerably decreases computational complexity while achieving competitive results. Considering the difference between shape information and other information, we use two decoders to recover shape information and other information, respectively. The edge decoder affects the main decoder by adding the edge loss to the main loss.
There may be a more effective way to use the edge decoder to improve the main decoder. Meanwhile, we divide the main decoder into four paths to extract different types of features. In each path, the enriched feature maps are concatenated to produce a high-resolution feature map, and we supervise each branch of the main decoder by calculating a loss for it. As an FPN-based model, MPDNet is considerably faster than many state-of-the-art models. Our experimental results show that the proposed model achieves good results on the semantic segmentation benchmark Cityscapes (76.7% mean IoU) with only 119 GFLOPs.
Like other semantic segmentation models, ours requires a large number of finely annotated images for training; how to train an effective model with a small number of annotated images is still an open problem. Due to the lack of high-resolution representations, it is still difficult for models to segment small objects, especially for models with a large output stride; a model that segments small objects well would greatly improve the benchmark. Meanwhile, we do not use inter-frame information in video segmentation, and we look forward to a weakly supervised model for semantic video segmentation.