1. Introduction
Semantic segmentation is an important task in computer vision: the pixel-wise classification of an image to mask each object in the scene. Many applications, such as autonomous driving, robotics, medical diagnostics, image editing, and augmented reality, incorporate semantic segmentation. Recent studies on semantic segmentation have achieved promising results using convolutional neural networks (CNNs), particularly encoder–decoder CNN architectures. In such architectures, the semantic segmentation task is modeled in two stages: an encoding stage, in which the image is down-sampled to obtain its deep semantic features, and a decoding stage, in which the semantic features are up-sampled to obtain a segmentation mask of the same size as the input image. Encoder–decoder architectures can achieve highly accurate segmentation results; however, the decoder stage adds considerable computational complexity to the overall model, and the input image size of such models is usually small. Furthermore, the decoding-stage reconstruction is inefficient because the de-convolution or up-sampling layers it uses eliminate small details and sometimes propagate noise.
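For concreteness, the following minimal PyTorch sketch shows this two-stage design: a strided-convolution encoder and a transposed-convolution decoder that restores the input resolution. The channel counts, input size, and class count are arbitrary illustrative choices, not those of any particular model.

```python
import torch
import torch.nn as nn

# A minimal encoder-decoder (illustrative sizes only, not a specific model):
# the encoder halves the spatial resolution twice with strided convolutions;
# the decoder restores it with transposed convolutions, which is where much
# of the extra computational cost of the decoding stage arises.
num_classes = 21  # e.g., PASCAL VOC; an assumption for illustration

encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, num_classes, kernel_size=4, stride=2, padding=1),
)

x = torch.randn(1, 3, 256, 256)
logits = decoder(encoder(x))
print(logits.shape)  # torch.Size([1, 21, 256, 256])
```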
The semantic segmentation problem can be viewed as an image-to-image translation problem, as the aim of this task is to construct a segmentation mask corresponding to each object in the input image. Further, the objects in the segmentation mask have the same boundaries as in the input image; thus, the segmentation mask can be considered an alternative representation of the image and, hence, the segmentation task can be performed using image reconstruction techniques. The efficient sub-pixel CNN [1] has shown promising results in image and video super-resolution tasks because of its depth-to-space (DTS) layer, which reconstructs a high-resolution image from many low-resolution feature maps. This layer performs image reconstruction by reordering the pixels of the low-resolution feature maps obtained from the CNN to form the super-pixels of the high-resolution image, with very accurate borders and object details. Thus, the DTS layer efficiently produces accurate semantic segmentation masks with clear borders, at far higher speeds than the encoder–decoder architectures traditionally used for segmentation tasks. The DTS layer has many advantages, including lower computational cost and considerably higher accuracy compared with the decoder stage of an encoder–decoder architecture. Therefore, the model can be accelerated while high segmentation accuracy is maintained.
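For illustration, the following sketch uses PyTorch's PixelShuffle, which implements this depth-to-space reordering; the upscaling factor and tensor shapes are arbitrary examples.

```python
import torch
import torch.nn as nn

# Depth-to-space (pixel shuffle): rearranges a (B, C*r*r, H, W) tensor into
# (B, C, H*r, W*r) by interleaving each group of r*r channels as an r x r
# super-pixel, so the up-sampling itself needs no learned weights.
r = 4                                    # upscaling factor (arbitrary example)
dts = nn.PixelShuffle(r)

low_res = torch.randn(1, 21 * r * r, 32, 32)   # 21 channels per output class
high_res = dts(low_res)
print(high_res.shape)                    # torch.Size([1, 21, 128, 128])
```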
Vision transformers, which apply a transformer network to images in a manner similar to natural language processing, are a recent achievement in computer vision. Vision transformers allow parallel processing of a sequence of patches extracted from the target image. This is achieved through positional encoding, which allows the network to learn the position of each patch within the full input image. Vision-transformer-based methods achieve better accuracy than CNN-based methods on classification, segmentation, and object detection tasks without increased computational cost.
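As a brief illustration of this patch-sequence view, the sketch below embeds image patches and adds a learned positional encoding before a single transformer encoder layer; the patch size, embedding width, and head count are typical values assumed for illustration, not tied to a specific vision transformer.

```python
import torch
import torch.nn as nn

# Patch embedding plus a learned positional encoding, followed by one
# transformer encoder layer (illustrative hyperparameters).
img_size, patch_size, embed_dim = 224, 16, 768
num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196 patches

to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                                           batch_first=True)

x = torch.randn(1, 3, img_size, img_size)
tokens = to_patches(x).flatten(2).transpose(1, 2)    # (1, 196, 768)
tokens = tokens + pos_embed                          # inject patch positions
out = encoder_layer(tokens)                          # (1, 196, 768)
print(out.shape)
```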
In this study, we propose DTS-Net, a deep network for semantic segmentation based on a sub-pixel CNN, to address the complexity problem of encoder–decoder architectures (with or without attention mechanisms) while retaining high segmentation accuracy. We also present DTS-Net-Lite, a lightweight version of this network. Our contributions can be summarized as follows:
Rather than a pixel-wise classification problem, we treat the semantic segmentation problem as an image-to-image translation problem through regression using the DTS layer, and construct segmentation maps using the higher-resolution image reconstruction approach of the super-resolution task;
We propose DTS-Net, a deep model that uses Xception architecture as a feature extractor for high-accuracy critical applications, as well as a small lightweight model, DTS-Net-Lite, for high-speed critical applications that uses MobileNetV2 architecture as a feature extractor;
We reduce the typical decoding-stage complexity for segmentation mask construction to that of the DTS image construction layer and show that this layer can construct segmentation masks with far lower computational cost and much higher precision than conventional CNN-based decoding architectures (see the sketch after this list);
We propose a new segmentation refinement technique, nearest label filtration (NLF), which improves segmentation by correcting pixels mispredicted by the DTS layer in the segmentation mask.
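To make the overall idea concrete, the following sketch shows one plausible way to wire a backbone to a DTS decoding layer. The MobileNetV2 backbone, channel counts, class count, and single-shot upscaling factor are illustrative assumptions and do not reproduce the exact DTS-Net configuration described later.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative backbone-plus-DTS head (assumed sizes; not the exact DTS-Net
# configuration). A MobileNetV2 feature extractor with output stride 32 is
# followed by a 1x1 convolution that expands the channels to
# num_classes * r * r, and a PixelShuffle (DTS) layer that rearranges them
# into a full-resolution prediction map.
num_classes, r = 21, 32          # r chosen to match the backbone output stride
backbone = models.mobilenet_v2(weights=None).features  # (B, 1280, H/32, W/32)
head = nn.Sequential(
    nn.Conv2d(1280, num_classes * r * r, kernel_size=1),
    nn.PixelShuffle(r),          # (B, num_classes, H, W)
)

x = torch.randn(1, 3, 224, 224)
mask = head(backbone(x))
print(mask.shape)                # torch.Size([1, 21, 224, 224])
```

In practice, the upscaling could also be split across several smaller DTS steps; the single-shot factor here simply keeps the sketch short.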
The proposed method achieves higher accuracy and speed than recent semantic segmentation methods. Further, because we rely on a CNN architecture in addition to the DTS layer, highly detailed feature maps can be learned, which allows the construction of higher-resolution prediction maps. In addition, we explore the joint semantic segmentation and depth estimation task and achieve promising results. A preview of our results is shown in Figure 1. The remainder of this paper is structured as follows. Section 2 summarizes related work and Section 3 presents the proposed method and architectures. Section 4 and Section 5 discuss the training and test datasets and present the results with comparison to state-of-the-art (SOTA) methods, respectively. Section 6 discusses future work and Section 7 presents conclusions.
2. Related Work
Recent semantic segmentation studies have shown that encoder–decoder architectures can efficiently perform segmentation tasks. The first encoder–decoder architecture was the fully convolutional network (FCN) [5], in which a standard image classification architecture was reused but the final dense layers were replaced with 1 × 1 convolutional layers having the same weights; the decoder stage was a simple up-sampling layer. The FCN obtained relatively good results, motivating further research on encoder–decoder architectures. Later, SegNet [6] was proposed as a deep encoder–decoder architecture featuring pooling indices shared between the max-pooling layers in the encoder stage and the corresponding max-unpooling layers in the decoder stage. SegNet exhibited impressive semantic segmentation results on outdoor and indoor segmentation datasets. U-Net [7] is another notable encoder–decoder architecture, proposed for microscopy cell segmentation in medical images. U-Net introduces skip connections between corresponding layers in the encoder and decoder stages and, hence, achieves considerable segmentation accuracy.
The four versions of DeepLab also constitute considerable contributions to the semantic segmentation task. DeepLabV1 [8] tackled the problem of inefficient down-sampling through a wide field-of-view convolution, the atrous convolution [8], which increases the spatial field of the convolutional window while using the same number of weights as a normal convolution. In that work, the conditional random field (CRF) was also adopted; it uses an energy function derived from the summation of a unary potential term, calculated from the probability distribution of the output label of each pixel, and a binary potential term, calculated from the correlation between pixel labels. In general, the CRF allows the model to learn small image details. With DeepLabV2 [9], atrous spatial pyramid pooling (ASPP) was proposed to enhance model learning at multiple scales of the feature maps. In addition, the VGG16 [10] backbone used in DeepLabV1 was replaced with ResNet [11], which yielded better performance. The ASPP was further improved for DeepLabV3 [12] through the use of different sampling rates in a cascaded manner. Finally, for DeepLabV3+ [13], the encoder was replaced with a depth-wise separable convolution-based architecture, Aligned Xception. The latter is a modified version of Xception [14] that replaces the max-pooling layers of the original architecture with strided convolutional layers and adds more Xception blocks; this facilitates higher accuracy and speed during processing.
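As a brief illustration of atrous convolution, the sketch below applies a 3 × 3 convolution with dilation rate 2, which widens the receptive field without adding weights; channel counts and input size are arbitrary.

```python
import torch
import torch.nn as nn

# Atrous (dilated) 3x3 convolution: with rate = 2 the kernel samples inputs
# two pixels apart, widening the effective receptive field to 5x5 while
# keeping only 3x3 = 9 weights per input-output channel pair;
# padding = rate preserves the spatial size.
rate = 2
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=rate, padding=rate)

x = torch.randn(1, 256, 64, 64)
print(atrous(x).shape)           # torch.Size([1, 256, 64, 64])
```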
Zhao et al. proposed the pyramid scene parsing network (PSPNet) [15], which adopts a pyramid parsing module that learns the global context of the image through region-based aggregation; pyramid pooling is employed to learn the image context from the final small feature maps. Later, Zhao et al. presented the image cascade network (ICNet) [16], which performs high-speed semantic segmentation on high-resolution images using cascade feature fusion. This approach mixes the features obtained from the input image at different scales in a method called cascade label guidance; up-sampling layers are then used to resize the output to the input image size. As another approach, the bilateral segmentation network (BiSeNet) [17] performs real-time semantic segmentation using a two-path architecture: a spatial path for spatial information preservation and a context path for general context learning through down-sampling. The features obtained by the two paths are then combined via a feature fusion technique. ResNeSt, developed by Zhang et al. [18], applies channel-wise attention across different network branches to learn diverse feature representations and cross-feature information. Shi et al. [19] proposed the Hierarchical Parsing Net for semantic segmentation, which enhances scene parsing by learning the global scene information and the contextual relations between objects in the scene using a deep neural network. Chen et al. [20] proposed a one-shot semantic segmentation method that uses multiclass label information during training to encourage the network to learn more accurate semantic features of each category; they also proposed a pyramid feature fusion module to mine the fused features of the objects and a self-prototype guidance branch to support the segmentation task. Although all of the previously mentioned methods presented challenging results, they adopted inefficient decoding stages that eliminate object details and introduce some noise.
Zoph et al. [21] proposed pre-training and self-training techniques using stronger augmentation on ImageNet [22] across different image sizes with the EfficientNet [23] architecture, and showed that pre-training and self-training are mutually beneficial and improve the accuracy of both semantic segmentation and object detection. Further, Rashwan et al. [24] proposed dilated SpineNet, or SpineNet-Seg, a network discovered via neural architecture search from DeepLabV3. In this approach, the scale-permuted networks originally used for object detection are evaluated on the semantic segmentation task, adopting a customized dilation ratio per block. Bai et al. [25] proposed the multiscale deep equilibrium model (MDEQ), which backpropagates through the equilibrium points of multiple feature scales simultaneously using implicit differentiation to avoid storing intermediate states. They attained high segmentation accuracy on Cityscapes; however, their model has high computational complexity. Termritthikun et al. [26] proposed EEEA-Net, in which a neural architecture search guided by an early-exit population initialization algorithm finds an optimized model with the lowest number of parameters. Their model achieved only average segmentation accuracy, but with an extremely low number of parameters. Ding et al. [27] recently proposed RepVGG, which adopts a VGG-like architecture composed of a stack of 3 × 3 convolutions and ReLU. RepVGG models ran much faster than ResNet-50 or ResNet-101 with higher accuracy on classification and semantic segmentation tasks. It is important to note that Aich et al. [28] initiated the direction of using depth-to-space for segmentation when they employed it to perform binary segmentation of satellite maps in the DeepGlobe dataset [29] using ResNet and VGG16 backbones; however, their implementation was not sufficiently efficient, and our model produces better-quality segmentation.
Other methods [30,31] employed depth information to support the semantic segmentation task. Kang et al. [30] proposed a depth-adaptive deep neural network for semantic segmentation using a depth-adaptive multiscale convolutional layer, consisting of an adaptive perception neuron that adjusts the receptive field at each spatial location and an in-layer multiscale neuron that applies receptive fields of different sizes in each feature space to learn features at multiple scales. Gu et al. [31] proposed hard pixel mining for semantic segmentation using a multiscale loss weight map generated from the depth data, which forces the model to pay more attention to hard pixels during segmentation; the depth data were employed during training only and not at test time. Other studies have shown the ability of CNN models to perform the joint task of semantic segmentation and depth estimation. For example, Mousavian et al. [32] proposed a multi-scale fully convolutional CNN for simultaneous semantic segmentation and monocular depth estimation. In this architecture, a CNN model is coupled with a fully connected conditional random field (CRF) to capture the contextual relations and interactions between the image semantics and depth cues. Zhang et al. [33] developed joint task-recursive learning (TRL) for semantic segmentation and depth estimation and showed that TRL can recursively refine the results of both tasks using a task attention module. Further, a hybrid CNN for depth estimation and semantic segmentation (HybridNet) was developed by Lin et al. [34]; this network improves both tasks mutually by sharing parameters between them. Finally, He et al. [35] proposed the semantic object segmentation and depth estimation network (SOSD-Net) for monocular images; in this approach, semantic objectness is used to exploit the geometric relationship between the two tasks, and the Aligned Xception architecture is employed.
All of the related studies mainly adopted encoder–decoder approaches, with some methods adding attention mechanisms or transformers. The proposed method eliminates the need for a complex decoding stage, which is computationally expensive, eliminates some of the image details, and propagates noise. Hence, we propose a simple and fast approach for dense prediction: we use the DTS layer as the decoding stage to directly construct a dense map from the small feature maps extracted by the encoding stage.