LASNet: A Light-Weight Asymmetric Spatial Feature Network for Real-Time Semantic Segmentation

Abstract: In recent years, deep learning models have achieved great success in the field of semantic segmentation, attaining satisfactory performance by introducing a large number of parameters. However, this achievement usually comes with high computational complexity, which seriously limits the deployment of semantic segmentation applications on mobile devices with limited computing and storage resources. To address this problem, we propose a lightweight asymmetric spatial feature network (LASNet) for real-time semantic segmentation. We jointly consider the network parameters, inference speed, and performance when designing the structure of LASNet, which makes LASNet better suited to embedded and mobile devices. In the encoding part of LASNet, we propose the LAS module, which retains and utilizes spatial information. This module combines asymmetric convolution, group convolution, and a dual-stream structure to reduce the number of network parameters while maintaining strong feature extraction ability. In the decoding part of LASNet, we propose the multivariate concatenate module to reuse shallow features, which improves the segmentation accuracy while maintaining a high inference speed. Our network attains precise real-time segmentation results in a wide range of experiments. Without additional processing or pre-training, LASNet achieves 70.99% mIoU and 110.93 FPS inference speed on the CityScapes dataset with only 0.8 M model parameters.


Introduction
In recent years, with the rapid development of deep convolutional neural networks (DCNNs), semantic segmentation has become more and more popular and has been applied to many computer vision tasks [1][2][3], with significant progress. The low-accuracy and slow-speed problems that were difficult to solve with earlier graph theory and pixel clustering methods have been overcome. Semantic segmentation technology plays a crucial role in medical imaging [4,5], remote sensing mapping [6,7], automatic driving [8,9], and indoor scenes [10]. Moreover, with the rapid development of the GPU industry, more complex DCNN models can be realized and applied, and semantic segmentation technology is constantly improving. However, there are three major problems in deep-learning-based semantic segmentation networks:
• The semantic segmentation accuracy is high, but the network model has many parameters.
• The semantic segmentation network is lightweight, but the segmentation accuracy is insufficient.
• The semantic segmentation network cannot fully use context information.
There are three solutions to the above problems: reducing the size of the input feature image, improving the convolution block structure, and using the encoder-decoder architecture.
The first method is to reduce the size of the input feature map, such as ENet [11], SegNet [12], and ERFNet [13], which can improve the inference speed but lose some spatial information. The second method is to strengthen the convolution block structure, such as AGLNet [14], GINet [15], and DSANet [16], which can improve the accuracy of semantic segmentation but reduce the inference speed. The third method uses encoder-decoder architecture, such as LRDNet [17], ERFNet, and FSFNet [18]. After the input image is given, DCNN can learn the feature map of the input image through encoder-decoder architecture. The network can gradually realize the category annotation of each pixel, achieve the end-to-end effect, reduce the amount of calculation, and realize fast inference speed and high-quality segmentation accuracy.
To solve these problems, we propose a lightweight asymmetric spatial feature network (LASNet), which can reduce the loss of spatial details, improve the inference speed, and achieve a better balance between speed and accuracy. Moreover, we design a lightweight asymmetric spatial convolution module (LAS). We use a residual unit with a skip connection to prevent network degradation and adopt a channel shuffling operation to enhance the robustness of the network. At the same time, we use the encoder-decoder architecture. We validate LASNet on the CityScapes dataset and achieve satisfactory results. Our LASNet offers good semantic segmentation accuracy and fast inference speed compared with state-of-the-art methods, as shown in Figure 1. The main contributions of this paper are as follows:
• We propose a novel deep convolutional neural network called LASNet, which adopts an asymmetric encoder-decoder architecture. Through ablation studies, optimal settings such as the module structure, dilation rate, and dropout rate are obtained, which helps to build a high-precision and real-time semantic segmentation network.
• To preserve and utilize spatial information, we propose the LAS module, which adopts asymmetric convolution, group convolution, and a dual-stream structure to balance inference speed and segmentation accuracy, while keeping computational complexity much lower. The encoding part of LASNet uses the LAS module to process downsampled features, reduce the number of network parameters, and maintain strong feature extraction ability.
• We propose a multivariate concatenate module, which is used by the decoder of LASNet for upsampling. The module reuses shallow features of images, which helps to improve the segmentation accuracy while maintaining a high inference speed.
The remainder of this paper is structured as follows. In Section 2, related work on semantic segmentation, convolutional factorization, and attention mechanisms is introduced.
Following that, the detailed architecture of LASNet is introduced in Section 3. Furthermore, the experiments can be found in Section 4. Finally, the concluding remarks and future work are given in Section 5.

Related Works
Semantic segmentation is a challenging task in computer vision. Especially in the field of automatic driving, low computational complexity and high segmentation accuracy are needed in practical applications. To meet these requirements, the design and architecture of CNNs need to be carefully arranged. For example, deep learning frameworks such as ResNet [19], VGG [20], Inception [21][22][23], and MobileNet [24][25][26] are used for semantic segmentation to predict the semantic category of each pixel of the image, so that an automatic driving system can judge the surrounding environment, such as roads, cars, pedestrians, sidewalks, and buildings, based on pixel-level predictions of the trained model. The fully convolutional network (FCN) [27] transforms the classification network into a network structure for segmentation tasks, which demonstrates end-to-end network training on the segmentation problem.

Semantic Segmentation
In order to improve the semantic segmentation accuracy for automatic driving or intelligent robots, SegNet was proposed by the University of Cambridge. However, this method required heavy computation and had low segmentation accuracy, so it was difficult to use in the field of real-time semantic segmentation. Deeplab-v3+ [28] used the encoder-decoder structure in semantic segmentation and arbitrarily controlled the resolution of the features extracted by the encoder. At the same time, in order to fuse multi-scale information, Deeplab-v3+ used dilated convolution to expand the receptive field. PSPNet [29] considered the global background of the image to generate predictions at the local level. ENet is recognized as the first real-time semantic segmentation network; it adopted an encoder-decoder architecture to obtain good segmentation accuracy and inference speed with few model parameters. ERFNet used residual units and deconvolution to maintain efficiency and improve the accuracy of semantic segmentation without consuming too many resources. The Context Guide block proposed by CGNet [30] can obtain context information and learn local and global features. AGLNet adopted an asymmetric encoder-decoder architecture and used a split-shuffle-non-bottleneck unit to generate downsampled features while maintaining strong representation ability; AGLNet made the network smaller and improved the segmentation accuracy. LMFFNet [31] extracts sufficient features with fewer parameters and fuses multiscale semantic features to effectively improve the segmentation accuracy. SGCPNet [32] uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. MAFFNet [33] can effectively extract depth features and combine the complementary information in RGB and depth.
Like these works, we use the full resolution of 1024 × 2048 on the CityScapes dataset.

Convolutional Factorization
At present, most advanced real-time semantic segmentation networks use convolution factorization, which decomposes the standard convolution into several asymmetric convolutions to reduce the computational complexity and improve the depth of the network.
The calculation cost of a standard convolution layer is usually measured in "Mult-Adds [24]", which can be written as:

MAC_s = K × K × H × W × In × Out. (1)

Convolution factorization usually decomposes a 2D convolution into two asymmetric convolutions (e.g., decomposing n × n into 1 × n and n × 1), as in group convolution [34], depthwise separable convolution [24], and its extended version [35]. Using the same notation, the calculation cost of asymmetric convolution can be written as:

MAC_a = 2 × K × H × W × In × Out, (2)

where MAC_s is the calculation cost of standard convolution, MAC_a is the calculation cost of asymmetric convolution, K is the kernel size, H and W are the spatial height and width of the feature map, and In and Out are the numbers of input and output channels, respectively. Specifically, group convolution divides the filters into different groups, which reduces training parameters and helps prevent overfitting; it has been widely used in many real-time semantic segmentation networks. Different from these efficient networks, our proposed LAS module avoids standard convolution and reduces computational complexity. Compared with ShuffleNet [36,37], which convolves only half of the input feature channels, our proposed LASNet makes full use of the input channels with multiple convolution branches. The multi-path structure of the LAS improves the feature extraction ability of the network. In addition, our proposed LAS module enhances the information exchange among feature channels while keeping the computational cost similar to that of standard convolution.
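The two cost formulas can be sanity-checked with a few lines of Python. The helper names below are ours, and we assume (as in the symmetric case In = Out) that the second convolution of the factorized pair maps Out channels to Out channels:

```python
def mult_adds_standard(k, h, w, c_in, c_out):
    """Mult-Adds of a standard k x k convolution over an H x W feature map."""
    return k * k * h * w * c_in * c_out

def mult_adds_asymmetric(k, h, w, c_in, c_out):
    """Mult-Adds of the factorized pair: a 1 x k conv (c_in -> c_out)
    followed by a k x 1 conv (c_out -> c_out)."""
    return k * h * w * c_in * c_out + k * h * w * c_out * c_out

# Example: a 3x3 conv on a 128 x 256 map with 64 input/output channels (LAS-A sizes).
s = mult_adds_standard(3, 128, 256, 64, 64)
a = mult_adds_asymmetric(3, 128, 256, 64, 64)
print(f"standard: {s:,}  asymmetric: {a:,}  ratio: {a / s:.2f}")  # ratio 0.67
```

With In = Out, the ratio MAC_a / MAC_s reduces to 2/K, i.e., 2/3 for a 3 × 3 kernel.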

Attention Mechanism
In recent years, attention mechanism [38,39] has been widely used in various tasks of computer vision, such as image processing, voice recognition, or natural language processing. SENet [40] is divided into two operations with sequence and exception. The purpose of the squeeze operation is actually to extract the spatial information, and the exception operation is used to fully capture the channel correlation. CBAM [41] is a simple and effective attention module for the convolutional neural network. Given an intermediate feature map, CBAM will infer the attention map along two independent dimensions (channel and space). Then, attention mapping is multiplied by the input feature map for adaptive feature optimization. Triplet attention [42] establishes the dependency relationship between dimensions through rotation operation and residual transformation, which encodes the channel and spatial information with negligible computational overhead. Coordinate attention [43] decomposes channel attention into two 1-dimensional feature coding processes to aggregate features along with two spatial directions. Then, the generated feature map is encoded into a pair of direction aware, and position-sensitive attention maps, which can be complementarily applied to the input feature map to enhance the representation of the object of interest. In this work, triplet attention is used in the Transform Module, and it works well.
Equation (3) is one of the core operations of triplet attention:

Z-pool(χ) = [MaxPool_0d(χ), AvgPool_0d(χ)], (3)

where 0d denotes the 0-th dimension along which the maximum and average pooling operate, and the operator [·] denotes concatenation.
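As an illustration of this operation, the following NumPy sketch (function name ours) implements Z-pool on a (C, H, W) feature map, reducing the channel dimension to 2:

```python
import numpy as np

def z_pool(x):
    """Z-pool from triplet attention: concatenate max- and average-pooling
    results along the 0-th (channel) dimension, reducing C channels to 2."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)

x = np.random.rand(64, 8, 8)   # (C, H, W) feature map
pooled = z_pool(x)
print(pooled.shape)            # (2, 8, 8)
```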

LASNet
In order to reduce the computational cost and improve the segmentation accuracy, we propose a lightweight asymmetric spatial feature network called LASNet. Firstly, we propose the LAS module, which is the core component for semantic feature extraction in the network. Thereafter, we design a new transform module and a multivariate concatenate module. In the transform module, the attention mechanism makes the network pay more attention to the essential features of the feature map. The multivariate concatenate module upsamples the feature map to the size of the input image and completes more complex boundary segmentation. Finally, we introduce the architecture of LASNet, as shown in Figure 2.

LAS Module
In order to achieve a balance between inference speed and segmentation accuracy, we design a new module called LAS. Our proposed LAS adopts a residual connection structure, which prevents vanishing or exploding gradients when building a high-performance deep network. In order to reduce the computational complexity of convolution, we first downsample the feature map in the residual unit to reduce the amount of computation. Then, we use two convolutions to extract features and upsample the feature map to match the size of the input map. We use bilinear interpolation for upsampling and downsampling, which is cheaper than convolution. Finally, we use channel shuffle to achieve feature reuse by rearranging channels. We apply asymmetric convolution, group convolution, and a dual-stream structure in the residual unit of LAS. The LAS module has three structures, LAS-A, LAS-B, and LAS-C, as shown in Figure 3. Due to this design, LASNet has fast inference speed and high accuracy.
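The channel shuffle step at the end of the residual unit can be sketched in NumPy as follows. This is the generic group-shuffle operation; the group count used inside LAS is not stated here, so `groups=2` in the example is only illustrative:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels across groups, as used after the LAS residual unit.
    x has shape (C, H, W); C must be divisible by `groups`."""
    c, h, w = x.shape
    assert c % groups == 0
    # reshape to (groups, C // groups, H, W), swap the first two axes, flatten back
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# Channels labeled 0..7; after shuffling with 2 groups the order interleaves
# the halves: 0, 4, 1, 5, 2, 6, 3, 7.
x = np.arange(8)[:, None, None] * np.ones((8, 2, 2))
print(channel_shuffle(x, 2)[:, 0, 0])
```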

LAS-A Module
In the LAS-A module, the input and output feature map sizes are 128 × 256, and the number of channels is 64. We focus on finding convolution operator combinations with higher accuracy and faster inference speed at this feature map size. We use dilated convolution to integrate multi-scale context information at the pixel level. Compared with standard convolution, dilated convolution covers a wider receptive field without increasing the number of parameters. Asymmetric convolution is also very effective in this structure: the 3 × 3 dilated convolution is decomposed into 3 × 1 and 1 × 3 dilated convolutions. Although the accuracy decreases slightly, the number of parameters is reduced by 33%.
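A quick weight count confirms the 33% figure (bias terms ignored; the channel count of 64 matches the LAS-A setting):

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out

c = 64
full = conv_params(3, 3, c, c)                            # 3x3 dilated conv
asym = conv_params(3, 1, c, c) + conv_params(1, 3, c, c)  # 3x1 + 1x3 pair
print(f"reduction: {1 - asym / full:.0%}")                # reduction: 33%
```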

LAS-B Module
In the LAS-B module, the input and output feature map sizes are 64 × 128, and the number of channels is 96. At this feature map size, the accuracy of different convolution operator combinations does not differ much. Therefore, we focus on finding convolution operator combinations with faster inference speed. We analyzed the parameters and FLOPs of various convolutions and found that 3 × 1 and 1 × 3 depthwise dilated convolutions offer fast inference speed and a large receptive field, which is suitable for the structure of the LAS-B module.

LAS-C Module
In the LAS-C module, the input and output feature map sizes are 32 × 64, and the number of channels is 128. We use a dual-stream structure to extract features and design the LAS-C module accordingly. Furthermore, we adopt a split-convolution-concatenate-shuffle operation, which reduces computational complexity. At the beginning of the LAS-C module, the input channels are evenly divided into two low-dimensional branches by a split operation. In order to decrease the computation of standard convolution, we use asymmetric convolution in the residual unit. A concatenation operation merges the convolution outputs of the two branches so that the number of channels remains the same. Finally, a channel shuffling operation exchanges information between the two branches. Channel shuffling can be regarded as feature reuse. As the data flows to the deepest layers of the network, the network capacity is expanded to a certain extent without significantly increasing the complexity.
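The split-convolution-concatenate-shuffle flow can be sketched as follows. The identity lambdas stand in for the asymmetric-convolution branches, which are omitted, so the sketch only checks the channel bookkeeping:

```python
import numpy as np

def split_cat_shuffle(x, branch_a, branch_b, groups=2):
    """Dual-stream skeleton of LAS-C: split channels into two halves,
    run each half through its own branch, concatenate, then shuffle.
    branch_a / branch_b are placeholder callables standing in for the
    asymmetric-convolution stacks of the real module."""
    c = x.shape[0]
    xa, xb = x[: c // 2], x[c // 2 :]
    y = np.concatenate([branch_a(xa), branch_b(xb)], axis=0)  # channels preserved
    g, rest = groups, c // groups
    return y.reshape(g, rest, *y.shape[1:]).transpose(1, 0, 2, 3).reshape(y.shape)

x = np.random.rand(128, 32, 64)                       # LAS-C input: 128 ch, 32 x 64
out = split_cat_shuffle(x, lambda t: t, lambda t: t)  # identity branches
print(out.shape)                                      # (128, 32, 64)
```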

Multivariate Concatenate Module
We use the multivariate concatenate module (MCM) to filter and fuse feature maps of different scales to achieve better prediction accuracy. MCM uses bilinear interpolation for upsampling to recover the size of the feature map. Then, MCM concatenates the channels of the multivariate feature maps to provide the network with multi-scale context information, which effectively improves performance. Finally, two 1 × 1 convolution layers adjust the number of channels of the feature map so that it matches the number of channels of the multivariate feature map. The structure of MCM is shown in Figure 4. The outputs of the LAS-A and LAS-B modules each serve as inputs to MCM and are processed sequentially by concatenation, bilinear upsampling, and 1 × 1 convolution.
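Since a 1 × 1 convolution is just a per-pixel linear map over channels, the concatenate-then-project step of MCM can be sketched in NumPy. The channel counts in the example are illustrative, not the exact ones used in MCM:

```python
import numpy as np

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map over channels:
    x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oi,ihw->ohw", weight, x)

# MCM-style fusion sketch: concatenate two feature maps, then project with a 1x1 conv.
a = np.random.rand(64, 128, 256)     # e.g. a shallow LAS-A output
b = np.random.rand(64, 128, 256)     # upsampled deeper features (hypothetical channels)
fused = np.concatenate([a, b], axis=0)   # (128, 128, 256)
w = np.random.rand(96, 128)              # hypothetical projection to 96 channels
print(conv1x1(fused, w).shape)           # (96, 128, 256)
```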

Transform Module
The transform module contains three branches: the first computes channel attention; the second captures interactions between the channel dimension C and the spatial dimension W; the third captures interactions between the channel dimension C and the spatial dimension H. Finally, the output features of the three branches are summed. The structure of the transform module is shown in Figure 5. We use the transform module to extract high-level semantic information from the feature map; its attention mechanism suppresses irrelevant features and focuses on essential features. Meanwhile, the transform module increases the depth of the network and improves network performance. Subsequently, we perform ablation studies on the transform module to verify the effectiveness of its attention mechanism.

LASNet Architecture
Our LASNet follows a lightweight encoder-decoder architecture. Different from traditional networks, our LASNet adopts an asymmetric architecture, where an encoder generates downsampled feature maps and the subsequent decoder upsamples the feature maps to match the input resolution. The detailed structure of our proposed model is shown in Table 1. In our architecture, the first layer is the downsampling module, as shown in Figure 4. Continuous downsampling operations reduce the size of the feature map in order to extract high-level semantic information. The number of channels increases with the downsampling rate; however, to keep computational overhead low, we limit the channel size to 128. We apply the LAS-A, LAS-B, and LAS-C modules to feature maps of different scales, stacking four of each module. This design efficiently extracts the semantic information of the feature map according to its size. In addition, the use of dilated convolutions allows our structure to expand the receptive field without losing resolution and to obtain multi-scale context information, which further improves the accuracy. Compared with larger kernel sizes, this technique reduces the amount of calculation without introducing additional parameters. We also add dropout to the LAS module for regularization and slightly increase the dropout rate in LAS-B and LAS-C to strengthen the regularization effect, which brings further benefits, as we show later in the experiments. Each bilinear interpolation layer in the encoder is followed by a 1 × 1 convolution, which adjusts the number of channels without significantly increasing the number of parameters. In the transform module, the attention mechanism enables the network to suppress irrelevant features and focus on essential features.
The multivariate concatenate module of the decoder completes more complex boundary segmentation, gradually recovering the lost spatial information by reusing shallow features. Because the number of channels in the deep semantic feature map is large, we use 1 × 1 convolution for dimensionality reduction and feature fusion. Finally, bilinear upsampling is used to recover the resolution step by step.

Experiment
In this section, we conducted semantic segmentation experiments on the challenging dataset CityScapes [44] to demonstrate the high segmentation accuracy and inference speed of our proposed LASNet. In order to better understand the potential behavior of semantic segmentation networks in machine vision, we also carried out some ablation studies.

Implement Details
We tested LASNet on the CityScapes dataset, which is a common benchmark for real-time semantic segmentation. The CityScapes dataset has 5000 images from driving scenes in 50 urban environments, comprising 2975 training images, 500 validation images, and 1525 test images, each with a size of 1024 × 2048. It has 19 categories of dense pixel annotations. For a fair comparison, we use the original image size of 1024 × 2048 as the input resolution for the CityScapes dataset.
LASNet is trained end-to-end using the Adam optimizer on the CityScapes dataset. We prefer a large batch size (set to 8) to fully use GPU memory. The initial learning rate is set to 1e-3. During training, we adopt the "poly" learning rate strategy [29], in which the power of the learning rate is 0.9, and the weight decay is set to 5e-4. The maximum number of training epochs is set to 350.
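The "poly" schedule described above can be written as a one-line function. This is the standard formulation; whether the decay is applied per epoch or per iteration is an assumption here:

```python
def poly_lr(base_lr, epoch, max_epoch, power=0.9):
    """'poly' learning-rate schedule: lr = base_lr * (1 - epoch / max_epoch) ** power."""
    return base_lr * (1 - epoch / max_epoch) ** power

# Learning rate at the start, midpoint, and near the end of 350 epochs.
for e in [0, 175, 349]:
    print(e, poly_lr(1e-3, e, 350))
```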
For data augmentation during training, we use random horizontal flipping and random scaling from 0.5 to 2 on the input image. Finally, we randomly crop the image to a fixed size for training. All images of the CityScapes dataset were normalized to zero mean and unit variance.

Comparative Experiments
In order to demonstrate the advantages of our network, we selected 11 state-of-the-art lightweight models as comparison networks, including SegNet, ENet, ICNet [45], ESPNet [46], CGNet, ERFNet, DABNet [47], FSCNN [48], FPENet [49], FSFNet, and NDNet [50]. The experimental results of some network models are generated using the default parameter settings given by their authors, while others are reproduced directly from the published literature. All comparison networks are evaluated by the mean Intersection over Union (mIoU) class score, which is commonly used to evaluate semantic segmentation models. IoU represents the ratio of the intersection to the union of the ground truth and the predicted value. Each class i calculates its own IoU as follows:

IoU_i = TP_i / (TP_i + FN_i + FP_i), (4)

where TP, FN, and FP represent true positives, false negatives, and false positives, respectively. Averaging the IoU over all classes gives the main evaluation indicator mIoU:

mIoU = (1 / (k + 1)) × Σ_{i=0}^{k} TP_i / (TP_i + FN_i + FP_i), (5)

where i indexes the classes, and k + 1 is the number of categories (including the empty category).
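The per-class IoU and its average can be computed directly from predictions and labels. The following NumPy sketch (function name ours) skips classes absent from both prediction and ground truth:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU = TP / (TP + FN + FP), averaged over classes
    that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fn = np.sum((pred != c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        if tp + fn + fp > 0:
            ious.append(tp / (tp + fn + fp))
    return float(np.mean(ious))

pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, 3))   # 0.5
```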

Analysis of CityScapes Evaluation Results
For fairness of the experimental data, all comparison networks are trained on the same hardware platform with an NVIDIA Titan XP GPU. Table 2 compares our LASNet with selected state-of-the-art networks. The experimental data show that LASNet is superior to these networks, with high classification accuracy and high inference speed. Among these methods, our proposed LASNet has only 0.8 M network parameters without pre-training on ImageNet, with 110.93 FPS inference speed and 70.99% mIoU. As can be seen from the experimental data in Table 2, LASNet still reaches 110.93 FPS even when the size of the input feature map is 1024 × 2048. The segmentation accuracy of LASNet is 1.8% higher than that of FSFNet. Other lightweight networks have inference speeds similar to our LASNet, but their segmentation accuracy is low. For example, the network model parameters of FPENet are only 0.13 M and its inference speed is 110 FPS, but its segmentation accuracy is 15% lower than that of our LASNet. We also compare with some relatively large networks in Table 2; the detailed IoU of each class is shown in Table 3. The results show that, compared with ERFNet and ICNet, our LASNet has similar segmentation accuracy, while their inference speed is 70-90 FPS lower than ours. Figure 6 shows the segmentation results of these comparison networks on the CityScapes dataset. The experimental results show that, compared with these networks, our proposed LASNet has higher accuracy and faster inference speed for targets of different scales, which demonstrates the advanced level of our network.

LAS Module Combination

In order to prove the effectiveness of our proposed LAS module, we performed ablation studies on the LAS module using the CityScapes dataset and combined LAS-base, LAS-A, LAS-B, and LAS-C into our network. Table 4 analyzes the contribution of each combination to LASNet performance. It can be observed that the introduction of different LAS modules can improve the segmentation accuracy.
Compared with the basic module, the LAS module concatenates the semantics of high-level features and the spatial details of low-level features to improve performance. The mIoU of the combination of LAS-A and LAS-B reached 71.39%, which is 1.74% higher than the baseline. The combination of LAS-A, LAS-B, and LAS-C was 1.34% higher than the baseline, with a segmentation accuracy of 70.99%. Considering FPS, parameters, and FLOPs together, we finally chose the combination of LAS-A, LAS-B, and LAS-C, which has the fewest parameters and FLOPs, faster inference speed, and high segmentation accuracy.

LAS Module Number
Our LASNet architecture uses multiple LAS modules, and we verified the impact of the number of each LAS module on LASNet performance. Table 5 analyzes the contribution of the number of each LAS module to the performance of LASNet. It can be seen that introducing different numbers of LAS modules can improve the segmentation accuracy compared to the basic module. However, when the number of LAS modules is too small or too large, it has a negative effect. When the number of LAS modules is 2, the inference speed is fast but the accuracy is too low; when the number of LAS modules is 6, the accuracy is also reduced. Finally, considering FPS, parameters, and FLOPs, we choose the combination of LAS-A, LAS-B, and LAS-C with four modules each.

Dilation Rate
We use dilated convolution [51] to expand the receptive field and aggregate semantic information, realizing a flexible fusion of multi-scale context information in the LAS module. There are four dilated convolution layers in each of the LAS-A, LAS-B, and LAS-C blocks. We performed ablation experiments on the dilation rates {1, 1, 1, 1}, {1, 2, 3, 4}, {1, 2, 5, 9}, and {1, 2, 4, 8}, comparing the network segmentation accuracy in these four cases. The experimental results are given in Table 6. We find that, depending on the use of dilated convolution, the segmentation accuracy on the CityScapes dataset can differ by up to 3%. Therefore, in the structure of LASNet, we use the dilation rates of 1, 2, 4, 8 to achieve the best segmentation accuracy. Table 6. Ablation studies of different dilation rates. The red color font indicates the optimal result.
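The effect of these dilation rates on the receptive field can be checked with the standard formula for stacked stride-1 convolutions. This ignores the downsampling between stages, so the numbers describe a single block only:

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 convolutions:
    rf = 1 + sum((kernel - 1) * d) over each layer's dilation d."""
    return 1 + sum((kernel - 1) * d for d in dilations)

# The four dilation-rate settings from the ablation study, with 3x3 kernels.
for rates in [(1, 1, 1, 1), (1, 2, 3, 4), (1, 2, 5, 9), (1, 2, 4, 8)]:
    print(rates, receptive_field(3, rates))
```

The chosen setting {1, 2, 4, 8} more than triples the receptive field of {1, 1, 1, 1} at the same parameter cost.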

Dropout Rate
In this section, we show how to select an appropriate dropout rate to improve the segmentation accuracy of LASNet. In Table 7, we analyze the impact of dropout on performance in the LAS module by modifying the dropout rate in each LAS block. The experimental results show that using a dropout rate is effective; the segmentation accuracy on the CityScapes dataset can differ by 0.49%. In particular, gradually increasing the dropout rate of the LAS modules starting from 0.01 shows the best segmentation performance. This is because dropout in the LAS module simplifies the model, improves the regularization effect and the model's generalization ability, and avoids overfitting. Therefore, a dropout rate increasing from small to large is suitable for our architecture and shows good performance. Table 7. Ablation studies of different dropout rates. The red color font indicates the optimal result.

Attention Mechanism in the Transform Module

We use the attention mechanism in the transform module to alleviate the contradiction between model complexity and expressive ability. Inspired by the way the human brain handles information overload, spatial attention assigns greater weight to the critical parts so that the model can focus on them. In Table 8, we compare four cases in the transform module of LASNet: no transform module, CBAM, triplet attention, and coordinate attention. Extensive experimental results show that selecting an appropriate attention mechanism is effective; the segmentation accuracy on the CityScapes dataset can differ by 0.36%. In particular, using triplet attention in the transform module shows the best segmentation performance. This is because the dependency between dimensions is established through rotation operations and residual transformations, which encode the channel and spatial information. Therefore, triplet attention is more suitable for our architecture. Figure 7 shows the IoU of the CityScapes dataset after segmentation in all ablation studies.

Conclusions
This paper describes a lightweight asymmetric spatial feature network (LASNet), an encoder-decoder network for real-time semantic segmentation in automatic driving. The encoder adopts channel splitting and shuffling operations in the residual unit, which strengthens information exchange through feature reuse. The LAS-A, LAS-B, and LAS-C modules quickly extract the semantic information of the feature map according to its size. Then, the attention mechanism in the transform module makes the network pay more attention to the semantic features of the feature map. Finally, the multivariate concatenate module of the decoder completes more complex boundary segmentation and gradually recovers the spatial information lost by the encoder due to the reduction of the feature map size. The entire network is trained end-to-end. To evaluate our network, we conducted experiments on a popular dataset. The experimental results show that our LASNet outperforms comparable state-of-the-art networks in segmentation accuracy and efficiency on the urban street-scene dataset. In the future, we will strive to quantize the model parameters and deploy the network on embedded devices.