Multi-Scale Depthwise Separable Convolution for Semantic Segmentation in Street–Road Scenes

Abstract: Vision is an important way for unmanned mobile platforms to understand their surrounding environment. For such a platform, quickly and accurately obtaining environmental information is a basic requirement for its subsequent visual tasks. Based on this, a unique convolution module called the Multi-Scale Depthwise Separable Convolution module is proposed for real-time semantic segmentation. This module mainly consists of concatenation pointwise convolution and multi-scale depthwise convolution. The concatenation pointwise convolution not only changes the number of channels but also combines the spatial features from the multi-scale depthwise convolution operations to produce additional features. The Multi-Scale Depthwise Separable Convolution module can strengthen the non-linear relationship between input and output. Specifically, the multi-scale depthwise convolution module extracts multi-scale spatial features while remaining lightweight, making full use of multi-scale information to describe objects despite their different sizes. Mean Intersection over Union (MIoU), parameter count, and inference speed were used to describe the performance of the proposed network. On the Camvid, KITTI, and Cityscapes datasets, the proposed algorithm achieved a compromise between accuracy and memory in comparison with widely used and cutting-edge algorithms. In particular, it achieved 61.02 MIoU with 2.68 M parameters on the Camvid test dataset.


Introduction
The major application domains of semantic segmentation include embedded AI computing devices and self-driving systems, which have attracted a great deal of attention. These practical applications, however, impose rigorous memory requirements in addition to demanding outstanding, high-precision results. Some semantic segmentation algorithms now in wide use achieve competitive memory performance at the expense of segmentation accuracy [1,2]. RTSeg addresses computationally efficient solutions by presenting a real-time semantic segmentation benchmarking framework with a decoupled design for feature extraction and decoding methods [3]. ENet is a fast framework for semantic pixel-wise segmentation; it is based on VGG16 and combines factorized filters with conventional and dilated convolution to provide faster inference [4]. Moreover, some algorithms achieve excellent segmentation accuracy at the cost of a large number of parameters [5]. To solve the problem of semantic segmentation, i.e., pixel-by-pixel classification, based on a CNN, an existing CNN structure used for classification is converted into an FCN: local areas in the image are classified to obtain rough label maps, and then deconvolution (bilinear interpolation) is performed to obtain pixel-level labels. To achieve more accurate segmentation results, a CRF can be used for post-processing [6]. Usually, the information in the last layers is too spatially coarse to allow precise localization. On the contrary, earlier layers may be precise in localization but may not capture semantics. To solve this problem, the hypercolumn at each pixel is defined as an activation vector to obtain detailed semantic information [7]. Thus, it is a difficult task to build an efficient semantic segmentation network that balances memory and accuracy.
Unlike the object detection [8,9], object identification [10,11], and image classification [3] tasks, semantic segmentation must produce an output image with the same resolution as the input image. Hence, the corresponding networks must be able to classify individual pixels. In the semantic segmentation of road scenes, different objects need to be distinguished, which requires the network to comprehend spatial feature information among classes such as bicycles, cars, and pedestrians. In addition, the same class may contain several instances located in different parts of the road image, which requires the network to understand spatial location information. Large categories such as roads, sky, and buildings contain a large number of pixels, while others, including pedestrians, lane lines, signs, and traffic lights, contain few pixels. Moreover, different instances of the same class can have different apparent shapes due to their different distances, as shown in Figure 1. As a result, the network must be able to distinguish among classes or instances based on their shapes, despite their various sizes. It is therefore crucial to keep boundary information in the extracted image representation. Several algorithms expand branches [3,12-15], combine various stages, or alter the network's connection path based on standard convolution in order to extract multi-level information. However, since such algorithms only emphasize competitive performance with high precision, they introduce a large number of parameters in practical applications, which presents a hard challenge for embedded AI computing systems. In addition, some algorithms [4,6,16,17] use efficient convolution methods or simplify the network structure to reduce parameters. For example, Xception [18] merges three branches into one and uses group convolution to reduce the parameter count. In order to decrease the number of parameters, these algorithms usually ignore the
extraction of multi-scale features, which greatly sacrifices accuracy. We constructed a semantic segmentation module that balances accuracy and memory using Depthwise Separable Convolution [19] and that emphasizes the extraction of multi-scale features to increase accuracy. Compared with standard convolution, depthwise separable convolution achieves local feature extraction with less computational complexity, resulting in fast inference while retaining local feature extraction ability. For unmanned mobile platforms, this is very important. We provide a novel module called Multi-Scale Depthwise Separable Convolution, based on Depthwise Separable Convolution, which is intended to be efficient in terms of feature extraction in various feature spaces while remaining lightweight. The two primary components of this module are concatenation pointwise convolution and multi-scale depthwise convolution. To extract multi-scale spatial information, multi-scale depthwise convolution uses depthwise convolution with different kernel sizes. To produce additional features, concatenation pointwise convolution filters the input channels and then combines them. This paper's primary contributions are the following:

•
A brand-new module called Multi-Scale Depthwise Separable Convolution is proposed in this paper. This module can extract multi-scale information while remaining lightweight.

•
The proposed structure makes a trade-off between accuracy and memory. It significantly reduces the storage requirements of embedded AI computing devices, which is advantageous for real-world applications.
The rest of this paper is structured as follows: related work, some current approaches, and issues that need to be resolved are described in Section 2. The design of the proposed module is explained in detail in Section 3. The performance of the proposed algorithm and of the other algorithms involved is evaluated on different datasets in Section 4. Finally, we draw conclusions and discuss follow-up issues in Section 5.

Semantic Segmentation Task
For image segmentation, the OTSU algorithm and Maximum Entropy employ statistics of color information, but they ignore local characteristics, making the segmentation results more susceptible to environmental change [20-22]. Moreover, the homogeneity of object color information is critically important for these methods. Genetic Algorithms and Simulated Annealing use iteration to find the global optimum; although they can dynamically modify segmentation criteria, they have low computational efficiency and are prone to becoming trapped in local optima [23]. The Support Vector Machine constructs a hyperplane as an interval boundary, and a kernel function is added to increase non-linearity [24]. Nevertheless, the creation of non-linear, sample-dependent kernel functions often depends on a particular problem [25]. While such conventional algorithms produce decent segmentation results, they depend heavily on a certain environment and frequently fail when there are aberrant spots in the image. Deep learning algorithms, which can automatically extract object characteristics and have produced the best results in many situations, can achieve superior segmentation results on these problems.
There are two typical models for segmentation networks: (1) the first segmentation model is made up of an encoder network, interpolation, and a pixel-wise classification layer; (2) the second is made up of an encoder network, a corresponding decoder network, and a pixel-wise classification layer. The encoder network generates sparse abstract spatial features, which are convolved with trainable filters or interpolated from the nearest neighbors to yield dense spatial features. The Fully Convolutional Network (FCN) [26] employs bilinear interpolation [27] for upsampling to restore the full input resolution, and its encoder network is identical to the convolutional layers of VGG16. The FCN results are quite positive, although some of the details are coarse. Skip connections are then utilized to improve the description of details by copying high-resolution features from the encoder network to the decoder network. Bilinear interpolation is replaced with deconvolution in the upsampling procedure in Unet [28], which is based on FCN. Unet concatenates high-resolution features transferred from the encoder network with features from deconvolution before convolving them with trainable filters to convey context information to higher-resolution layers. Segnet [29] is intended to be memory- and computationally efficient during inference; it replaces skip connections with pooling indices and thereby retains additional information. PSPNet [30] consists of a pyramid pooling module and an encoder network, where the encoder network is the same as the convolutional layers of ResNet [31]. By aggregating context from multiple regions, the pyramid pooling module can collect more global context. BiseNet [32] was made to be effective in terms of speed and segmentation performance. It creates two pathways in order to produce high-resolution features and achieve an adequate receptive field, and it adds a new Feature Fusion Module to effectively merge features on top of the two existing
pathways. To solve the basic challenge of significantly decreasing the amount of processing required for pixel-wise label inference, ICNet [33] comprises three branches with different resolutions operating under appropriate label guidance. A cascade feature fusion unit is also introduced in order to swiftly accomplish high-quality segmentation.

Multi-Scale Feature Extraction
Extracting multi-scale features is an excellent way to widen the non-linear relationship between input and output. GoogLeNet [20] produces exciting image classification results on the ImageNet dataset. It suggests a new structural module, Inception, that can extract multi-scale features from various feature spaces. To increase the width of the network, Inception V1 combines three convolutional layers with different kernel sizes and a 3 × 3 max pooling layer. Inception V2 [34] also consists of four branches, similarly to Inception V1. In addition, Inception V2 introduces batch normalization [34] before the convolution operator and replaces the 5 × 5 convolution with two 3 × 3 convolution operations, decreasing the parameter count compared with Inception V1. To accelerate computation, Inception V3 [35] incorporates factorization, which divides a 7 × 7 convolution into a 1 × 7 convolution and a 7 × 1 convolution. Residual connections [31] and the Inception module are combined in Inception V4 [36] to boost performance and cut down on training time. The aforementioned Inception structures are only used in image classification.

Computational Method in Convolution Operation
Standard convolution [37] is the most common computational method for the convolution operation and is widely applied in most network architectures [38-43]. It computes a dot product: for a given n × n convolution kernel, it takes the point-by-point dot product with the corresponding n × n area. Standard convolution extracts rich information, but it produces a large number of parameters across a range of convolution operations, which makes training difficult. Different from standard convolution, factorization [4] resolves an n × n convolution into 1 × n and n × 1 convolutions, so each only takes the dot product over a 1 × n or n × 1 region; this reduces the number of parameters and speeds up training, although it loses some feature information. Group convolution [44,45] divides the set of convolution operations into several groups, and each group is trained independently. Depthwise Separable Convolution [16,19] is an efficient convolution technique that combines depthwise and pointwise convolution. Depthwise convolution is a special type of group convolution in which the number of groups is maximized. Pointwise convolution is a standard convolution with a 1 × 1 kernel; it not only filters the input channels but also combines them to produce new features. Depthwise Separable Convolution is intended to make both the number of parameters and the computation time as small as possible.
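The parameter trade-off among these computational methods can be checked with a short calculation. The sketch below (illustrative helper names, bias terms omitted, and equal input/output channel counts assumed for simplicity) compares the weight counts of the four methods discussed above:

```python
def params(k, c, method, groups=4):
    """Weight count of a k x k convolution layer with c input and c output
    channels (bias omitted), under different computational methods."""
    if method == "standard":
        return k * k * c * c
    if method == "factorized":            # 1 x k followed by k x 1
        return 2 * k * c * c
    if method == "group":                  # channels split into `groups` groups
        return k * k * c * c // groups
    if method == "depthwise_separable":    # depthwise k x k + pointwise 1 x 1
        return k * k * c + c * c
    raise ValueError(method)
```

For k = 3 and c = 128, standard convolution needs 147,456 weights, while the depthwise separable version needs 17,536, roughly an eightfold reduction.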

Proposed Algorithms
A neural network must be able to distinguish among classes based on their shapes, despite their various sizes. The majority of existing networks produce the same receptive field in a given layer and are ill-suited to describing instances of the same class that differ in size. In addition, standard convolution produces many parameters and is difficult to train. We therefore propose a Multi-Scale Depthwise Separable Convolution module based on Depthwise Separable Convolution. This module is intended to be efficient in terms of both feature extraction in various feature spaces and parameters.
Depthwise Separable Convolution changes the computational method by dividing an n × n standard convolution into two convolution operations, as seen in Figure 2:
1. Depthwise convolution is used to filter the input channels; it is a group convolution with the same number of groups as input channels.
2. Pointwise convolution, a standard convolution with a kernel size of 1 × 1, is used to combine the features from the depthwise convolution operation into new features.
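The two steps above can be sketched directly in NumPy. This is a minimal, loop-based illustration of the data flow (function names are ours, with 'same' padding and odd kernel sizes assumed), not the paper's implementation:

```python
import numpy as np

def depthwise_conv(x, w):
    """Step 1: one k x k filter per input channel (groups == channels).
    x: (C, H, W); w: (C, k, k); 'same' padding, odd k."""
    C, H, W = x.shape
    k = w.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i+k, j:j+k] * w[c])
    return out

def pointwise_conv(x, w):
    """Step 2: 1 x 1 convolution combining channels. w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)
```

Chaining `pointwise_conv(depthwise_conv(x, w_dw), w_pw)` reproduces the full Depthwise Separable Convolution of Figure 2.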
Comparing Depthwise Separable Convolution with standard convolution, one can see that it both extracts features and minimizes the number of parameters. However, it cannot extract features in different feature spaces, which makes it challenging to distinguish among classes based on their shapes and sizes. As illustrated in Figure 3, we overcome this problem by using 1 × 1, 3 × 3, and 5 × 5 kernels, respectively, to filter the input channels and then concatenating the results. Figure 3 shows that this module can extract multi-scale features from various feature spaces. Each branch in this module is an independent Depthwise Separable Convolution operation, and the three branches filter the input channels in different feature spaces. However, this module only filters the input channels in different feature spaces; it does not combine the filtered channels to produce new features. As demonstrated in Figure 4, an additional layer is therefore added after the concatenation to create new features by computing a linear combination of features with a 1 × 1 convolution. Compared with the module in Figure 3, the module in Figure 4 widens the non-linear layers by adding a layer. Enlarging the non-linear layers strengthens the non-linear link between input and output, as the added layer combines features from different feature spaces to create new features. However, the pointwise convolution layers then become redundant with the additional layer, so we adjust the structure to eliminate the redundancy, as shown in Figure 5.

Different from the module in Figure 4, in the proposed module, the outputs of the depthwise convolutions are concatenated first; the per-branch pointwise convolutions and the additional layer are merged into a single pointwise convolution; and the 5 × 5 convolution is replaced with a 3 × 3 dilated convolution (dilation rate of 2). The combination of the three depthwise convolution operations is called multi-scale depthwise convolution, and the combination of concatenation and pointwise convolution is called concatenation pointwise convolution. Figure 5 shows the proposed module, Multi-Scale Depthwise Separable Convolution. It consists of multi-scale depthwise convolution and concatenation pointwise convolution, each of which includes a convolution operation, batch normalization [34], and ReLU [46]. Multi-scale depthwise convolution extracts spatial features in different feature spaces. Concatenation pointwise convolution not only changes the number of channels but also combines the spatial features from multi-scale depthwise convolution to create new features. Compared with Depthwise Separable Convolution, Multi-Scale Depthwise Separable Convolution widens the non-linear layers to strengthen the non-linear relationship between input and output, at the cost of more parameters.
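The data flow of the proposed module can be sketched as follows. This NumPy illustration assumes three depthwise branches (1 × 1, 3 × 3, and dilated 3 × 3 with rate 2) whose outputs are concatenated and merged by a single pointwise convolution; batch normalization and ReLU are omitted, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def depthwise(x, w, dilation=1):
    """Per-channel (grouped) convolution with 'same' padding.
    x: (C, H, W); w: (C, k, k), odd k."""
    C, H, W = x.shape
    k = w.shape[1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                patch = xp[c, i:i + dilation*(k-1)+1:dilation,
                              j:j + dilation*(k-1)+1:dilation]
                out[c, i, j] = np.sum(patch * w[c])
    return out

def multi_scale_ds_conv(x, w1, w3, w3d, w_pw):
    """1x1, 3x3, and dilated-3x3 depthwise branches, concatenated,
    then merged by one 1x1 pointwise convolution w_pw: (C_out, 3*C_in)."""
    branches = np.concatenate([
        depthwise(x, w1),                 # 1x1 branch
        depthwise(x, w3),                 # 3x3 branch
        depthwise(x, w3d, dilation=2),    # dilated 3x3 (5x5 receptive field)
    ], axis=0)
    return np.einsum('oc,chw->ohw', w_pw, branches)
```

The dilated 3 × 3 branch covers the same 5 × 5 receptive field as the kernel it replaces while using fewer weights, which is the motivation given in the text.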

Experimental Results
This section assesses effectiveness on Camvid [47,48] and KITTI [49], which are often used in semantic segmentation tasks. The proposed network structure is similar to that of FCN-8s, but the number of channels in each layer is different, as shown in Table 1. The network is trained with back propagation [50]. First, we name the structure FCN-base when the convolution method is standard convolution. We then replace standard convolution with Depthwise Separable Convolution and Multi-Scale Depthwise Separable Convolution and name the resulting networks DS-FCN and MDS-FCN, respectively. To demonstrate the validity of the proposed module, we first compared MDS-FCN with FCN-base and with DS-FCN. A second comparison was made between MDS-FCN and cutting-edge algorithms such as PSPNet, BiseNet, DeepLab, ICNet, and FSSNet, as well as the commonly used FCN-8s, Segnet, and ENet. All performances were measured using MIoU and parameter count. The whole proposed network structure is shown in Figure 6.
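MIoU, the accuracy metric used throughout this section, is the mean of the per-class Intersection over Union values. A minimal NumPy version is sketched below (our own helper, assuming integer class maps; averaging only over classes present in the labels is an implementation choice we assume here):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union over classes present in the labels.
    pred, target: integer class maps of the same shape."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```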

Parameter Setting
Before training, neural networks such as FCN and Segnet typically use additional data for pre-training or are appended to a pre-trained architecture. Pre-training and fine-tuning were not employed in this research: the parameters of all the neural networks involved were randomly initialized, and the networks were trained only on the given dataset. For Camvid and KITTI, the images were randomly cropped to 480 × 352 and then fed into the neural networks. The hyperparameters were set according to experimental experience: the batch size, number of iterations, and learning rate were finally set to 4, 1200, and 0.0025, respectively. All tests were run on a single GPU (GTX 1070Ti).
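The random cropping step can be sketched as follows; this is an illustrative NumPy helper (the function name and exact sampling scheme are our assumptions), cropping an image and its label map with the same window to the 480 × 352 (width × height) size used above:

```python
import numpy as np

def random_crop(image, label, size=(352, 480), rng=None):
    """Crop the same random window from image (H, W, 3) and label (H, W).
    size is (height, width); here 352 x 480 to match the paper's setup."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = size
    H, W = label.shape
    top = rng.integers(0, H - h + 1)     # inclusive upper bound H - h
    left = rng.integers(0, W - w + 1)
    return (image[top:top+h, left:left+w],
            label[top:top+h, left:left+w])
```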

Performance Evaluation on the Camvid Dataset
As seen in Figure 7, the Camvid dataset consists of 701 color road images taken in various places. We used the common split [51] to make comparisons with earlier works simple and fair: 367 images were used for training; 101, for validation; and 233, for testing. Performance was verified by segmenting 11 classes in the Camvid dataset. In this class grouping, automobiles were grouped with truck_bus; roads were grouped with lanes; and children were grouped with pedestrians.
When the convolution method is standard convolution, Depthwise Separable Convolution, or Multi-Scale Depthwise Separable Convolution, we name the corresponding structure FCN-base, DS-FCN, or MDS-FCN, respectively. The task of segmenting 11 classes on Camvid was completed using these three network architectures, and MDS-FCN delivered competitive results in terms of MIoU and parameter count. Table 2 reveals that replacing standard convolution with Depthwise Separable Convolution resulted in a small rise in MIoU from 52.58 to 54.71 and a sharp drop in parameters from 40.36 M to 1.68 M. The suggested module, Multi-Scale Depthwise Separable Convolution, builds on Depthwise Separable Convolution to strengthen the non-linear relationship between input and output in various feature spaces and can distinguish among instances of a class that vary in size. MIoU dramatically increased from 52.58 to 61.02 when Multi-Scale Depthwise Separable Convolution was used in place of standard convolution. The MIoU of MDS-FCN was more than 6 points higher than that of DS-FCN, while its parameter count was only about 1 M larger. In addition, the proposed module showed better performance in most classes, such as cars, roads, traffic lights, and sidewalks. For large classes, such as trees, sky, and roads, standard convolution, Depthwise Separable Convolution, and Multi-Scale Depthwise Separable Convolution produced similar segmentation results. However, for small classes, the per-class Intersection over Union (per-class IoU) obtained by these modules differed significantly, and the proposed module clearly produced better results than the others. We also compared MDS-FCN with the commonly used FCN-8s [26], Segnet [29], and ENet [4], and with cutting-edge algorithms such as PSPNet [30], BiseNet [32], Dilation8 [52], DeepLab [53], ICNet [33], and FSSNet [54] on Camvid. The results of our
semantic segmentation task are quite promising, as demonstrated in Table 3 and Figure 8. In semantic pixel-wise segmentation on Camvid, MDS-FCN traded off accuracy and parameters, as shown in Table 3 and Figure 8. FCN-32 could finish the task of segmenting 11 classes, but the results are coarse, and many details could not be accurately delineated. ENet uses an efficient convolution method to decrease parameters but sacrifices accuracy. Compared with FSSNet, MDS-FCN adopts a parallel structure to extract spatial features in different feature spaces, producing more parameters; however, its MIoU increased from 58.6 to 61.02. Modern algorithms such as PSPNet and BiseNet multiply branches, combine many stages, or alter the network's connection pattern to achieve competitive performance with excellent accuracy, but they also add numerous parameters and consume a lot of memory. Except for ENet and FSSNet, the remaining networks produced more than 20 M parameters. In particular, although Dilation8 obtained 65.3 MIoU on Camvid, it produced 140.8 M parameters, about 138 M more than MDS-FCN. MDS-FCN produced only 2.68 M parameters, significantly fewer than most state-of-the-art algorithms, while reducing accuracy as little as possible, which is a benefit for synchronous operation with multiple algorithms under limited resources.

Performance Evaluation on KITTI Dataset
As seen in Figure 9, the KITTI pixel-level semantic segmentation benchmark comprises 400 color images, 200 for training and 200 for testing. As we could not obtain the ground truth for the 200 testing images and wished to confirm the effectiveness of few-shot learning, we randomly split the 200 training images into two groups: 140 images for training and 60 images for testing. The task at hand was to segment 19 classes, including roads, buildings, pedestrians, trees, etc. We then compared MDS-FCN, the neural network architecture built on Multi-Scale Depthwise Separable Convolution, with the commonly used FCN-8s, Unet, Segnet, and ENet. In a similar manner, we compared Multi-Scale Depthwise Separable Convolution with standard convolution and Depthwise Separable Convolution. In all experiments, pre-training and fine-tuning were not used. The suggested approach generated competitive results that traded off accuracy and parameters, as shown in Tables 4 and 5 and Figure 10. Neural networks establish the non-linear link between input and output by extracting spatial characteristics, and learning those spatial features normally requires ample data covering all cases. Here, however, just 140 images were used to train the network on the KITTI pixel-level semantic segmentation benchmark, so the networks must be capable of few-shot learning.
From Table 4, we can see that when standard convolution in FCN-8s was replaced with Multi-Scale Depthwise Separable Convolution, the MIoU significantly increased from 43.02 to 51.71. This demonstrates that extracting multi-scale features from diverse feature spaces is quite effective for classifying entities in a variety of complicated environments. However, when standard convolution in FCN-8s was replaced with Depthwise Separable Convolution, the MIoU decreased from 43.02 to 41.26. From Table 5, we can see that both MDS-FCN and ENet produced competitive results that are clearly superior to those of the other algorithms. In terms of MIoU, MDS-FCN outperformed all the other algorithms in Table 5. In terms of parameters, although MDS-FCN adopts a parallel structure to extract spatial features in different feature spaces, it produced only 2.68 M parameters, fewer than most other algorithms, and occupied fewer storage resources. In addition, MDS-FCN produced smooth segmentation results and delineated more details (see Figure 9), distinguishing instances based on their shapes despite their different sizes. FCN-32s, FCN-8s, Unet, and Segnet could not retain boundary information and could not accurately delineate small classes.

Performance Evaluation on the Cityscapes Dataset
We compared MDS-FCN with the widely used FCN-8s, Segnet, and ENet, and with state-of-the-art algorithms such as BiseNet, ICNet, Dilation10, DeepLab, and PSPNet on the Cityscapes dataset. Table 6 and Figure 11 show the competitive results.
The algorithms in Table 6 were all able to finish the task of segmenting 19 classes on the Cityscapes dataset. Early widely used networks such as FCN and Segnet use standard convolution to extract object features and are thus easily affected by objects at multiple scales and in different locations, which hurts segmentation accuracy. In addition, extensive standard convolution operations produce many parameters and slow down inference. Compared with those early algorithms, MDS-FCN showed a great advantage in terms of accuracy, parameters, and inference speed. Most lightweight networks, such as ESPNet, ENet, ERFNet, and BiseNet, use efficient convolution methods to accelerate inference and decrease parameters. ESPNet and ENet only stress inference speed and parameters and significantly sacrifice segmentation accuracy. ERFNet, BiseNet, and ICNet balance the relationship among segmentation accuracy, parameters, and inference speed and have similar segmentation accuracy; BiseNet has an obvious advantage in inference speed, and ERFNet has an advantage in parameters. Compared with ESPNet and ENet, MDS-FCN showed uncompetitive performance in terms of parameters and inference speed, but it significantly improved segmentation accuracy from 60.3 to 68.5. MDS-FCN showed an acceptable inference speed that was slower than that of ERFNet, BiseNet, and ICNet; however, it had an advantage in accuracy and parameters. DeepLab and PSPNet showed segmentation accuracy more than eight points higher than that of MDS-FCN, but they have extensive parameters and slow inference, which is not suitable for real-world applications with strict requirements on inference speed. MDS-FCN showed competitive performance, 68.5 MIoU with only 2.68 M parameters, while maintaining 13.4 fps on a single GTX 1070Ti card on Cityscapes, showing the ability to delineate objects based on their shape despite their small
size and producing smooth segmentation results. In addition, due to mutual occlusion and blurry boundaries among objects, it is difficult to extract information such as boundaries. On the other hand, excessive downsampling results in the loss of some detailed information, especially for elongated objects whose extended shape cannot be recovered during subsequent upsampling, resulting in segmentation failure, as shown in Figure 12.

Conclusions
We propose Multi-Scale Depthwise Separable Convolution, a novel convolution module designed to balance accuracy and parameters. Concatenation pointwise convolution and multi-scale depthwise convolution make up this module: it filters the input channels and combines them to produce new features. The suggested module, which is based on Depthwise Separable Convolution, can extract multi-scale spatial features in various feature spaces while keeping the parameter count small. On the Camvid, KITTI, and Cityscapes datasets, MDS-FCN produced competitive results, i.e., MIoU values of 61.02, 51.71, and 68.5, respectively, with 2.68 M parameters. Compared with FCN-8s, Unet, Segnet, and ENet, it can segment both small and large classes well, because the module strengthens the non-linear interaction between input and output and broadens the non-linear layers. In addition, MDS-FCN is more memory efficient, since it moderately reduces parameters by changing the computational method of the convolution operation. However, MDS-FCN cannot effectively extract spatial features for a few classes. Future work will mainly focus on a novel architecture based on Multi-Scale Depthwise Separable Convolution; compared with MDS-FCN, it will be designed to improve computational time, reduce parameters, and improve accuracy.

Figure 2. Depthwise Separable Convolution; inp and oup are the channel numbers in different layers.

Figure 3. Depthwise Separable Convolution with different kernels; inp, med1, med2, med3, and oup are the numbers of channels in different layers, where the sum of med1, med2, and med3 is equal to oup.

Figure 4. Depthwise Separable Convolution with different kernels followed by an additional layer; inp and oup are the numbers of channels in different layers.

Table 1. Detailed structure of the network.

Table 2. Comparison results of different convolution modules on Camvid.

Table 4. Comparison results of different convolution modules on KITTI.