Lightweight Convolutional Neural Networks with Model-Switching Architecture for Multi-Scenario Road Semantic Segmentation

A convolutional neural network (CNN) trained on datasets covering multiple scenarios was proposed to facilitate real-time road semantic segmentation for the various scenarios encountered in autonomous driving. However, such a CNN exhibits a mutual suppression effect between weights; thus, it does not perform as well as a network trained on a single scenario. To address this limitation, we used a model-switching architecture that maintains the optimal weights of each individual model, which requires considerable space and computation. We subsequently incorporated a lightweight process into the model to reduce its size and computational load. The experimental results indicated that the proposed lightweight CNN with a model-switching architecture outperformed, and was faster than, conventional methods across multiple scenarios in road semantic segmentation.


Introduction
Semantic segmentation is an important road detection application for autonomous driving. It must be both accurate and able to operate in real time to ensure passengers' safety. Existing convolutional neural networks (CNNs) are effective for road semantic segmentation. However, variations in road conditions, weather, and time of day affect detection stability and reduce segmentation accuracy.
To achieve stable and accurate semantic segmentation under diverse road conditions, we propose a convolutional neural network (CNN) with a model-switching architecture (MSA) that was trained using data with variations in road conditions, weather, and time of day. The proposed CNN optimizes multi-model deep learning by using diverse data, and the model-switching architecture enables it to select the most appropriate model for the detected context on the basis of CNN classifiers. Training multiple models avoids the reduced detection performance caused by mutual suppression between weights, which occurs when a single-model CNN is trained on data from multiple contexts.
However, semantic segmentation using a multi-model CNN with classifiers for different contexts and model switching entails a tremendous computational load. Therefore, we additionally propose a lightweight method that enables CNN-based semantic segmentation with CNN classifiers and multiple models to achieve a given level of performance while reducing the number of calculations and increasing the calculation speed. The article is organized as follows: Section 2 introduces different types of road semantic segmentation, Section 3 presents the proposed methods, Section 4 reports the experimental results, and Section 5 concludes the paper.
The main contributions of this paper are:

Model-Switching Architecture
Multi-model approaches [46][47][48][49][50] are now popular in a variety of applications. To accommodate multiple scenarios in road semantic segmentation, data from diverse situations must be used during model training. However, the weights are affected by mutual suppression when a model is trained on multiple scenarios, which reduces accuracy. Therefore, we used a model-switching architecture to eliminate the mutual suppression of weights. Figure 2 presents an example of this architecture, which uses a heuristic decision tree to switch between models for different scenarios. The gray nodes (D, E, F, and G) represent models in the neural network, and the orange (B and C) and blue (A) nodes represent the CNN classifiers used to select the model for segmentation.
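The decision tree of Figure 2 can be sketched as follows. This is a minimal illustration only: the classifier stubs and model names are hypothetical placeholders, whereas in the actual architecture each node A, B, and C is a trained CNN classifier.

```python
# Sketch of the Figure 2 decision tree (hypothetical stubs: the real
# nodes are trained CNN classifiers, not threshold checks).

def classifier_a(frame):   # root node A: assumed to separate day from night
    return "day" if frame["brightness"] > 0.5 else "night"

def classifier_b(frame):   # node B: weather classification during the day
    return "rainy" if frame["rain"] else "clear"

def classifier_c(frame):   # node C: weather classification at night
    return "rainy" if frame["rain"] else "clear"

SEGMENTATION_MODELS = {    # leaf nodes D-G: one segmentation model per scenario
    ("day", "clear"): "model_D",
    ("day", "rainy"): "model_E",
    ("night", "clear"): "model_F",
    ("night", "rainy"): "model_G",
}

def select_model(frame):
    """Walk the tree: root classifier first, then the matching child node."""
    time_of_day = classifier_a(frame)
    weather = classifier_b(frame) if time_of_day == "day" else classifier_c(frame)
    return SEGMENTATION_MODELS[(time_of_day, weather)]

print(select_model({"brightness": 0.8, "rain": True}))   # model_E
print(select_model({"brightness": 0.2, "rain": False}))  # model_F
```

Only the classifier on the taken branch is evaluated per frame, so the switching overhead is one or two classifier passes rather than all of them.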

The proposed CNN is illustrated in Figure 1. It is based on a model-switching architecture that selects a model for semantic segmentation depending on the road conditions. This algorithm reduces misjudgments and increases segmentation accuracy. To allow faster execution, lightweight processes are incorporated into the network to reduce the computational load. These two features of the proposed neural network are described in the following sections.

Lightweight Processes
To increase the execution speed and maintain accuracy, lightweight processes, as detailed in this section, were used to reduce the computational load of the CNN.

Separable Convolution
We employed the concept of a separable convolution, which was first used in MobileNet [9]. To demonstrate the resulting reduction in computational load, we compared the conventional convolution with the separable convolution, as illustrated in Figure 3.
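For illustration, a depthwise-separable convolution can be written out directly: a per-channel spatial filter followed by a 1 × 1 pointwise convolution that mixes channels. The loop-based sketch below (valid padding, stride 1) is didactic rather than optimized, and the shapes are arbitrary examples.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_k):
    """x: (H, W, M) input; depthwise_k: (Dk, Dk, M); pointwise_k: (M, N).
    Valid padding, stride 1; a didactic loop version, not an optimized kernel."""
    H, W, M = x.shape
    Dk = depthwise_k.shape[0]
    Ho, Wo = H - Dk + 1, W - Dk + 1
    # Depthwise step: each input channel is filtered independently.
    dw = np.empty((Ho, Wo, M))
    for c in range(M):
        for i in range(Ho):
            for j in range(Wo):
                dw[i, j, c] = np.sum(x[i:i+Dk, j:j+Dk, c] * depthwise_k[:, :, c])
    # Pointwise step: a 1x1 convolution mixes the M channels into N outputs.
    return dw @ pointwise_k   # shape (Ho, Wo, N)

x = np.random.rand(8, 8, 3)
out = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(3, 16))
print(out.shape)  # (6, 6, 16)
```

The channel mixing is deferred to the cheap pointwise step, which is where the computational savings over a full convolution come from.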

The computational costs of the two forms of convolution are:

Conventional convolution: D_K · D_K · M · N · D_F · D_F (1)

Separable convolution: D_K · D_K · M · D_F · D_F + M · N · D_F · D_F (2)

where D_K is the dimension of the convolutional kernel, M is the number of input data channels, N is the number of output data channels, and D_F is the dimension of the output feature map. For the examples in Figure 3, the computation required for separable convolution is around 4.4% of that required for conventional convolution.
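The MobileNet cost formulas can be evaluated directly as a numeric cross-check; the ratio of separable to conventional cost reduces to 1/N + 1/D_K². A minimal sketch follows; the 5 × 5 kernel and 256 output channels are our assumption about the Figure 3 example, which is not reproduced here.

```python
def conv_flops(dk, m, n, df):
    """Multiply count of a conventional convolution: Dk*Dk*M*N*DF*DF."""
    return dk * dk * m * n * df * df

def sep_conv_flops(dk, m, n, df):
    """Depthwise part plus pointwise part: Dk*Dk*M*DF*DF + M*N*DF*DF."""
    return dk * dk * m * df * df + m * n * df * df

# The ratio simplifies to 1/N + 1/Dk**2 (the feature size DF cancels).
# Assuming a 5x5 kernel with 256 output channels -- our guess for the
# Figure 3 example -- the ratio comes to about 4.4%.
ratio = sep_conv_flops(5, 64, 256, 32) / conv_flops(5, 64, 256, 32)
print(round(ratio, 3))  # 0.044
```

For the more common 3 × 3 kernels, the same formula gives a ratio of roughly 1/9 plus a small channel term, i.e., about an 8- to 9-fold reduction.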

Reduction in Convolutional Layers
We employed the concept used in InceptionNet V3 [51] to reduce the number of layers in the original CNN, which lowers the number of weights and the computational load and increases the execution speed. The method for reducing the number of convolutional layers is depicted in Figure 4. A stack of n 3 × 3 layers can be reduced into a single layer whose kernel size is calculated as:

((3 · n) − (n − 1)) × ((3 · n) − (n − 1)) (3)

where n is the number of 3 × 3 layers. More generally, an N × N convolution followed by an M × M convolution can be combined into a single layer whose kernel size is calculated as:

(N + M − 1) × (N + M − 1) (4)

Using Equations (1) and (2), the computational loads of the n-layered 3 × 3 convolution and of the single combined convolution are:

n-layered 3 × 3 conventional convolution: n · 3 · 3 · M · N · D_F · D_F

One ((3 · n) − (n − 1)) × ((3 · n) − (n − 1)) conventional convolution: ((3 · n) − (n − 1))² · M · N · D_F · D_F

where D_F is the feature dimension after kernel convolution, M is the number of input data channels, N is the number of output data channels, and n is the number of 3 × 3 layers.
According to Equation (3), a two-layered 3 × 3 convolution with 64 channels can be substituted by a single 5 × 5 convolution, which could reduce the computational cost by 87.7%; by further replacing the conventional convolutions with a separable convolution, we could reduce the computational cost by 99.3%.
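The kernel-size bookkeeping above can be sketched in a couple of helper functions; the function names are ours, chosen for illustration.

```python
def stacked_3x3_kernel(n):
    """Kernel size covered by n stacked 3x3 layers: (3*n) - (n - 1) = 2n + 1."""
    return 3 * n - (n - 1)

def combined_kernel(n_size, m_size):
    """An NxN convolution followed by an MxM convolution can be merged into
    a single (N + M - 1) x (N + M - 1) convolution."""
    return n_size + m_size - 1

print(stacked_3x3_kernel(2))  # 5  -> two 3x3 layers cover one 5x5 kernel
print(combined_kernel(3, 3))  # 5  -> same result via the general rule
print(combined_kernel(5, 3))  # 7
```

Both helpers express the same receptive-field identity; the stacked-3 × 3 case is simply the general rule applied n − 1 times.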

Remove Max Pooling and Maintain Output Size
To retain the detailed features of an image, thereby increasing the accuracy of semantic segmentation, and to reduce the computational cost, thereby increasing the execution speed of the CNN, we eliminated the max pooling layers and changed the number of convolutional strides to maintain the output size obtained after max pooling. The 2 × 2 max pooling operation is depicted in Figure 5. Max pooling selects the largest value in the mask; thus, sharper features in an image are retained, but the information that is not sharp is lost. The feature map after max pooling was half the size of the original feature map (Figure 6). The number of strides in the convolution was adjusted accordingly to reduce the computation cost and increase the execution speed.
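The output-size equivalence can be checked with simple arithmetic: for even input sizes, a same-padded 3 × 3 convolution followed by 2 × 2 max pooling produces the same spatial size as a single same-padded 3 × 3 convolution with stride 2. The sketch below assumes these kernel and padding values for illustration.

```python
def out_size_conv_then_pool(h, k=3, pad=1):
    """'Same'-padded kxk conv (stride 1) followed by 2x2 max pooling, stride 2."""
    conv = (h + 2 * pad - k) // 1 + 1   # same padding keeps the size at h
    return conv // 2                    # 2x2 pooling halves it

def out_size_strided_conv(h, k=3, pad=1):
    """'Same'-padded kxk conv with stride 2 -- no pooling layer needed."""
    return (h + 2 * pad - k) // 2 + 1

# For even input sizes, the two pipelines yield identical output sizes.
for h in (224, 112, 56):
    assert out_size_conv_then_pool(h) == out_size_strided_conv(h)
print(out_size_strided_conv(224))  # 112
```

The strided version also skips computing the three-quarters of convolution outputs that pooling would discard, which is where the speed-up comes from.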

Experiment
Experiments were conducted to validate the proposed model. A model with the proposed model-switching architecture was constructed. We used VGG as the CNN classifier for model selection and the FCN for semantic segmentation.

Hardware and Software Platform
For this study, our computer configuration is shown in Table 1.

Model-Switching Architecture with Classifiers
In the experiment, we used four driving scenarios: sunny, cloudy, rainy, and rainy at night. The topology of the model-switching architecture used in this study is depicted in Figure 7. Two classifiers, namely the day-and-night classifier (blue) and the weather classifier (green), both based on VGG16, were used to select a suitable CNN model for performing semantic segmentation in the various scenarios.

Identifying weather conditions and determining whether it is day or night is challenging. For these reasons, we employed CNN classifiers to realize model switching. Figure 8 presents the operation of the CNN classifier in scenario detection, and Figure 9 depicts the model selection mechanism. The day-and-night classifier first selects an appropriate model for the time of day; the weather classifier then selects an appropriate model for the given weather scenario to perform road semantic segmentation.

Figure 10 illustrates the conventional VGG16 that was used in our experiments. VGG16 is portable, accurate, and easy to modify. However, it is excessively large and requires considerable computation. Thus, the lightweight processes that were previously described were incorporated into VGG16 to obtain a lightweight fast VGG (LWF-VGG). We used the VGG16 classifier for classification, the VGG16 FCN for semantic segmentation, and the lightweight VGG16 to construct a lightweight classifier and FCN.
To reduce the CNN size and the computational load, we replaced the conventional convolutions in VGG16 with separable convolutions, as illustrated in Figure 11. This change decreased the size of VGG16 from 1.6 GB to 235 MB, a reduction of approximately 86%. Using Equation (3), the number of convolutional layers in VGG16 can be reduced as displayed in Figure 11; this reduction from the original 22 layers to 12 further decreased the size of VGG16 to 226 MB.
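As a rough cross-check of the size reduction, the parameter counts of VGG16's convolutional stages can be compared against separable versions. This sketch assumes the standard VGG16 channel widths and omits biases and the fully connected layers (which also contribute to the 1.6 GB figure), so it only approximates the reported ratio.

```python
def conv_params(k, cin, cout):
    """Weights of a conventional kxk convolution (biases omitted)."""
    return k * k * cin * cout

def sep_conv_params(k, cin, cout):
    """Depthwise kxk filter per input channel plus 1x1 pointwise mixing."""
    return k * k * cin + cin * cout

# Convolutional stages of VGG16 (standard channel widths); the fully
# connected layers are deliberately left out of this comparison.
vgg16_convs = [(3, 3, 64), (3, 64, 64),
               (3, 64, 128), (3, 128, 128),
               (3, 128, 256), (3, 256, 256), (3, 256, 256),
               (3, 256, 512), (3, 512, 512), (3, 512, 512),
               (3, 512, 512), (3, 512, 512), (3, 512, 512)]

full = sum(conv_params(*c) for c in vgg16_convs)
sep = sum(sep_conv_params(*c) for c in vgg16_convs)
print(f"separable/conventional parameter ratio: {sep / full:.3f}")  # ~0.113
```

The conv-only ratio of roughly 11% is in the same range as the reported 235 MB / 1.6 GB (about 15%), with the gap plausibly explained by the layers excluded here.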

Lightweight Fully Convolutional Neural Network
Figure 11. Lightweight process of VGG16.
To further reduce the computational load, we removed the max pooling layers (Figure 11). The stride of the remaining separable convolutional layers was changed to 2 to produce the same output size as the 2 × 2 max pooling layers but with slightly higher accuracy and a faster execution speed. To further reduce the computational load and accelerate computation, we also used Equation (4) to build LWF-VGG tiny.

Experimental Results
The proposed FCN with a model-switching architecture and LWF-VGG was used for road semantic segmentation for different scenarios.

Comparison of Various Methods in Semantic Segmentation
To test the performance, we used the KITTI road dataset, which is widely used in road semantic segmentation research. It comprises 289 training images with a resolution of 1242 × 375. We also used the NTUT sunny, NTUT cloudy, NTUT rainy, and NTUT night rainy datasets, each comprising 250 training images with a resolution of 1920 × 1080. Figure 12 shows the road segmentation results for these datasets using the LWF-VGG FCN, including the sunny scenario in both the KITTI dataset and the NTUT sunny dataset.

Table 2 shows the performance of the various approaches on the KITTI dataset. Image-based, image + RGB-D-based, and image + LiDAR-based approaches had comparable performance. However, image-based approaches incur a lower equipment cost, and the techniques can be more easily realized. To address different driving scenarios in Taiwan and demonstrate the performance of our approach, we used four different datasets for testing. Table 3 shows the performance in different scenarios and a comparison of different approaches. On the KITTI dataset, the performance of LWF-VGG FCN and LWF-VGG FCN tiny was very close to that of modern state-of-the-art methods: LWF-VGG FCN achieved the best maximum F-score and maximum precision, and LWF-VGG FCN tiny achieved the best recall.
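The pixel-level metrics reported in the tables can be computed directly from binary road masks. The sketch below uses a tiny hand-made example; the function name and arrays are illustrative only.

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-level precision, recall, and F-score for binary road masks."""
    tp = np.sum((pred == 1) & (gt == 1))   # road pixels correctly labeled road
    fp = np.sum((pred == 1) & (gt == 0))   # background labeled road
    fn = np.sum((pred == 0) & (gt == 1))   # road labeled background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gt = np.array([[1, 1, 0], [1, 0, 0]])      # ground-truth road mask
pred = np.array([[1, 0, 0], [1, 0, 1]])    # predicted road mask
p, r, f = seg_metrics(pred, gt)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

The "maximum" F-score and precision reported on KITTI are obtained by sweeping the classification threshold and taking the best value of these metrics.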
The KITTI dataset is widely used in the field of road semantic segmentation. However, it does not cover Taiwanese road scenarios with their various climates. To validate the applicability of the proposed FCNs to road semantic segmentation in Taiwan, they were tested using a dataset of road conditions in Taipei City that was collected by the National Taipei University of Technology (NTUT). The NTUT dataset comprises four sets of images for different weather conditions and times of day, named the NTUT sunny, NTUT cloudy, NTUT rainy, and NTUT night rainy datasets, respectively. To evaluate the performance of LWF-VGG tiny, we compared it with the state-of-the-art methods MultiNet [34], BiSeNet [52], and BiSeNet v2 [53]. Table 3 shows the performance of LWF-VGG FCN and LWF-VGG FCN tiny compared with these approaches. The NTUT rainy dataset was the most challenging of our driving scenarios: because of reflection and refraction on the road surface, the features of the road images become complicated. To retain these complicated features, LWF-VGG FCN and LWF-VGG FCN tiny used the method from Section 3.2.3, which removes max pooling and changes the convolution strides to maintain the output size. This method handled the complicated features of rainy days while also reducing the computational load and accelerating computation. The inference speed of LWF-VGG FCN tiny approached that of the state-of-the-art methods.

Multi-Scenario Road Semantic Segmentation with Model-Switching Architecture
In real-world applications, road conditions change constantly. To determine whether the proposed FCN with the model-switching architecture adapts to changing conditions, we combined the four sets of images into a mixed dataset of 1000 images with a resolution of 1920 × 1080, called the NTUT mixed dataset. The test results obtained for the NTUT mixed dataset are depicted in Figure 13. To switch to the proper model for road image segmentation, the classifiers had to be trained. Collecting and labeling training data is laborious and time-consuming; for this reason, few-shot image classification is an important area of current research. The day-night dataset consisted of 207 daytime images and 86 nighttime images. The weather dataset used for weather classification consisted of 190 cloudy, 173 rainy, and 240 sunny scenario images. Table 4 compares the model size and recognition accuracy of VGG16 and LWF-VGG as classifiers at different times of day and under distinct weather conditions. The image recognition accuracy was comparable across scenarios, but the model size of LWF-VGG was considerably smaller, at only approximately 14% of that of VGG16.

The left side of Figure 13 shows the semantic segmentation results obtained using a single LWF-VGG FCN after training with the NTUT mixed dataset.
The right side of Figure 13 shows the results obtained using multiple LWF-VGG FCNs that were trained on the various datasets for different contexts, with the model-switching architecture selecting a suitable model for semantic segmentation. To demonstrate the effectiveness of the model-switching architecture, we trained one LWF-VGG FCN using the NTUT mixed dataset, trained four LWF-VGG FCNs using the NTUT sunny, NTUT cloudy, NTUT rainy, and NTUT night rainy datasets, respectively, and used the model-switching architecture to switch to a suitable model for semantic segmentation. Table 5 compares the results obtained using a single VGG16 FCN, multiple VGG16 FCNs with the model-switching architecture, a single LWF-VGG FCN, and multiple LWF-VGG FCNs with the model-switching architecture; that is, it compares multiple CNNs, each holding its own weights under the model-switching architecture, against a single CNN holding the weights of all scenarios. Both the multiple VGG16 FCNs and the multiple LWF-VGG FCNs outperformed the single VGG16 FCN and single LWF-VGG FCN, and the multiple LWF-VGG FCNs with the model-switching architecture were faster than the single VGG16 FCN. The LWF-VGG FCN tiny models were even faster.

Conclusions
The proposed model-switching architecture prevents mutual suppression between weights in a single CNN model. This architecture uses multiple CNN models to store the weights of the various states individually and uses one or more CNN classifiers to identify the current state and switch to a suitable model for semantic segmentation. However, the model-switching architecture and semantic segmentation require considerable computation, resulting in time delays. Therefore, we proposed a simple but effective lightweight CNN method to increase the calculation speed. Although the execution speed was limited by the memory bandwidth in this experiment, the proposed method was approximately two times faster than the original. The method was particularly effective for VGG-like CNNs.
This paper used the FCN, a pioneering architecture in the field of semantic segmentation. However, this architecture is constrained by the aspect ratio of the input images. In future work, we plan to investigate bilateral networks for semantic segmentation deep learning.