Research on an Intelligent Driving Algorithm Based on the Double Super-Resolution Network

Abstract: Semantic segmentation plays an important role in image processing and is widely used in intelligent driving, medicine, and other fields. As semantic segmentation has developed, models have become more complex and training pictures have grown in resolution, so hardware requirements have risen accordingly. Many high-precision networks are difficult to deploy in intelligent driving vehicles with limited hardware, and they introduce recognition delay that is unacceptable in practical applications. Based on the Dual Super-Resolution Learning (DSRL) network, this paper proposes a network model for training on high-resolution pictures that adds a high-resolution convolution module, improving segmentation accuracy and speed while reducing computation. On the CamVid dataset, taking the road category as an example, the IOU is 95.23%, about 4 percentage points higher than DSRL, and the real-time segmentation time for the same video is reduced by 46%, from 120 s to 65 s. Segmentation is both better and faster, which greatly alleviates the recognition delay caused by high-resolution input.


Introduction
Semantic segmentation is a basic computer vision task whose purpose is to classify each pixel in a picture. It is widely used in intelligent driving, medical imaging, and pose analysis. According to research [1], when traditional cars are replaced by private autonomous vehicles, the number of cars owned by each family can be reduced, the maintenance cost will be lower than for traditional cars, and the mileage of family vehicles will increase by 57%. According to a survey, consumers are willing to pay a premium for vehicles equipped with automation equipment. Research [2] shows that cumulative energy use and greenhouse gas emissions can be reduced by 60% in the base case after a series of strategic deployments, and can be further reduced by 87% through accelerated grid decarbonization, dynamic performance sharing, vehicle life extension, improved efficiency of computer systems, improved fuel efficiency of new vehicles, etc. Therefore, intelligent driving vehicles will be widely used. In intelligent driving, however, semantic segmentation must run in real time while maintaining high accuracy. When hardware is limited, a high-precision network cannot be put into use, and the recognition delay is very large. Classic networks for semantic segmentation include UNet [3], the DeepLab series [4][5][6], PSPNet [7], SegNet [8], etc. These semantic segmentation networks usually need to be trained on high-resolution image sets to achieve high accuracy. High-resolution pictures can effectively transfer

You Only Look One-Level Feature
The Feature Pyramid Network (FPN) [17] is a basic component of recognition systems for detecting objects at different scales. The FPN framework is shown in Figure 2. FPN has two main benefits: first, it fuses multi-scale feature maps to obtain a better representation; second, it follows a divide-and-conquer strategy, detecting targets on different levels of the feature maps according to their scale. Chen et al. [18] proposed You Only Look One-level Feature (YOLOF), which studies how these two benefits of FPN affect a single-stage detector. In that work, FPN is regarded as a Multiple-in-Multiple-out (MiMo) encoder, and four types of encoders are studied: Multiple-in-Multiple-out (MiMo), Multiple-in-Single-out (MiSo), Single-in-Multiple-out (SiMo), and Single-in-Single-out (SiSo). It is found that the SiMo encoder, which takes only the C5 feature layer as input and performs no feature fusion, can achieve the same performance as the MiMo encoder. The results are shown in Figure 3. These phenomena illustrate two facts: (1) the C5 feature provides sufficient semantic information for object detection at different scales, which enables the SiMo encoder to achieve the same results as the MiMo encoder; (2) the benefit of multi-scale feature fusion is far less important than the divide-and-conquer strategy, so multi-scale feature fusion may not be the most significant benefit of FPN.

Involution
Ordinary convolution has two characteristics: spatial invariance and channel specificity. It also has two defects: its receptive field is small, which makes long-distance dependence difficult to capture, and information is redundant between channels. On this basis, Li et al. proposed the concept of involution. Involution is structurally the inverse of ordinary convolution: the kernel is shared across the channel dimension and is specific to each position in the spatial dimension, which makes the modeling more flexible. The structure of involution is shown in Figure 4.

Figure 4. Involution structure. The involution kernel $H_{i,j} \in \mathbb{R}^{K \times K \times 1}$ (G = 1 in this example for ease of demonstration) is yielded from the function φ conditioned on a single pixel at (i, j), followed by a channel-to-space rearrangement. The multiply-add operation of involution is decomposed into two steps, with ⊗ indicating multiplication broadcast across C channels and ⊕ indicating summation aggregated within the K × K spatial neighborhood.

The involution kernel at each pixel has size K × K × G, where G is the number of channel groups; with G = 1, as in Figure 4, all channels share the same kernel. Unlike ordinary convolution, involution does not use a fixed weight matrix; instead, the involution kernel is generated from the feature map itself. Spatial specificity lets the kernel capture different feature representations at different spatial locations and eases the problem of long-distance pixel dependence. Channel invariance reduces the redundant information between channels to a certain extent and improves the computing efficiency of the network. In essence, moving from ordinary convolution to involution redistributes computing power: network design is ultimately about allocating limited computing power to the positions where it is most effective. The involution module is easy to implement, combines readily with various network models, and can replace conventional convolution to build an efficient backbone network.
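As a rough illustration of the mechanism described above, the following pure-Python sketch applies a single-group (G = 1) involution to a small H × W × C feature map. The names `involution`, `phi`, and `toy_phi` are placeholders introduced here. The kernel-generating function is a toy stand-in rather than a learned one, and zero padding at the border is an assumption.

```python
def involution(x, phi, K):
    """Single-group involution over x, a nested list of shape H x W x C.

    phi maps one pixel's C-vector to K*K kernel weights (row-major), so the
    kernel differs per location but is shared by all channels.
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    r = K // 2
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            kernel = phi(x[i][j])  # kernel generated from the pixel at (i, j)
            for u in range(K):
                for v in range(K):
                    ii, jj = i + u - r, j + v - r
                    if 0 <= ii < H and 0 <= jj < W:  # zero padding (assumption)
                        w = kernel[u * K + v]
                        for c in range(C):
                            # the same weight w is broadcast across all channels
                            out[i][j][c] += w * x[ii][jj][c]
    return out


def toy_phi(pixel, K=3):
    """Toy kernel generator (assumption): uniform weights scaled by the pixel mean."""
    scale = sum(pixel) / len(pixel)
    return [scale / (K * K)] * (K * K)


if __name__ == "__main__":
    feature_map = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
    result = involution(feature_map, toy_phi, K=3)
    print(result[1][1])
```

In a trained network, `phi` would be a small learned mapping, which is what lets the kernel adapt to the content at each location.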

Network Structure
In the Dual Super-Resolution Learning (DSRL) network, to limit the extra computation caused by high-resolution input, the 960 × 720 high-resolution image is first downsampled to 480 × 360, so that each side is half its original length. The low-resolution feature layer is then restored to the original image size by simple upsampling through Semantic Segmentation Super-Resolution (SSSR) and Single Image Super-Resolution (SISR). This article compares color pictures at the original size, after 1/2 downsampling, and after 1/2 downsampling followed by 2× upsampling. The three pictures do not differ visually, so we use the operator [-1 -1 -1; -1 8 -1; -1 -1 -1] to extract their edges. The edge features extracted from the original image contain more noise, but the image details are well preserved. After 1/2 downsampling, the edge noise is reduced, but the object details also become coarse; after 1/2 downsampling plus 2× upsampling, both the edge noise and the object details are greatly reduced. The images and their extracted edge features are shown in Figure 5. In the following experiment, parts of these three images are used as input and the segmentation effects are compared. The experimental results show that although downsampling reduces noise, the missing details matter more, and the amount of noise has little effect on accuracy.
Therefore, this paper proposes a new network model based on the DSRL network model to address the above problems. The network is divided into two modules: a low-resolution image convolution module based on super-resolution theory, and a convolution module for high-resolution pictures. Following You Only Look One-level Feature (YOLOF), only the C5-level feature layer is extracted. Because the C5-level feature layer has sufficient semantic information, the low-resolution convolution module performs no feature fusion; it expands the receptive field through dilated convolution and then recovers high resolution through upsampling.
However, since the image is downsampled by a factor of two at the input, features of the original image are lost. A convolution module for the high-resolution image is therefore added to the network to make up for the features lost through the reduction in resolution. To avoid the proliferation of network parameters caused by convolving high-resolution images, this module performs only a small number of convolutions, and some of them are replaced by involution to reduce the amount of calculation. The network structure is shown in Figure 6, maintaining two branches during training and two branches during testing. During testing, the Mean Square Error (MSE) loss branches are pruned to reduce the amount of calculation.
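The edge comparison described in this section can be sketched in a few lines of pure Python. The 3 × 3 operator is the one given in the text. The resampling methods (keeping every second pixel, nearest-neighbour upsampling) and the zero border handling are assumptions, since the paper does not specify them, and all function names are placeholders introduced here.

```python
LAPLACIAN = [[-1, -1, -1],
             [-1,  8, -1],
             [-1, -1, -1]]


def edge_response(img):
    """Apply the 3x3 operator to a 2-D grayscale image (list of rows).

    Border pixels are left at 0 (an assumption; the paper does not say).
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(LAPLACIAN[dy][dx] * img[y + dy - 1][x + dx - 1]
                            for dy in range(3) for dx in range(3))
    return out


def downsample2(img):
    """1/2 downsampling by keeping every second pixel (assumed method)."""
    return [row[::2] for row in img[::2]]


def upsample2(img):
    """2x nearest-neighbour upsampling (assumed method)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


if __name__ == "__main__":
    # A single bright pixel stands in for a fine detail such as a lane marking.
    image = [[0] * 6 for _ in range(6)]
    image[3][3] = 1
    restored = upsample2(downsample2(image))
    print(edge_response(image)[3][3], edge_response(restored)[3][3])
```

In this toy case the one-pixel detail falls between the retained samples, so its edge response vanishes after the round trip. That is the kind of detail loss the high-resolution module is meant to compensate for.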

Loss Function
The network loss function consists of three parts. The first is the cross-entropy loss between the network output and the ground-truth segmentation map. The second is the binary cross-entropy loss between the low-dimensional feature layer of the network and the ground-truth segmentation map downsampled to the corresponding size. The third is the Mean Square Error (MSE) between the network output and the actual picture. The edge features of the ground-truth segmentation are shown in Figure 7 (edge extraction from ground truth). The Cross-Entropy (CE) loss function is shown in Formula (1), where $p_i$ and $y_i$ refer to the predicted segmentation probability and the corresponding category for pixel $i$, and $N$ is the number of pixels.
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i) \quad (1)$$
The Binary Cross-Entropy (BCE) loss function is shown in Formula (2), where $t_i$ and $o_i$ refer to the target value and the value of the model output.
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log(o_i) + (1 - t_i)\log(1 - o_i)\right] \quad (2)$$
The Mean Square Error is shown in Formula (3), where $t_i$ and $o_i$ again refer to the target value and the value of the model output.
$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(t_i - o_i\right)^2 \quad (3)$$
The whole loss function, which combines the three terms, is shown in Formula (4); its two weighting coefficients, indexed 1 and 2, are set to 0.2 and 0.4, respectively.
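To make the three loss terms concrete, here is a minimal pure-Python sketch over flattened per-pixel values. The function names are placeholders introduced here. The way the two weights are attached (0.2 on the BCE term and 0.4 on the MSE term, with the CE term unweighted) is an assumption that follows the order in which the terms are described; the text does not confirm which weight belongs to which term.

```python
import math


def cross_entropy(probs, labels):
    """Mean CE; probs[i] is the class-probability list for pixel i,
    labels[i] is its true class index."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def binary_cross_entropy(outputs, targets, eps=1e-12):
    """Mean BCE between outputs in (0, 1) and binary targets."""
    total = 0.0
    for o, t in zip(outputs, targets):
        o = min(max(o, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += t * math.log(o) + (1 - t) * math.log(1 - o)
    return -total / len(targets)


def mse(outputs, targets):
    """Mean squared error between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def total_loss(ce, bce, mse_value, w1=0.2, w2=0.4):
    """Weighted sum of the three terms.

    ASSUMPTION: w1 scales the BCE term and w2 the MSE term.
    """
    return ce + w1 * bce + w2 * mse_value


if __name__ == "__main__":
    ce = cross_entropy([[0.9, 0.1], [0.2, 0.8]], [0, 1])
    bce = binary_cross_entropy([0.9, 0.2], [1, 0])
    err = mse([0.5, 0.5], [1.0, 0.0])
    print(total_loss(ce, bce, err))
```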

Construction of Dataset
This paper uses the CamVid (Cambridge-driving Labeled Video Database) dataset, which consists of 960 × 720 high-resolution pictures captured from videos taken during real driving. The pictures are labeled with 32 categories, such as bicycles, roads, cars, and so on. The data were divided into a training set, a validation set, and a test set in the proportion 7:2:1. To enhance the generalization ability of the model, data augmentation methods such as flipping and cropping were applied to the training set.
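A 7:2:1 split with a simple flip augmentation could be sketched as follows. The function names, the fixed seed, and the choice to shuffle before splitting are assumptions for illustration; the paper does not describe how the split was drawn.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/validation/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # any remainder goes to the test set
    return train, val, test


def horizontal_flip(image, label):
    """Flip an image and its label map together so they stay aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in label]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the split sizes.
    names = [f"frame_{i:04d}.png" for i in range(100)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```

Flipping the label map together with the image matters for segmentation, because the per-pixel classes must stay aligned with the pixels they describe.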

Network Model Evaluation Index
Assuming that there are k classes (including k − 1 target classes and one background class), $p_{ij}$ represents the total number of pixels belonging to class i that are predicted as class j. Specifically, $p_{ii}$ represents TP (true positives), $p_{ij}$ indicates FP (false positives), and $p_{ji}$ indicates FN (false negatives). The evaluation indicators included the following categories:
(1) PA (Pixel Accuracy): the ratio between the number of correctly classified pixels and the number of all pixels, as shown in Formula (5).
$$PA = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}} \quad (5)$$
(2) MPA (Mean Pixel Accuracy): the proportion of correctly classified pixels within each class, averaged over all classes, as shown in Formula (6).
$$MPA = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}} \quad (6)$$
(3) MIOU (Mean Intersection over Union): the ratio between the intersection and the union of the real value and the predicted value, averaged over all classes, as shown in Formula (7).
$$MIOU = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}} \quad (7)$$
(4) DICE: twice the intersection of the predicted result and the real result, divided by the sum of the predicted result and the real result, as shown in Formula (8), where X represents the real value and Y represents the predicted value.
$$DICE = \frac{2\,|X \cap Y|}{|X| + |Y|} \quad (8)$$
The larger the value of the evaluation index, the more accurate the predicted pixel classification is.
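The four indicators can all be computed from a k × k confusion matrix, as in the following sketch. The function name and the returned dictionary layout are choices made here for illustration, and the code assumes every class occurs at least once so that no denominator is zero.

```python
def segmentation_metrics(conf):
    """Compute PA, MPA, MIOU, per-class IOU and per-class DICE.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    Assumes every class appears at least once (no zero denominators).
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    row_sum = [sum(conf[i]) for i in range(k)]                       # true pixels per class
    col_sum = [sum(conf[i][j] for i in range(k)) for j in range(k)]  # predicted pixels per class
    diag = [conf[i][i] for i in range(k)]                            # correctly classified pixels

    pa = sum(diag) / total
    mpa = sum(diag[i] / row_sum[i] for i in range(k)) / k
    iou = [diag[i] / (row_sum[i] + col_sum[i] - diag[i]) for i in range(k)]
    dice = [2 * diag[i] / (row_sum[i] + col_sum[i]) for i in range(k)]
    return {"PA": pa, "MPA": mpa, "MIOU": sum(iou) / k, "IOU": iou, "DICE": dice}


if __name__ == "__main__":
    # Toy two-class example: rows are true classes, columns are predictions.
    print(segmentation_metrics([[3, 1], [2, 4]]))
```

Per-class values such as the road-class IOU reported later correspond to single entries of the `IOU` and `DICE` lists rather than to the class averages.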

Analysis of Training Results
The neural network in this paper was built with PyTorch, and the graphics card used was an RTX2060 8G. The parameter sizes of DSRL and our network are shown in Table 1. We compared results on the road class, which has the largest proportion in the CamVid dataset; the results are shown in Tables 2 and 3. The experimental results show that the total network parameter size was reduced from 8091 MB to 5438 MB. Compared with the DSRL network, the network structure in this paper improved the values of IOU, PA, and DICE: (1) IOU increased from 91.17% to 95.23%; (2) PA increased from 94.42% to 98.99%; (3) DICE increased from 56.59% to 60.49%.
The road segmentation diagram is shown in Figure 8 (the red part is the result of road segmentation by the network, and the gray part is the standard value). The DSRL network did not segment small objects such as thin lane lines well. After adding the high-resolution image convolution module proposed in this paper, the segmentation of small objects improved, which shows that the module can effectively make up for the information the input image loses through 1/2 downsampling. Although downsampling reduces noise, that benefit is a lower priority than preserving object details.
VGG16, ResNet101, ResNet50, and CSPdarkNet53 were used as backbone networks to compare the total network parameters, parameter size, PA, IOU, and DICE. The results are shown in Tables 4 and 5. It can be seen from Tables 4 and 5 that the network model with VGG16 as the backbone reaches IOU, PA, and DICE values similar to those of the models with ResNet50 and ResNet101 as the backbone while using fewer parameters. Taking the original image as the high-resolution network input, the segmentation results of the various backbone networks are compared in Figure 9 (the red part is the result of the segmentation of the road class by the network).
As can be seen from the backbone network segmentation pictures in Figure 9:
(1) The network with VGG16 as the backbone achieves a similar effect with half as many parameters as ResNet50 and ResNet101. In terms of the segmentation accuracy of the lane line part of the road, VGG16 and ResNet50 are similar: both segment the lane lines clearly and are better than ResNet101. In terms of the segmentation accuracy of the tire shape at the bottom of the car, VGG16 is slightly better than ResNet50 and ResNet101 and fits the tire shape more closely.
(2) The network with CSPdarkNet53 as the backbone segments the lane lines and the tire shape of the vehicle more accurately than VGG16, ResNet50, and ResNet101, and fits both more closely.
Comparing ordinary convolution, ResNet, and CSPdarknet (the three convolution structures are shown in Figure 10), it can be found that CSPdarknet splits the input feature map along the channel dimension and feeds only half of the original feature map into the residual network for processing. In forward propagation, the other half is concatenated along the channel dimension with the output of the residual network at the end. The advantages of doing this are as follows:
(1) Only half of the input is involved in the calculation, which greatly reduces the amount of calculation and memory consumption;
(2) In back propagation, a completely independent gradient propagation path is added, which prevents feature loss caused by excessive convolution and avoids reusing gradient information.
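The channel split described above can be sketched as follows, with a feature map represented as a list of channels. The names `csp_block` and `toy_residual` are placeholders introduced here, the residual unit is a toy stand-in, and which half of the channels bypasses the computation is an assumption.

```python
def csp_block(channels, residual_fn):
    """Cross-stage-partial style block over a list of channels.

    Half of the channels bypass the computation untouched; the other half
    pass through residual_fn, and the two halves are concatenated along the
    channel axis. (Which half bypasses is an assumption.)
    """
    half = len(channels) // 2
    bypass, active = channels[:half], channels[half:]
    return bypass + residual_fn(active)


def toy_residual(channels):
    """Toy residual unit (assumption): y = x + f(x) with f(x) = 0.5 * x."""
    return [[v + 0.5 * v for v in ch] for ch in channels]


if __name__ == "__main__":
    feature_map = [[1.0], [2.0], [3.0], [4.0]]  # four channels, one value each
    print(csp_block(feature_map, toy_residual))
```

Only the second half of the channels is ever touched by `residual_fn`, which is where the savings in computation and memory come from.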
Take a video shot while driving, processed on a single RTX2060 8G graphics card, as an example: the video frame rate is 25 fps, the resolution is 1920 × 1080, and the total length is 12 s. The DSRL network takes 120 s; our network takes 65 s, a 46% reduction in time. The comparison of the segmentation results between the DSRL network and our network (the red part is the actual segmentation result) is shown in Figure 11.
It can be seen from the two sets of comparison charts that the DSRL network reaches only about 2 frames per second (up to 2.31) on the actual driving video, whereas our network achieves about 4 frames per second (up to 4.5), so the segmentation is smoother. The pictures also show that our network segments faster and more accurately, and that the segmentation effect is better for detailed parts such as lane lines.
Taking a single image with a resolution of 960 × 720 as input, a speed comparison between DSRL and our network is shown in Table 6. The comparison shows that our network takes less time than the DSRL network.

Conclusions
In view of the high hardware demands of training on and using high-resolution image sets, this paper proposes a new network model based on Dual Super-Resolution Learning (DSRL) that adds a high-resolution convolution module and discards the Feature Pyramid Network (FPN). The model effectively compensates for the features lost when high-resolution images are downsampled while reducing the amount of computation. The study also found that the noise reduction brought by downsampling is a lower priority than the details in the picture. Our network model segments small features better than the DSRL network, has lower hardware requirements, and processes faster. On an actual driving video, the segmentation time is reduced by 46%, from 120 s to 65 s, which makes the model usable in actual driving. Recognition is smoother and more accurate during driving, which greatly reduces the delay caused by high-resolution input and demonstrates the effectiveness of our method. However, some delay remains, the detailed segmentation of objects is still lacking, and the network structure can be improved further.