Lane Image Detection Based on Convolution Neural Network Multi-Task Learning

Abstract: Based on deep neural network multi-task learning technology, lane image detection is studied to improve the application level of driverless technology, improve assisted driving technology and reduce traffic accidents. The lane line databases published by Caltech and the Tucson company are used; ROI (Region of Interest) extraction, scaling and inverse perspective transformation are applied to preprocess the images so as to enrich the data set and improve the efficiency of the algorithm. In this study, ZFNet is used to replace the basic network of VPGNet, and the structures are changed to improve the detection efficiency. Multi-label classification, grid box regression and object mask are used as three task modules to build a multi-task learning network named ZF-VPGNet. Considering that neural networks will be combined with embedded systems in the future, the network is compressed to CZF-VPGNet without excessively affecting the accuracy. Experimental results show that the vision system of driverless technology in this study achieved good test results. In the case of fuzzy lane lines and missing lane line marks, the improved algorithm can still detect and obtain the correct results, and it achieves high accuracy and robustness. CZF-VPGNet achieves high real-time performance (26 FPS), and a single forward pass takes about 36 ms or less.


Introduction
The perception module is one of the most important modules in an autopilot system; its role is to identify key information in the driving scene and understand the road environment. For example, drivable area detection, lane line detection, vehicle detection, pedestrian and traffic light detection, and real-time position and vehicle speed are all the responsibility of the perception module. At present, there are three schemes for the perception module of autonomous driving technology: one is based on light detection and ranging (LIDAR) sensors, such as driving radar; one is based on infrared sensors; and another is based on visual perception, such as on-board cameras. Since the lidar and infrared sensor solutions are more expensive than the vehicle camera solution, and the visual perception solution can also achieve very good results, the lane detection in this article uses a visual perception solution [1].
As a task of computer vision, vision-based lane detection can be divided into image classification, image semantic segmentation and target detection. In recent years, with the improvement in computer processing ability, the data-driven technology of deep learning has been successfully applied in various fields. Among them, the convolutional neural network (CNN) has made breakthroughs in computer vision tasks [2], including target detection, and some scholars have begun to use deep learning to solve the problem of lane line detection. Seokju Lee et al. adopted a new grid method to label lane lines (a grid is a rectangular frame, and a lane line is labeled by a set of grids composed of points on the line) and modeled the lane line detection problem as a regression problem [3]. The authors designed a multi-task convolutional network structure (VPGNet) and used the "vanishing point" information to further constrain the position of the lane line so that it can detect lane lines more accurately in real time. Pan et al. proposed a spatial convolutional neural network (SCNN), which explores the spatial correlation between row and column pixels in an image [4,5]. The authors treated the rows and columns of the feature map as network layers and thus designed a new network layer structure suitable for transmitting information along image rows and columns, realizing a convolutional network model suitable for detecting targets with slender continuous shapes (such as lane lines). In [6], a method that can directly predict the lane was proposed, which uses the differentiable property of the least squares method and combines it with a deep neural network to realize end-to-end training of the lane detection network. Ref. [7] proposed a coefficient space convolutional neural network (SSCNN) based on visual deep learning; SSCNN greatly improves the processing speed of lane line recognition compared with existing spatial CNN methods. Xiao et al. proposed an attention module (AMSC), which combines self-attention and channel attention in parallel using learnable coefficients, and applied it to the LargeFOV algorithm; the resulting attention DNN (modified LargeFOV) for lane marking detection also performs well in lane detection [8]. It can be seen that lane detection algorithms based on deep learning are in the ascendant [9]. However, some deep learning methods collect specific data sets and design specific neural network structures for the purpose, so such methods are generally not very generalizable [10].
The traditional lane line detection method is mainly based on image processing technology: it compares the lane with the surrounding environment and uses threshold segmentation to extract the effective features [11]. However, this method requires that the lane line have obvious characteristics compared with the surrounding environment, and it is easily affected by pavement damage and occlusion, with low accuracy. Deep learning refers to the construction of a multi-layer hidden artificial network model, combined with massive data sets, to extract more essential features of the target so as to improve the accuracy of classification [12]. In this study, a network structure with three task branches is used to complete lane image detection, combined with deep learning theory. Compared with the traditional lane line detection method, the lane line detection method based on deep learning has improved robustness and accuracy, but its demand for data is also higher [13].

Image Preprocessing
Image preprocessing refers to the appropriate processing of the images in the data set. The images contain a lot of sky and other areas irrelevant to the lane line, which not only prolongs the calculation time of detection, but also affects the accuracy of lane line detection. In order to extract important features from the image for the classification decision, the image must be preprocessed [14] so that the processed image meets the requirements of the deep learning lane line detection method and saves computing resources as much as possible. In this study, we preprocess the image through the steps of image grayscale processing, target area extraction, image scaling, inverse perspective transformation and image flipping.

Image Grayscale
Since this study is not based on color information, there is no need to use a color image. We grayscale the image, which not only greatly reduces the amount of calculation, but also makes the image easier to analyze. All pixel information of a grayscale image is described by a single quantized gray level, without color information, whereas each pixel of a color image (such as RGB) is composed of the three primary colors, each described by its own gray level. The operation that transforms a color image into a grayscale image is called image grayscaling; for an RGB image, when the three components are equal, it is a grayscale image. The weighted average method is used to gray the color image: the R, G and B values of each pixel are combined according to fixed weights (commonly Gray = 0.299R + 0.587G + 0.114B) to obtain the gray value.
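A minimal sketch of the weighted average conversion, assuming the common ITU-R BT.601 weights (the paper does not list its exact weight values):

```python
import numpy as np

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Weighted-average grayscale conversion.

    Uses the conventional 0.299/0.587/0.114 weights for R, G, B;
    rgb is an H x W x 3 array, the result is an H x W uint8 array.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3] @ weights).astype(np.uint8)
```

For example, a pure red pixel (255, 0, 0) maps to gray level 76, reflecting the low weight of the red channel.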

Target Area Extraction
ROI is used to capture the area of interest, i.e., the lane line area, to generate a new image. By intercepting the target area, the redundant information of the image is reduced, and the key area of the image is highlighted. As shown in Figure 1, the sky accounts for a large proportion, so using the image directly for neural network calculation is not conducive to the efficiency of the algorithm. Using ROI for image extraction is important to reduce false path detection and improve the computational efficiency [15].
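ROI extraction on a road image amounts to cropping away the sky band at the top of the frame. A sketch, where the 0.4 cut-off fraction is an illustrative assumption rather than a value from the paper:

```python
import numpy as np

def extract_roi(gray: np.ndarray, top_fraction: float = 0.4) -> np.ndarray:
    """Crop away the top part of the frame (mostly sky), keeping the road region.

    top_fraction is the fraction of image height to discard; 0.4 is only
    an illustrative default.
    """
    h = gray.shape[0]
    return gray[int(h * top_fraction):, :]
```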



Picture Zooming
If the ROI extracted image is still too large, resulting in a waste of system resources, the nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation can be used to scale the image, also known as image resampling. Among them, the nearest neighbor interpolation method has a small amount of calculation, but it may appear jagged in the changed place. The bilinear interpolation method overcomes the shortcomings of the nearest neighbor interpolation method, but it may also have blurred image contours, so the bicubic interpolation is a more appropriate algorithm for image scaling.
The basic principle is as follows: assuming the size of the source image A is m × n, the target image B after scaling K times is M × N, and the corresponding coordinate conversion is given by Equation (1). Each pixel of source image A is known, while the pixels of target image B are unknown. Each pixel value of target image B is obtained by a weighted superposition of the nearest 16 pixels p_ij (i, j = 0, 1, 2, 3) around the corresponding point in source image A, as shown in Equation (2). The weights are obtained from the bicubic kernel in Equation (3), where the parameter d is the distance from the surrounding pixels to the target pixel; c usually takes a value of −0.5.
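The bicubic kernel of Equation (3) is not reproduced in this extraction; the standard form with parameter c = −0.5 can be sketched as follows (the output pixel is then the double sum of kernel weights times the 16 neighbouring pixels p_ij):

```python
def bicubic_weight(d: float, c: float = -0.5) -> float:
    """Standard bicubic kernel weight for a neighbour at distance d.

    Piecewise cubic: one polynomial for |d| <= 1, another for 1 < |d| < 2,
    and zero beyond; c = -0.5 is the conventional parameter value.
    """
    d = abs(d)
    if d <= 1:
        return (c + 2) * d**3 - (c + 3) * d**2 + 1
    if d < 2:
        return c * d**3 - 5 * c * d**2 + 8 * c * d - 4 * c
    return 0.0
```

The kernel equals 1 at distance 0, passes through 0 at integer distances, and vanishes outside the 4 × 4 neighbourhood, which is why exactly 16 source pixels contribute to each target pixel.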

Inverse Perspective Transformation
If the lane presents a convergence state, the original image is distorted by perspective. In order to reduce this distortion, an IPM (inverse perspective mapping) image, also known as the aerial view, is obtained by inverse perspective transformation [16]. Compared with the original image, the IPM presents the characteristics of the real world without deformation: the collected lane lines are parallel and equal in width, which can greatly improve the accuracy of detection, as shown in Figure 2. The conversion from the original image to the bird's-eye view is a conversion from the image coordinate system to the world coordinate system. The conversion of each pixel is described by Equations (4) and (5); (X, Y, Z) represents the world coordinate system; (U, V) represents the image coordinate system; C = (C_x, C_y, C_z) is the position of the camera in the world coordinate system; AlphaV and AlphaU respectively represent the angular aperture of the camera in the vertical and horizontal directions, calculated by Equation (6); F is the focal length of the camera; and H and W are the height and width of the photosensitive element of the camera. If the road image has a certain turning angle, X and Y also need to be multiplied by the sine or cosine of the compensation angle.
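The paper's Equations (4)-(6) derive the mapping from camera pose and focal length; an equivalent and common way to realize an IPM in practice is a 3 × 3 homography fitted to four ground-plane point correspondences. A self-contained sketch (the point pairs would come from calibration; the values below are purely illustrative):

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve the 3x3 homography mapping 4 src points to 4 dst points.

    Builds the standard 8x8 linear system (h33 fixed to 1) from the
    correspondences; src and dst are lists of (x, y) tuples.
    """
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, x, y):
    """Apply the homography to one pixel and dehomogenize."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

Warping every pixel of the road trapezoid through H (or its inverse) produces the bird's-eye view in which lane lines are parallel and equally wide.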

Basic Network Selection
A convolutional neural network is developed from the neural network; it not only classifies images but also solves regression problems, and the combination of the two achieves image detection. In this study, the last several layers of the convolutional neural network are changed to realize multi-task learning. Multi-task learning shares the feature extraction of the basic network, so the selection of the basic network is very important. In this study, ZFNet is used to replace the basic network of VPGNet, and the structures are changed, so the efficiency is improved.

ZFNet Network
ZFNet was proposed by Zeiler et al. in 2013 and slightly changes the structure of AlexNet. We use the first five layers of ZFNet to replace the basic network of VPGNet. Zeiler observed that the stride and convolution kernel of the first layer of AlexNet are too large, so the stride is reduced from 4 to 2 and the convolution kernel from 11 × 11 to 7 × 7. In order to ensure the consistency of the data input and output, the network parameters of some layers of ZFNet are modified [17]. In addition, because the first-layer convolution kernels have an outsized influence on the feature visualization, the first-layer kernels are normalized. Because of the smaller stride and convolution kernel of the first layer of ZFNet and the lower downsampling of the input data, the whole network occupies more video memory during training. In order to utilize GPU resources reasonably, the batch size of ZFNet training is reduced. After these modifications, the classification effect of ZFNet is obviously improved.
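The effect of the smaller first-layer stride can be checked with the standard convolution output-size formula; the 224-pixel input and zero padding below are illustrative assumptions:

```python
def conv_out(size: int, kernel: int, stride: int, pad: int = 0) -> int:
    """Spatial output size of a convolution: floor((size - kernel + 2*pad)/stride) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# First-layer comparison on a 224-pixel input (padding assumed 0):
alexnet_out = conv_out(224, kernel=11, stride=4)  # 54
zfnet_out = conv_out(224, kernel=7, stride=2)     # 109
```

Halving the stride roughly doubles each spatial dimension of the first feature map (54 vs. 109 here), which is exactly why ZFNet retains finer detail but needs more video memory, as noted above.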

Multi Task Structure
In reality, many learning tasks are related and have shared knowledge patterns. Multi-task learning is a machine learning framework that enables different learning tasks to share common knowledge while maintaining their independence, so as to improve the generalization performance, model learning efficiency and prediction accuracy of all tasks [18]. ZF-VPGNet adopts an FCN (fully convolutional network)-like network structure with three task branches and does not fully upsample the picture: the first five layers downsample the image 32 times, and then 1 × 1 convolutions upsample it 4 times, so processing is faster than with full upsampling, as shown in Table 1 and Figure 3. The network structure has three task modules: multi-label classification, grid box regression and object mask.


Figure 3 shows the multitask network structure of our ZF-VPGNet. ZF-VPGNet performs three tasks: multi-label classification, object mask and grid box regression. The following are descriptions of these three tasks:

• Multi-Label classification task module
The multi-label classification is used to classify the images, and the probability map with the size of 60 × 80 × channel is the output. The coordinates of each channel with a probability greater than 0.5 are taken as the output of different categories, and up to four different categories of lane lines can be output [5].
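The thresholding step described above can be sketched as follows, assuming the probability map is an H x W x C array as stated (60 × 80 × channel); the function name is illustrative:

```python
import numpy as np

def decode_multilabel(prob_map: np.ndarray, threshold: float = 0.5):
    """Turn an H x W x C probability map into per-class coordinate lists.

    For each channel, coordinates whose probability exceeds the threshold
    (0.5 in the paper) are emitted as detections of that class.
    """
    results = {}
    for c in range(prob_map.shape[2]):
        ys, xs = np.where(prob_map[:, :, c] > threshold)
        results[c] = list(zip(ys.tolist(), xs.tolist()))
    return results
```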

• Object Mask task module
Object mask is a mask detector module that slides a 4 × 4 window across the whole image, with the window centered on each image region. The object mask defines the minimum area that the neural network can resolve. Although some small objects may be ignored, it does not need to regress point by point; instead, it regresses the 4 × 4 grid module, which improves the operation speed of the system [19].

• Grid box regression task module
Grid box regression is used to detect and locate the lane line [3]. Each lane contains many adjacent grid modules, and the distance regression method is used to regress these adjacent grid modules to a single target.

Data Layer
In this study, a grid is used to transform the point annotation of the lane line into a grid annotation so as to increase the feature information of the lane line. The traditional detection algorithm uses a rectangular frame to mark the target, with one frame representing one object. However, in this study, due to its special location and shape, the lane line is not suitable to be labeled with a single frame, so many 8 × 8 grid modules are used to label the lane line, and adjacent grid modules are regressed to one object, that is, a lane line. Each grid module on the lane line is similar to the rectangular box marking of single object detection, which enables the network to locate other objects on the lane at the same time.
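The point-to-grid conversion can be sketched as follows, assuming point annotations in pixel coordinates; the function name is illustrative and not from the paper:

```python
def points_to_grid_labels(points, grid: int = 8):
    """Convert point annotations (x, y) on a lane line into the set of
    8 x 8 grid cells that cover them, i.e. grid-box labels.

    Returns sorted (col, row) cell indices; each cell plays the role of a
    small rectangular box annotation.
    """
    cells = {(int(x) // grid, int(y) // grid) for x, y in points}
    return sorted(cells)
```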

Network Model Compression
A driverless vehicle system needs a mobile embedded platform, so a lane detection algorithm also needs to run on a mobile embedded platform. Although the traditional network model has high precision, it requires a lot of memory and parameters, so it is not suitable for the embedded platform. In practical applications, it is usually necessary to make some trade-offs: if the algorithm is applied to a mobile platform, memory consumption is the primary optimization goal, and a certain amount of precision can be sacrificed in exchange for a smaller network model.
There are four main network compression methods. One is parameter pruning and sharing; in this study, based on the VPGNet network, the channel pruning method is used to compress the network model. The second is compact convolutional filters: by using small convolution kernels and Inception-style structures, as in GoogLeNet, multiple small convolution kernels can be combined to approximate the effect of a large convolution kernel while reducing parameters and computation. The other two methods are low-rank factorization and knowledge distillation [20].
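A minimal sketch of channel pruning on one convolutional layer; the L1-norm ranking criterion and the keep ratio are illustrative assumptions, since the paper does not specify its pruning criterion:

```python
import numpy as np

def prune_channels(weights: np.ndarray, keep_ratio: float = 0.75):
    """Channel pruning sketch for a conv weight tensor (out_ch, in_ch, kh, kw).

    Scores each output channel by the L1 norm of its weights (channels full
    of near-zero weights score low) and keeps only the strongest ones.
    Returns the pruned tensor and the kept channel indices.
    """
    scores = np.abs(weights).sum(axis=(1, 2, 3))       # one L1 score per channel
    n_keep = max(1, int(round(weights.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])  # indices of kept channels
    return weights[keep], keep
```

After pruning a layer, the corresponding input channels of the next layer must be removed as well, and the network is retrained to recover accuracy, as described in the Results section.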

Multi-Task Loss Function
The conventional method of computing a multi-task learning loss is to simply add the losses of each task or to set uniform loss weights, possibly adjusting the weights manually. These methods are inefficient because the loss scales of different tasks differ, sometimes so much that the overall loss is dominated by a single task. In order to ensure that the loss function correctly influences the learning of the network sharing layer, we consider the homoscedastic uncertainty of each task to set the weight of the loss function of the different tasks, as shown in Formula (7).
Let us assume that f_W(x) is the output of a neural network with weights W on input x, σ is the observed network model noise parameter, L_i is the loss of task i, and L(W, σ_1, σ_2, σ_3) is the multi-task loss function of the network model. The training loss of the different tasks is shown in Figure 4.
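Formula (7) is not reproduced in this extraction; under the homoscedastic-uncertainty weighting described above (following the standard formulation of Kendall et al.), it would take a form like:

```latex
L(W,\sigma_1,\sigma_2,\sigma_3)=\sum_{i=1}^{3}\left(\frac{1}{2\sigma_i^{2}}\,L_i(W)+\log\sigma_i\right)
```

where the learned noise parameters σ_i automatically down-weight tasks with larger loss scales, and the log σ_i terms prevent the trivial solution σ_i → ∞.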


Data Composition
A deep learning neural network needs a lot of data for training, and there are few open lane databases at present. In this study, 1225 pictures published by Caltech and 5599 pictures published by Tucson are used. Caltech's database is the most suitable lane line database for this study at present; it collects images of urban roads in four scenarios, as shown in Table 2 [21]. The data set contains much complicated data, such as interfering signs, intense light, shadows, and a large number of vehicles. Compared with the expressway, urban roads have more kinds of lane lines, more kinds of shelter and more complex road conditions. Tucson's roughly labeled lane line data set adopts scene pictures taken while the vehicle is driving: it collects 20 frames of images for each second of driving video and marks only the last image of each second. The database contains a large number of complex driving scenarios, including lane line images under good and moderate weather conditions, road images with 2/3/4 or more lanes, and images under different traffic conditions; the data set also contains about 30% curved lane lines, so it is very suitable for this study.

Evaluation
We use the F1 score as the evaluation index to determine whether a lane mark is correctly detected. First, we calculate the intersection-over-union (IoU) between the ground truth and the prediction. A prediction whose IoU is greater than a specific threshold is recorded as a true positive (TP), and otherwise as a false positive (FP). At the same time, a ground-truth lane that is not matched by any prediction is counted as a false negative (FN), and the F1 score is computed from the resulting precision and recall.
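The evaluation metrics above can be sketched directly; the axis-aligned-box IoU below is a simplification (grid boxes are axis-aligned rectangles), and the IoU threshold itself is whatever the experiment fixes:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```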

Results and Discussion
We choose the ZFNet network as the basis of the network structure and customize and optimize the network to further reduce its complexity. By observing the weights of the convolution kernels during training, it is found that there are a large number of weights close to 0 in the 3rd to 8th convolutional layers of the network. Network sparsity brings a small improvement in network performance, but with it an exponential increase in computation. After multiple training tests, we cut the number of input convolution kernels of the 3rd to 8th convolutional layers of the above two networks without excessively reducing performance, reducing the number of near-zero weights in the retrained network model, and finally obtain a high-performance network model. Ubuntu 16.04 is used as the running platform, training on an RTX2080Ti GPU; Caffe is the deep learning framework; an adaptive learning rate is used (ε is a small positive number to prevent division by zero, Equation (10)).
It can be seen from the results that the system is robust to the interference of ground obstacles, light changes, occlusion, shadow and so on. The fuzzy lane line and missing lane line mark are shown in Figure 5c. The improved algorithm can still detect and obtain the correct results, which shows the feasibility of the algorithm. On the basis of the test results, the fitting operation can be carried out. However, fitting is difficult when two different types of roads appear or the prediction results are wrong, and further fitting by category is needed. Because the data in this study are not rich enough, there is an abnormal situation in which there is no output, or there is a marker when there is no road; more data sets and the classification of road markers are required to eliminate this impact.
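The adaptive learning rate with a small ε in the denominator (Equation (10)) matches the AdaGrad-style rules available in Caffe; the exact equation is not reproduced in this extraction, so the following is only a sketch of the standard form:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """AdaGrad-style update (assumed form, not the paper's exact Equation (10)).

    Each parameter's step is scaled by the root of its accumulated squared
    gradients; eps is the small positive number preventing division by zero.
    """
    cache = cache + grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```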
In order to test the superiority of the algorithm, parts of the Cordoval1 and Washington1 scenarios in the open Caltech database, together with Tucson scenarios, are selected for testing, and the comprehensive F1 score is used as the evaluation index. The test results are shown in Table 3 and Figure 6.

It can be concluded from Table 3 that the performance of the lane line detection algorithm based on the improved network (ZF-VPGNet) is much higher than that of the traditional lane line detection algorithm. On Cordoval1, our ZF-VPGNet improved the F1 score by 3.8% compared with other excellent network models; on Washington1, it also improved to some extent compared with other excellent network models.
It can be seen from the data set that the Washington1 scene is more complicated than the Cordoval1 scene, and the experimental results are consistent with this; complex road conditions and road images under shadow place higher requirements on the network model. As shown in Figure 6, although the compression of the network model leads to a decline in detection performance, the compressed network model CZF-VPGNet has higher real-time performance (26 fps) and takes less time (36 ms) for a single forward pass. The compressed network model performs better overall in terms of operation speed and memory footprint.

Conclusions
By using the improved network labeling in this paper, the feature information of the lane line is enriched, and the network can locate other objects on the lane line. Using the improved ZFNet and VGGNet networks instead of the DriveNet and VPGNet networks, the experiments show that the improved networks classify better. At the same time, in order to transplant the neural network to embedded devices, the VPGNet network is compressed. We design a new convolution method to reduce the network parameters and calculations. The accuracy of the compressed network model does not change significantly, and the running speed is improved. Under the same conditions, the recognition rate of this algorithm is high, and a balance between recognition rate and running time is basically achieved. In order to verify the feasibility of the improved network lane detection algorithm and the performance of the compressed network model, this research is carried out only under usual road conditions. In future research, we will continue our work under more complex road conditions (such as bad weather and intersections), and we will conduct research on random, untargeted adversarial examples [22] on the basis of this study, because in autonomous vehicles some adversarial examples are likely to cause fatal accidents.