Research on Lane Line Detection Algorithm Based on Instance Segmentation

Current lane line detection algorithms perform poorly in complex traffic scenes, where lane lines are occluded by shadows, road markings are blurred, or lane lines are sparse, which leads to low detection accuracy and poor real-time detection speed. To address these problems, this paper proposes a lane line detection algorithm based on instance segmentation. Firstly, the improved lightweight network RepVgg-A0 is used to encode road images, which expands the receptive field of the network; secondly, a multi-size asymmetric shuffling convolution model is proposed for the characteristics of sparse and slender lane lines, which enhances the ability to extract lane line features; an adaptive upsampling model is further proposed as a decoder, which upsamples the feature map to the original resolution for pixel-level classification and detection, and a lane line prediction branch is added to output the confidence of each lane line; finally, the instance segmentation-based lane line detection algorithm is deployed on the embedded platform Jetson Nano, and half-precision acceleration is performed using NVIDIA's TensorRT framework. The experimental results show that the accuracy (Acc) of the lane line detection algorithm based on instance segmentation is 96.7% and the detection speed is 77.5 fps. When deployed on the embedded platform Jetson Nano, the detection speed reaches 27 fps.


Introduction
Lane line detection is a crucial component of the road surface information and environment perception used in autonomous driving technology, which contains semantic information about road areas, identifies the direction of travel, and enhances guidance data. Because deep learning and artificial intelligence have advanced so quickly, lane line detection technology is now able to provide automated driving vehicles with collision warning, lane departure warning, and auxiliary environment perception information, as well as to help the system realize lane path planning [1,2]. This increases the safety of automated driving.
Lane lines, the most important road traffic signs, constrain vehicles' driving paths. Current mainstream lane line detection algorithms fall into two broad categories: traditional algorithms based on features and models, and deep learning-based algorithms.
Traditional lane line detection algorithms first preprocess the image according to the specific use scene to eliminate noise interference; then use preset shape, color, or spatial features to match the road image for feature extraction; and finally use the least-squares algorithm to fit the lane line. Mammeri et al. [3] proposed a lane line detection system combining maximally stable extremal regions (MSER) and the Hough transform, which used matching features such as the color and shape of lane lines to detect lane lines. Sotelo et al. [4] developed a road segmentation algorithm based on the HSI color space and a two-dimensional constrained space for obtaining the lane line information.

1. The RepVgg-A0 network is improved to expand the receptive field of the network without increasing the amount of calculation, and a multi-size asymmetric shuffled convolution model is proposed to enhance the ability to extract sparse and slender lane lines.
2. An adaptive upsampling model is proposed, which allows the network to select the weight of the two upsampling methods at each position; at the same time, a lane line prediction branch is added to output the lane line confidence.
3. The lane line detection algorithm is deployed on the embedded platform Jetson Nano, and the TensorRT framework is used for half-precision acceleration so that its detection speed meets the needs of real-time detection.
This paper is arranged as follows: Section 2 designs a lane line detection model based on instance segmentation. In Section 3, lane line detection experiments are carried out on the extended TuSimple dataset and on video collected by a real car, and the model is deployed to the mobile terminal. Finally, Section 4 summarizes the content of this paper and provides directions for future work.

Design of Lane Line Detection Model
The lane line detection instance segmentation model is built with reference to the encoder-decoder network structure [19]. The model framework includes an encoder, a feature enhancement model, a decoder, and a lane line prediction branch. Firstly, the encoder uses the improved lightweight network RepVgg-A0 to encode the road image; secondly, the feature enhancement model uses multi-size asymmetric shuffled convolution modules to enhance the ability to extract lane line features; further, the adaptive upsampling model is used as the decoder, which upsamples the feature map to the original resolution for pixel-level classification and detection, and the lane line prediction branch is added to output the lane line confidence.

Design of Encoder Network Structure
After the road image is input, the lightweight network RepVgg-A0 is used as the encoder of the model to initially extract the features of the lane line. RepVgg follows the lightweight model design guidelines proposed in ShuffleNet V2 [20] and proposes the idea of structural reparameterization. Different model structures are used in the training and inference stages, and different branch structures are cleverly fused during inference to reduce memory usage and speed up model inference. This is also one of the reasons why the lane line detection algorithm in this paper has a faster inference speed.
The lightweight RepVgg-A0 [21] network downsamples and compresses the input image to 1/32 of the original image through 3 convolutional layers with a step size of 2, reducing the image resolution while increasing the receptive field. However, too small a resolution causes the encoded image to lose a lot of spatial information, and it is difficult for the subsequent decoding process to recover this information, which affects the accuracy of lane line detection. Therefore, the RepVgg-A0 network structure is adjusted, and the step size of the convolution in the last 2 layers of the network is set to 1 so that the downsampling ratio is reduced from 32 times to 8 times. After this change, the encoded feature map is larger and retains more original information about the lane line, but another problem arises: the receptive field of the network becomes smaller, and it is difficult for it to learn global features. Therefore, hole (dilated) convolution is introduced in the last 2 layers of the network: the conventional 3 × 3 convolutions in the last 2 layers of the RepVgg-A0 network are replaced by 3 × 3 hole convolutions with hole rates of 2 and 4, respectively, which introduces no additional calculations.
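The effect of the stride and hole-rate changes described above can be illustrated with a short receptive-field calculation. The sketch below is not the RepVgg-A0 architecture itself: it assumes a simplified stack with one 3 × 3 convolution per stage (five stride-2 stages for the 32-times case), and the function name `stack_stats` is introduced here for illustration only.

```python
def stack_stats(strides, dilations, kernel=3):
    """Return (downsampling ratio, receptive field) of a stack of convolutions.

    Uses the standard recurrence: each layer enlarges the receptive field by
    (kernel - 1) * dilation * jump, where jump is the product of the strides
    of all layers before it.
    """
    jump, rf = 1, 1
    for s, d in zip(strides, dilations):
        rf += (kernel - 1) * d * jump
        jump *= s
    return jump, rf


# Assumed simplified stacks: one 3x3 convolution per stage, not the real layer counts.
original = stack_stats([2, 2, 2, 2, 2], [1, 1, 1, 1, 1])
stride_only = stack_stats([2, 2, 2, 1, 1], [1, 1, 1, 1, 1])
with_dilation = stack_stats([2, 2, 2, 1, 1], [1, 1, 1, 2, 4])

print(original)       # (32, 63)
print(stride_only)    # (8, 47)
print(with_dilation)  # (8, 111)
```

In this simplified setting, reducing the stride alone keeps the downsampling ratio at 8 but shrinks the receptive field, while hole rates of 2 and 4 enlarge it again with the same 3 × 3 kernels, that is, without additional weights.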
After the input 3-channel image passes through the improved RepVgg-A0 network for initial feature extraction, the number of channels becomes 1280 and the resolution is reduced to 1/8 of the original. To reduce the calculation amount of subsequent operations while fusing the extracted features, a 1 × 1 convolution is added as the last layer of the encoder to compress the number of channels to 128. The overall structure of the encoder is shown in Table 1, and its network structure is shown in Figure 1.

Figure 1. Improved encoder network structure diagram.

Design of Feature Enhancement Model
Lane line detection is different from conventional object detection. A lane line usually spans the entire image, which requires the network to have a sufficiently large receptive field. For an instance segmentation network, an effective way to increase the receptive field is to use larger convolution kernels. Inspired by the ShuffleNet V2 network, this paper designs a multi-size shuffled convolution module containing convolution kernels of three sizes: 3 × 3, 5 × 5, and 7 × 7. The 3 × 3 convolutions extract the detailed features of lane lines, while the 5 × 5 and 7 × 7 convolutions have larger receptive fields and can capture larger-scale lane line features. The structure of the multi-size shuffled convolution module is shown in Figure 2a. After the feature map is input, the channels are first divided into two branches, and the secondary branch performs an identity mapping. The main branch then performs 3 × 3, 5 × 5, and 7 × 7 convolutions in sequence, with the FReLU activation function [22] adding nonlinearity after each convolution. Finally, the channels of the main and secondary branches are concatenated, and a channel shuffle operation is performed over all channels to promote the fusion of feature information between channels.
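The split, concatenate, and shuffle steps described above can be sketched on channel labels alone. This is a minimal illustration, not the authors' implementation: the equal split, the choice of which half is the secondary branch, the use of two shuffle groups, and the names `channel_shuffle` and `split_shuffle_block` are all assumptions, and the main branch is left as a placeholder for the three convolutions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` equally sized groups.

    Equivalent to reshaping to (groups, n), transposing to (n, groups),
    and flattening again.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


def split_shuffle_block(channels, main_branch):
    """Split channels in half, transform the main half, concatenate, shuffle."""
    half = len(channels) // 2
    secondary, main = channels[:half], channels[half:]
    return channel_shuffle(secondary + main_branch(main), groups=2)


# Channels are represented by labels so that the reordering is visible.
labels = ["s0", "s1", "s2", "s3", "m0", "m1", "m2", "m3"]
shuffled = split_shuffle_block(labels, main_branch=lambda x: x)
print(shuffled)
# ['s0', 'm0', 's1', 'm1', 's2', 'm2', 's3', 'm3']
```

After the shuffle, every pair of neighboring channels contains one channel from each branch, so the next module's split mixes information from both.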
The computational cost of using shuffled convolution modules is somewhat lower than using large convolution kernels directly, but the 5 × 5 and 7 × 7 convolutions still require a lot of calculations. In order to further simplify calculations, asymmetric convolutions are introduced in this paper. Asymmetric convolution reduces the amount of calculation by substituting k × 1 and 1 × k convolutions for traditional k × k convolutions. The convolution calculations and parameters for the standard k × k convolutions are as follows:

Sensors 2023, 23, 789

In these formulas, the variables denote the height and width of the input feature map and the channel numbers of the input and output feature maps, respectively. Corresponding expressions give the parameters and calculations of the asymmetric convolution equivalent to a k × k convolution. It can be seen from Formulas (1)-(4) that the larger the convolution kernel, the greater the reduction in parameters and calculations obtained by converting it into an asymmetric convolution. In addition, some studies have shown that asymmetric convolution works better when applied to the middle layers of a network [23]. Therefore, the 5 × 5 and 7 × 7 convolutions in the multi-size shuffled convolution module are replaced by asymmetric convolutions, and a multi-size asymmetric shuffled convolution module, as shown in Figure 2b, is designed. For a fixed 46 × 80 × 128 input feature map, the number of module parameters and calculations are reduced by 60.24% and 61.47%, respectively.
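Formulas (1)-(4) referred to above can be sketched in the following standard form. This is a reconstruction under stated assumptions rather than the paper's exact notation: the symbols $H$, $W$, $C_{in}$, $C_{out}$, $P$, and $F$ and the order of the numbering are chosen here; bias terms are ignored; the output is assumed to keep the $H \times W$ spatial size; calculations are counted as multiply-accumulate operations; and both the $k \times 1$ and $1 \times k$ layers are assumed to cost the same, which is exact when $C_{in} = C_{out}$.

```latex
\begin{align}
P_{k \times k} &= k^{2}\, C_{in}\, C_{out} \tag{1} \\
F_{k \times k} &= H\, W\, k^{2}\, C_{in}\, C_{out} \tag{2} \\
P_{asym} &= 2k\, C_{in}\, C_{out} \tag{3} \\
F_{asym} &= 2\, H\, W\, k\, C_{in}\, C_{out} \tag{4}
\end{align}
```

Under these assumptions, the ratio of asymmetric to standard cost is $2/k$ for both parameters and calculations, that is, $2/3$ for $k = 3$, $2/5$ for $k = 5$, and $2/7$ for $k = 7$, which is consistent with the observation that larger kernels benefit more from the conversion.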
Six multi-size asymmetric shuffled convolution modules are stacked to form the feature enhancement model of the lane line detection model in this paper. The last five modules use hole convolution to further expand the receptive field, with hole rates of 2, 4, 6, 8, and 10, respectively. The feature enhancement model further extracts the lane line information in the feature map output by the encoder and passes the result to the lane line prediction branch and the decoder; its corresponding network structure is shown in Figure 3.

Design of Decoder Network Structure
The decoder's job is to classify each pixel by upsampling the low-resolution feature map, which contains rich feature information, to the size of the input image. The two most popular upsampling algorithms are bilinear interpolation and transposed convolution. However, these algorithms ignore the gradient in pixel values between adjacent points, which degrades the detailed features of the upsampled image, and they also tend to neglect coarse-grained features. To address these problems, Zheng et al. [24] proposed a bilateral upsampling module, which directly adds the upsampling results of bilinear interpolation and transposed convolution. This achieved certain results, but it does not consider how suitable each of the two upsampling methods is for a specific image area. To extract image features effectively, this paper proposes an adaptive upsampling module that enables the network to choose the weight of the two upsampling methods at each location.
The adaptive upsampling module structure is shown in Figure 4. Given an H × W × C input feature map, bilinear interpolation and transposed convolution are first used to perform upsampling, yielding two 2H × 2W × C/2 upsampled feature maps, E and F. Then, E and F are concatenated along the channel dimension to obtain a 2H × 2W × C feature map G, a 3 × 3 convolution is applied to G to extract the spatial attention description S (2H × 2W × 2), and the Softmax function is applied to S to obtain two attention weights of size 2H × 2W × 1. Finally, E and F are weighted by their respective attention weights and summed to obtain the final upsampling result (2H × 2W × C/2).
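The final weighting step of the adaptive upsampling module can be sketched in plain Python as follows. This is a minimal illustration under assumptions: feature maps are nested lists rather than tensors, the attention description S is supplied directly instead of being produced by the 3 × 3 convolution on G, and the names `softmax2` and `adaptive_fuse` are introduced here.

```python
import math


def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def adaptive_fuse(E, F, S):
    """Fuse two upsampled maps with per-position attention weights.

    E, F: nested lists of shape [rows][cols][channels] (the two upsampling results).
    S:    nested list of shape [rows][cols][2] (attention logits, one per method).
    """
    out = []
    for e_row, f_row, s_row in zip(E, F, S):
        out_row = []
        for e, f, (sa, sb) in zip(e_row, f_row, s_row):
            wa, wb = softmax2(sa, sb)
            out_row.append([wa * ev + wb * fv for ev, fv in zip(e, f)])
        out.append(out_row)
    return out


# Toy 1 x 2 map with 2 channels; in the module, S would come from the 3 x 3 convolution on G.
E = [[[1.0, 1.0], [1.0, 1.0]]]
F = [[[3.0, 3.0], [3.0, 3.0]]]
S = [[[0.0, 0.0], [50.0, -50.0]]]
fused = adaptive_fuse(E, F, S)
print(fused[0][0])  # [2.0, 2.0]: equal logits average the two results
```

At the second position the logits strongly favor E, so the fused value is essentially E alone. Equal weights everywhere would give a uniform average of the two results, whereas the learned S lets each location prefer one method.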

Figure 4. Adaptive upsampling module.
The bilinear interpolation upsampling structure in the adaptive upsampling module is shown in Figure 5a. Firstly, a 1 × 1 convolution is used to halve the number of channels, and then bilinear interpolation is performed to double the size of the feature map. The upsampling structure of the transposed convolution is shown in Figure 5b. A transposed convolution with a step size of 2 is used to double the size of the feature map while compressing the number of channels, followed by two asymmetric convolution (Non-bt-1D) modules [25].
The adaptive upsampling module is stacked three times to form the decoder of the lane line detection model in this paper, and its corresponding network structure is shown in Figure 6. The input feature map is decoded by the decoder and upsampled to the original image size, and the number of channels is reduced to 7. The first channel is used to predict the background, and the other channels directly predict the pixel coordinates of the lane line instances, which gives a faster detection speed than algorithms that first perform semantic segmentation and then fit the lane lines.

Figure 6. Decoder network structure.

Design of Lane Line Prediction Branch
This paper develops a lane line prediction branch to assess the existence of each lane line and output a confidence value for it. Figure 7a depicts the network's internal structure, and Figure 7b depicts the network's external structure. Firstly, the number of channels is reduced to 7 through a 1 × 1 convolution, and after Softmax activation, average pooling with a step size of 2 is used to downsample to 23 × 40 × 7. Then, two consecutive fully connected layers are applied, activated by ReLU and Sigmoid, respectively, and the output is a one-dimensional vector of length 6 representing the probability that each of the 6 pre-selected lane lines exists. In actual use, a confidence threshold is set: when the confidence is greater than the threshold, the lane line is considered to exist; otherwise, it is considered absent. This paper sets the threshold to 0.5.
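The thresholding step at the end of the prediction branch can be sketched as follows. The raw scores, the function name `lane_existence`, and the mapping of lane slot i to decoder channel i + 1 (with channel 0 as the background) are assumptions made for illustration.

```python
import math


def lane_existence(logits, threshold=0.5):
    """Map raw scores to Sigmoid confidences and threshold them."""
    confidences = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    exists = [c > threshold for c in confidences]
    return confidences, exists


# Hypothetical raw outputs of the second fully connected layer for 6 lane slots.
logits = [2.0, -1.5, 0.0, 3.2, -4.0, 0.3]
confidences, exists = lane_existence(logits)
print(exists)  # [True, False, False, True, False, True]

# Assumed mapping: slot i corresponds to decoder channel i + 1 (channel 0 is background).
kept_channels = [i + 1 for i, e in enumerate(exists) if e]
print(kept_channels)  # [1, 4, 6]
```

Because the comparison is strict, a confidence of exactly 0.5 (a raw score of 0) is treated as "lane absent" in this sketch.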

Dataset and Preprocessing
This paper is based on the TuSimple [26] dataset, which comprises video images collected on American highways. There are 20 frames in each segment. Because of the large number of video frames, the original dataset labels only the final frame of each 20-frame segment. To improve the dataset's generalizability, the first frame and the tenth and eleventh frames in the middle of each segment are additionally chosen for labeling. The labeling file is in JSON format, and a point is marked every ten pixels in the vertical direction. The extended dataset contains 25,632 road images. Different from the original dataset, 14,504 images are selected for training, 2325 for validation, and 8803 for testing. To enhance the diversity of the data and improve the robustness of the model, data augmentation is performed on the training set, including random rotation and random horizontal deflection. Figure 9 shows some common scenes in the dataset. Each image has 2 to 5 marked lane lines. In this paper, these discrete lane line coordinate points are connected to form an instance image that serves as the ground-truth label.
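Reading such labels can be sketched as below. The field names `lanes`, `h_samples`, and `raw_file`, and the use of -2 for rows without a lane point, follow the public TuSimple label layout as commonly documented; it is an assumption that the extended annotations in this paper use the same layout. The sample record and the function name `lanes_to_points` are hypothetical.

```python
import json


def lanes_to_points(label_line):
    """Convert one TuSimple-style label line into per-lane (x, y) point lists."""
    record = json.loads(label_line)
    lanes = []
    for xs in record["lanes"]:
        # A negative x value (-2) means the lane has no point at that row.
        points = [(x, y) for x, y in zip(xs, record["h_samples"]) if x >= 0]
        if points:
            lanes.append(points)
    return lanes


# Hypothetical shortened label line (label files hold one JSON object per line).
sample = (
    '{"lanes": [[-2, 500, 480, 460], [-2, -2, -2, -2], [700, 720, 740, 760]], '
    '"h_samples": [240, 250, 260, 270], "raw_file": "example.jpg"}'
)
parsed = lanes_to_points(sample)
for lane in parsed:
    print(lane)
```

Connecting consecutive points of each returned lane with line segments, one label value per lane, would give an instance image of the kind described above.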

Experiment Preparation
The server used in the experiment has an 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70 GHz 2.69 GHz processor, 512 GB memory, and an NVIDIA GeForce RTX3050 graphics processor. The operating system is Windows 10 Professional, the deep learning framework is TensorFlow 2.4 (GPU version), and the CUDA version is 11.0.
The video acquisition device used in the experiment is a front-view camera, as shown in Figure 10, with a resolution of 2592 × 1944. The experimental vehicle is a Volkswagen Sagitar, the embedded platform Jetson Nano is used for mobile deployment, and the operating system is Ubuntu 18.04. The details are shown in Figure 11.

Training on one batch followed by an update of the model parameters is recorded as one training step. The maximum number of iterations is set to 300 and the maximum number of training steps to 80,000; training stops when the number of training steps exceeds this value. The learning rate L is determined by a formula in terms of the current number of training steps and the maximum number of training steps.

Model Evaluation Index and Performance Comparison of Different Models
The lane line detection model in this paper is evaluated using the official evaluation method provided by TuSimple. Each detected lane line is represented by a set of x-axis coordinates at fixed y-axis positions. The difference between the number of detected lane lines and the number of real lane lines cannot be greater than two; otherwise, it is judged that no lane line is detected. The evaluation indicators include accuracy (Acc), false positive rate (FP), false negative rate (FN), parameter amount (Params), floating point calculation amount (FLOPs), and running speed (FPS). Accuracy is calculated from N_pred, the number of correctly detected lane line points, and N_gt, the number of real lane line points. The false positive rate and false negative rate are calculated from F_pred, the number of wrongly predicted lane lines, and M_pred, the number of real lane lines that have not been predicted. The lower the values of FP and FN, the better the model performance.
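The accuracy indicator can be sketched as the ratio of correctly detected points to real points, N_pred / N_gt. The code below is a simplified illustration rather than the official TuSimple script: it assumes a fixed 20-pixel tolerance for a point to count as correct, compares lanes pairwise in the order given instead of matching them, and uses the hypothetical function name `point_accuracy`.

```python
def point_accuracy(pred_lanes, gt_lanes, pixel_threshold=20.0):
    """Return N_pred / N_gt for lanes sampled at the same fixed y positions.

    Each lane is a list of x coordinates, with None where no point exists.
    Ground-truth lanes without a corresponding prediction count as all missed.
    """
    n_pred, n_gt = 0, 0
    for i, gt in enumerate(gt_lanes):
        pred = pred_lanes[i] if i < len(pred_lanes) else [None] * len(gt)
        for x_p, x_g in zip(pred, gt):
            if x_g is None:
                continue  # no real point at this row
            n_gt += 1
            if x_p is not None and abs(x_p - x_g) < pixel_threshold:
                n_pred += 1
    return n_pred / n_gt if n_gt else 0.0


# Hypothetical x coordinates at four fixed rows.
gt = [[100, 110, 120, None], [400, 390, 380, 370]]
pred = [[102, 135, 121, 130], [405, None, 379, 371]]
acc = point_accuracy(pred, gt)
print(round(acc, 4))  # 0.7143 (5 of 7 real points matched)
```

In this example, one predicted point is 25 pixels off and one is missing, so 5 of the 7 real points are counted as correct.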
To verify the performance of the model in this paper, it is compared on the TuSimple test set with existing models (ResNet-18, ResNet-34 [27], Enet [28], LaneNet [29], SCNN [30], ENet-SAD [31], RESA-50 [32], SGLD-34 [33], Res34-VP [34]), and the results are shown in Table 3. As can be seen from Table 3, the lane detection model proposed in this paper outperforms the compared lane detection models in accuracy, achieving the highest accuracy rate and the lowest FP value. Moreover, its number of parameters and amount of calculation are higher only than those of the lightweight network Enet, and its inference speed is second only to Res18-Seg. The model in this paper can detect lane lines quickly and accurately with few computing resources, achieves a balance between accuracy and speed, and meets the accuracy and real-time requirements of lane line detection. Therefore, on the whole, the lane detection model in this paper is superior to the other lane detection models in terms of comprehensive performance.
To verify the influence of the adaptive upsampling module and the feature enhancement module on the overall performance of the lane line detection model, an ablation experiment was carried out to compare the accuracy before and after adding each module. Table 4 records the results. It can be seen from Table 4 that adding the adaptive upsampling module and the feature enhancement module improves the accuracy of the model by 0.2% and 0.82%, respectively. With both modules combined, the accuracy of the model increases to 96.7%, which is 0.89% higher than the original, indicating that the two modules effectively improve the performance of the lane line detection model.

Comparison of Loss Function Curves
To assess the rationality, quality, and performance of the model in this paper, it is compared with the first nine lane line detection algorithms in Table 3. Figure 12 shows the loss value curves during model training. The maximum number of iterations set in the experimental environment is 300, and the maximum training count is 80,000. Converted to generations, training runs for 89 generations, numbered 0 to 88. The initial learning rate of the SGD optimizer is 0.02. Once the nine comparison algorithms have converged to a certain stage, the convergence speed of their loss functions decreases significantly because the feature extraction ability of the models declines, and their loss curves then oscillate strongly for the rest of training. The loss curves of ENet-SAD, Res34-VP, RESA-50, SGLD-34, and Res18-Seg converge quickly in the early stage and slow down in the late stage. In contrast, the model in this paper shows a steady downward trend, the smallest oscillation amplitude, and the best convergence; even with a small number of parameters, it achieves a stable training process.
Figure 13 shows the loss function verification curves generated by cross-validating each lane line detection model on the verification dataset during the training phase. ENet-SAD and RESA-50 exhibit overfitting: their loss curves first decrease and then increase. The other eight lane line detection models do not overfit during training, although the loss functions of some networks, such as Res34-VP, oscillate to a certain degree while converging. Compared with the remaining seven lane line detection models, the model in this paper has the best verification convergence.
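The two qualitative criteria used above, a verification loss that first decreases and then increases (overfitting) and the oscillation amplitude of a loss curve, can be made measurable. The following is a minimal Python sketch under stated assumptions: the function names `overfit_onset` and `oscillation`, the parameters `patience` and `rel_tol`, and the synthetic curves are hypothetical illustrations and are not taken from the experiments in this paper.

```python
from typing import List


def overfit_onset(val_loss: List[float], patience: int = 3,
                  rel_tol: float = 0.05) -> int:
    """Return the index of the validation-loss minimum if the last
    `patience` values all lie clearly above it, otherwise -1."""
    best = min(range(len(val_loss)), key=lambda i: val_loss[i])
    tail = val_loss[best + 1:]
    if len(tail) < patience:
        return -1  # the minimum is too recent to call it overfitting
    threshold = val_loss[best] * (1.0 + rel_tol)
    if all(v > threshold for v in tail[-patience:]):
        return best
    return -1


def oscillation(loss: List[float]) -> float:
    """Mean absolute second difference of the curve: close to zero for
    a straight descent, larger for a zigzagging curve."""
    if len(loss) < 3:
        return 0.0
    second_diffs = [loss[i + 1] - 2.0 * loss[i] + loss[i - 1]
                    for i in range(1, len(loss) - 1)]
    return sum(abs(d) for d in second_diffs) / len(second_diffs)


# Synthetic curves for illustration only (not measured values).
steady = [1.0, 0.8, 0.6, 0.4, 0.2]
zigzag = [1.0, 0.6, 0.9, 0.5, 0.8]
rebound = [1.0, 0.7, 0.5, 0.4, 0.5, 0.6, 0.7]

print("oscillation (steady):", oscillation(steady))
print("oscillation (zigzag):", oscillation(zigzag))
print("overfit onset (rebound):", overfit_onset(rebound))
```

Applied to per-generation loss values, a lower `oscillation` score corresponds to the smoother curves described above, and a non-negative result from `overfit_onset` marks the generation after which the verification loss starts to rise.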
For the ablation experiments in Table 4, the performance before and after adding the adaptive upsampling module and the feature enhancement module is compared with that of the model in this paper. Figure 14 shows the loss value curves during model training; the experimental environment settings are the same as above. For the baseline, once the loss function has converged to a certain stage, its convergence speed decreases significantly because the feature extraction ability of the model declines. After the convergence speed decreases, the baseline, the network with the adaptive upsampling module, the network with the feature enhancement module, and the network fusing the feature enhancement module and the adaptive upsampling module all reach a state of convergence. However, the model in this paper shows a stable downward trend with the smallest oscillation amplitude, so it converges best.
Figure 15 shows the loss function verification curves generated by cross-validating the above four detection models on the verification dataset during the training phase. The model with the fused feature enhancement module shows a significant increase in convergence speed in the later stage, and its convergence outperforms that of the baseline and of the network with the adaptive upsampling module. However, the model in this paper has advantages over the first three networks, so it has the best verification convergence.

Comparison of Lane Line Detection Effects
To verify the effectiveness of the feature enhancement module and the adaptive upsampling module, the feature maps before and after the feature enhancement module and the final detection results are visualized, as shown in Figure 16. Figure 16a is the feature map before the encoder processing, Figure 16b is the feature map after processing by the feature enhancement module, and Figure 16c is the input image. Figure 16b shows that, whereas the features extracted by the encoder are relatively scattered local features, the feature enhancement module captures the complete features of the lane lines, so the ability to perceive lane lines is significantly enhanced. Figure 16 also shows the output of the adaptive upsampling module mapped onto the original image, from which it can be seen that the model accurately detects the lane lines in the input image.
To verify the performance of the lane line detection model in this paper, it is compared with the remaining nine lane line detection models in Table 3 and with the first three networks in Table 4, in which different modules are fused. Four cases from the test set are selected for instance segmentation analysis: road shadow, road blur, road occlusion, and slender, sparse lane lines, as shown in Figure 17. Given the input pictures and fixed labels, the segmented solid lines produced by the four lane line detection models ENet-SAD, Res34-VP, RESA-50, and SGLD-34 have defect losses in all four scenarios. For the three models Res18-Seg, Res34-Seg, and ENet, the segmented solid lines have defect losses in three scenarios. For LaneNet and SCNN, the segmented solid lines have defect losses in two scenarios. For the three network models from the ablation experiments, the segmented solid lines have defect losses in three scenarios, two scenarios, and one scenario, respectively. In contrast to these 12 models, the model in this paper shows no defect loss when performing instance segmentation on the four scenarios, and its detection effect is the best. Therefore, on the whole, the lane detection model in this paper is superior to the other lane detection models in terms of comprehensive performance.

Lane Line Detection and Mobile Terminal Deployment in Different Scenarios
To further verify the effect of the lane line detection model in this paper, road video is collected by the front car camera in different scenarios, such as normal roads, road congestion at night, road blocking, and night tunnels. According to the instance segmentation results on the TuSimple test set, the network closest in effect to the model in this paper is the one with the fused feature enhancement module. Because there are many network models available for comparison, and to reduce repeated experiments, only the lane line detection model with the fused feature enhancement module and the lane line detection model in this paper are compared and analyzed in complex traffic scenarios, as shown in Figure 18. From Figure 18d,f,h, it can be seen that the lane line detection model in this paper can detect lane lines accurately.
To further test the performance of the lane detection model in this paper, it is deployed on the mobile terminal for verification. Table 3 shows that the parameter quantity of the lane line model in this paper is 9.57 M, which is very low. In addition, since RepVgg-A0 is a lightweight network, its different branch structures are fused during inference, which compresses the parameters of the model. Therefore, the lane line detection model can be deployed directly to the embedded platform Jetson Nano, and the TensorRT framework can be used for half-precision acceleration so that its detection speed meets the requirements of real-time detection. Based on the complex traffic scenes in Figure 18, the lane line detection model in this paper is deployed to the embedded platform Jetson Nano, and the results are shown in Figure 19.
From Figure 19b,d,f,h, it can be seen that, while lane lines are detected accurately in complex scenarios, the real-time detection speed on the Jetson Nano platform reaches above 27 fps. Although there is still a certain gap relative to the real-time detection speed under the Windows system, this meets the real-time detection speed requirements for deployment on the mobile terminal. Therefore, the lane line detection model in this paper performs well when deployed on the mobile terminal.
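The branch fusion mentioned above can be illustrated with a small numerical check. In the standard RepVGG design, a training-time block sums a 3x3 convolution, a 1x1 convolution, and an identity branch, and these are merged into a single 3x3 convolution for inference. The following Python sketch is a deliberately simplified assumption: it uses a single channel, omits batch normalization folding, and does not reproduce the improved RepVgg-A0 structure of this paper. The function names `conv2d_same`, `multi_branch`, and `fuse_branches` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(x: Matrix, k: Matrix) -> Matrix:
    """Single-channel 2D cross-correlation with zero padding, so the
    output has the same size as the input. Kernel side must be odd."""
    h, w = len(x), len(x[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out


def multi_branch(x: Matrix, k3: Matrix, k1: float) -> Matrix:
    """Training-time form: 3x3 branch + 1x1 branch + identity branch."""
    y3 = conv2d_same(x, k3)
    return [[y3[i][j] + k1 * x[i][j] + x[i][j]
             for j in range(len(x[0]))] for i in range(len(x))]


def fuse_branches(k3: Matrix, k1: float) -> Matrix:
    """Inference-time form: fold the 1x1 weight and the identity into
    the centre tap of a copy of the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Arbitrary illustrative values (not weights from the paper's model).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = 0.25

y_train = multi_branch(x, k3, k1)
y_infer = conv2d_same(x, fuse_branches(k3, k1))
max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_infer)
               for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

Because convolution is linear, the three branches collapse into one kernel without changing the output, which is why the fused network needs only a single convolution per block at inference time.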

Conclusions
To address the low detection accuracy and poor real-time detection speed of existing lane line detection algorithms in complex traffic scenes, this paper proposes a lane line detection algorithm based on instance segmentation. The design mainly includes the following: the RepVgg-A0 network structure is optimized to expand the receptive field of the network; a multi-size asymmetric shuffled convolution model is proposed to enhance the ability to extract sparse and slender lane lines; an adaptive upsampling model is proposed, which allows the network to select the weight of the two upsampling methods at each position; a lane line prediction branch is added to output the lane line confidence; and the lane line detection algorithm is deployed to the embedded platform Jetson Nano, using the TensorRT framework for half-precision acceleration. The experimental results show that the lane line detection algorithm in this paper has an Acc value of 96.7% on the expanded TuSimple dataset and a real-time detection speed of 77.5 fps. The model is successfully deployed on the embedded platform Jetson Nano, where it achieves a real-time detection speed of 27 fps, making it suitable for mobile terminal deployment. Therefore, after deployment on the mobile terminal, the lane line detection algorithm in this paper is well suited to current self-driving cars and can improve the accuracy and safety of the perception part of automated driving.
Due to the limitations of the experimental conditions, there is a gap between the real-time detection speed of the lane line algorithm deployed on the mobile terminal and the real-time detection speed under the Windows system. Therefore, the next step is to consider further compression of the model parameters, so that the real-time detection speed on the mobile terminal can be further improved without reducing accuracy.