LaneFormer: Real-Time Lane Extraction and Detection via Transformer

Abstract: In intelligent driving, lane line detection is a basic but challenging task, especially under complex road conditions. Current detection algorithms based on convolutional neural networks perform well in simple, well-lit scenes where the lane lines are clean and unobstructed, but they degrade in complex scenes where the markings are damaged, occluded, or poorly lit. In this article, we address these limitations and propose a new network: LaneFormer. We use an end-to-end network that down-samples and up-samples three times each and fuses the results in their respective channels to extract the slender lane line structure. At the same time, a correction module is designed to adjust the dimensions of the extracted features using an MLP, judging through a loss function whether the features have been fully extracted. Finally, we send the features into a transformer network, detect the lane line points through the attention mechanism, and design a road and camera model to fit the identified lane line feature points. Our proposed method has been validated on the TuSimple benchmark, achieving state-of-the-art accuracy with the lightest model and the fastest speed.


Introduction
A key technology in autonomous driving systems is camera-based lane line detection. However, lane detection faces the challenge of complex scenes. First of all, the lane line is a slender structure with few appearance clues, so it is difficult for current models to extract all of its features accurately. Secondly, under extreme weather conditions (rain, dim light, and snow) or when lane lines are worn or obstructed, detection of their shape and characteristics easily fails.
Lane line detection algorithms are roughly divided into two categories: traditional algorithms and algorithms based on neural networks. Shen Y et al. [1] proposed a lane detection and recognition method based on dynamic region-of-interest (ROI) selection and the firefly algorithm, determining the height and width of the ROI from the vanishing point and the lane lines. H. Jung et al. [2] proposed a lane line detection scheme adapted to systems with low computing power. Traditional methods [3,4] use the Hough transform, Canny edge detection, and Kalman filtering to extract lane line features and then fit the lane line with a B-spline. Niu et al. [5] used the Hough transform to extract lane lines and DBSCAN (density-based spatial clustering of applications with noise) for clustering, but their model cannot fit lane scenes with large curvature well. P. Smuda et al. [6] used a particle filter to fuse information from a digital map system and proposed a new image-based road detection feature, but this method suffers from error accumulation. Yue Wang et al. [7] transformed the problem of lane detection into the problem of determining a control point set by maximum likelihood estimation. W. Yue et al. [8] used a cubic B-spline curve to fit the center line; their method assumed that the two sides of the lane are parallel, and it did not perform well in unstructured road scenarios. Zheng F et al. and K. Zhao et al. [9,10] used the Catmull-Rom spline in combination with extended Kalman filter tracking to realize lane line detection; their models can identify different numbers of lane lines. The above methods require manual adjustment of parameters for different scenarios, which is inefficient and gives poor results in complex scenes.
The currently prevalent methods use neural networks to detect lane lines. J. Dai et al. [11] proposed a model composed of three networks that distinguish instances, estimate masks, and classify objects. Gopalan et al. [12] used pixel-hierarchy features to model contextual information and used particle filters to track lane markings, which detects worn and occluded lane lines well. Qian, Y. et al. [13] improved the model's overall performance by jointly training on lane lines and drivable areas and achieved a good improvement. Ref. [14] used a Self Attention Distillation (SAD) design that lets the network refine its attention layer by layer from top to bottom, learning features and identifying lane lines by encoding rich context; this algorithm combines global and local features well. Xinlong Wang et al. [15] decoupled the mask branch into a mask kernel branch and a feature branch; this strategy lets the convolution kernel and the feature be learned separately. Lee et al. [16] proposed a unified end-to-end trainable multi-task network that jointly handles vanishing-point-guided lane detection, to some extent addressing lanes in rainy and low-light conditions. K. He et al. [17] added a branch that predicts the object mask in parallel with the existing bounding-box recognition branch. Zhang et al. [18] proposed a corrugated lane line detection network, which learns lane features through gradient maps. Fausto Milletari et al. [19] proposed a 3D image segmentation method based on a fully convolutional neural network. Davy Neven et al. [20] segmented lanes by instance and fitted lane lines on perspective-transformed images. M. Bertozzi et al. [21] built a stereo-vision-based detection model to detect lane positions in structured environments. Pan et al. [22] proposed the Spatial CNN (SCNN) network with layer-by-layer convolution in feature maps, enabling messages to pass between pixels across rows and columns. Haris M et al. [23] proposed a model that learns to decode the lane structure and iteratively draws any number of lanes without the computational and time complexity of a recurrent neural network. The corrugated lane line detection network proposed by Zhang et al. [24] used fast connections and gradient maps to learn lane line features effectively, which can handle challenging scenarios such as occluded lane lines. Qin et al. [25] regarded lane line detection as a row-based selection problem over global features; selecting feature points row by row greatly reduces the computational cost. Ren et al. [26] introduced a region proposal network (RPN) that shares full-image convolutional features with the detection network. Nicolai Wojke et al. [27] integrated appearance information to improve the performance of SORT (Simple Online and Realtime Tracking); this model can track objects through longer occlusions, effectively reducing the number of identity switches. Linjie Yang et al. [28] used a single forward pass to adapt the segmentation model to the appearance of a specific object, which greatly reduces the computational complexity. Zhang et al. [29] established a multi-task learning framework that segments the lane area and detects the lane boundary simultaneously while considering two geometric constraints. Convolutional neural network-based methods are by now quite mature, and large further improvements are hard to achieve. Philion J et al. [30] used a joint fully convolutional network and unsupervised training to detect lane lines.
In the past few years, transformer networks based on the attention mechanism have developed rapidly in lane line detection [31-33]. Liu, R., Yuan et al. [31] used an attention mechanism jointly with a CNN to extract features and predict key points of lane lines, improving accuracy. Chen, L. et al. [32] used an attention mechanism to map the extracted lane line points into 3D space for fitting, which improves accuracy but has an unacceptable computational overhead. Qiu, Q et al. [33] improved accuracy by adding prior knowledge to the attention mechanism, which is effective but introduces too many hyperparameters, making the already hard-to-train model even harder to converge.
In this work, we propose a network based on an attention mechanism called LaneFormer; the main contributions of this work are as follows: (1) We adopt multiple down-sampling and up-sampling stages with fusion to improve the ability to extract information on slender lane lines. (2) We use a correction function to monitor the network's ability to extract features; when the correction function stops decreasing, the model is considered to have extracted sufficient features. (3) We use the attention module to detect lane lines and then fit the detected feature points with a lane line model; finally, the fitted lanes are projected onto the original image.
This paper is organized as follows. Section 2 elaborates on our lane line detection algorithm. Section 3 describes the experimental results, compares them with current mainstream algorithms, and presents ablation experiments. Section 4 concludes our work.

Overall Architecture
The architecture shown in Figure 1 consists of a backbone for feature extraction. We use three down-sampling stages, each followed by a shared convolutional neural network δ for further feature processing; each up-sampling layer is stitched with the features of the previous layer, and the final output is the feature F. Then, we adopt the feature correction module to check that the network has extracted enough features. We design an MLP to get F′; meanwhile, we apply the same MLP to F′ to calculate F″. We measure the "difference" between F′ and F″ with a loss we name L_corr; if the loss no longer decreases, sufficient features are considered to have been extracted. The feature F′ is then sent to the attention mechanism. The features are converted to a sequence as part of the value; we design a position-coding sequence mixed with the feature as the key, and the query is composed of camera and road parameters. Finally, we use the lane fitting model to fit the points of the predicted lane lines.


Backbone
The backbone is based on the ResNet network. We use 32 parallel groups, and each group performs three convolutions: the first convolution is 4 × 1 × 1 × 256, the second is 4 × 3 × 3 × 32, and the third is 256 × 1 × 1 × 4, and the output has 256 channels. After each 3 × 3 convolution, we replace the activation function with the Mish function to prevent the vanishing or exploding gradients that occur when the data is too large or too small. We design a down-sample and up-sample mechanism to further enhance the model's ability to represent obstructed and discontinuous lane lines. We divide the feature obtained from the backbone into three parallel channels for further processing. The first channel uses the shared extractor δ to extract features from the current feature. The second channel first down-samples and then uses the shared extractor δ for feature extraction. The third channel down-samples the current feature again, uses the same CNN for feature extraction, and then up-samples, concatenating with the feature of the second channel. The obtained feature is up-sampled again and spliced with the first channel, and the result is used as the final output feature F of the backbone.
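To make the multi-scale fusion concrete, the following is a minimal sketch of the three-channel down/up-sampling scheme described above. The module name, channel width, pooling operator, and bilinear up-sampling are our assumptions for illustration; the paper's shared extractor δ is stood in for by a small convolutional block.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    # Hypothetical sketch of the three-channel down/up-sampling fusion;
    # `delta` plays the role of the shared extractor from the text.
    def __init__(self, channels=256):
        super().__init__()
        self.delta = nn.Sequential(  # shared CNN applied at every scale
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Mish(),
        )
        self.down = nn.MaxPool2d(2)  # assumed down-sampling operator

    def forward(self, x):
        c1 = self.delta(x)                        # channel 1: full resolution
        c2 = self.delta(self.down(x))             # channel 2: 1/2 resolution
        c3 = self.delta(self.down(self.down(x)))  # channel 3: 1/4 resolution
        c3 = F.interpolate(c3, size=c2.shape[-2:], mode="bilinear",
                           align_corners=False)   # up-sample to channel 2
        c23 = torch.cat([c2, c3], dim=1)          # stitch channels 2 and 3
        c23 = F.interpolate(c23, size=c1.shape[-2:], mode="bilinear",
                            align_corners=False)  # up-sample again
        return torch.cat([c1, c23], dim=1)        # spliced output feature F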

Feature Correction
Inspired by [34], we designed the feature correction network shown in Figure 2. F is the output of the previous module; we use the shared extractor δ to extract features further and adopt a small convolution kernel for fine-tuning, obtaining the feature F′ used as the value vector in the attention mechanism. To ensure that the features are extracted as fully as possible, we apply an MLP to F′ to obtain F″ and then measure the "difference" between F′ and F″; when the "difference" no longer changes (increases or decreases), the F′ feature is considered fully extracted. We use the L1 loss as the objective function to measure the "difference" between F′ and F″:

L_corr = ||F′ − F″||_1.
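As a rough illustration, a feature-correction step of this kind could look like the sketch below; the two-layer MLP shape and the mean reduction in L_corr are assumptions, since the text only specifies the L1 objective between F′ and F″.

import torch
import torch.nn as nn

class FeatureCorrection(nn.Module):
    # Hypothetical sketch: a two-layer MLP maps F' to F''; the L1 distance
    # between them is monitored as the correction loss L_corr.
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, f_prime):
        # f_prime: (batch, tokens, dim) feature sequence
        f_double_prime = self.mlp(f_prime)
        l_corr = torch.mean(torch.abs(f_prime - f_double_prime))  # L1 loss
        return f_double_prime, l_corr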


Transformer Encoder
In order to capture the contextual relationship, we use the position code P. The value of F′ is ϕ, and the position code and value of F′ are concatenated as the value V of the Transformer module: V = (P, ϕ). In Figure 3a, we denote P and F′ by S_p and S_ϕ, respectively.

The structure of the encoding module is shown in Figure 3a. Each encoder module contains multi-head attention layers and an Add & Norm layer; after each attention computation, the features are concatenated and normalized. The concatenation ensures that the subsequent dimensions match so that the next step can be performed, and the normalization significantly improves target detection and speeds up the operation. Next, the feature sequence is sent to the feed-forward layer to change the input tensor's dimension, and after normalization in the next layer it is sent to the decoder.
We use the scoring function α to measure the relationship between the query and the key. To improve computational efficiency, we use the scaled dot-product attention scoring function. The query and key sequences are modified to the same length c; assuming the elements of the query and key are independent and identically distributed random variables, the encoder self-attention model performs scaled dot-product attention by

α(q, k) = q^T k / √c,

where q ∈ R^{w×h} and k ∈ R^{h×w} represent the sequences of the query and key, respectively, α denotes the scoring function, q^T is the transpose of q, and c is the length of the query and key; the formula is normalized by dividing by √c. During training, we use mini-batches to increase the speed of operations; the scaled dot-product attention is

A = softmax(α(q, k)), output = A V,

where V ∈ R^{w×2h} represents the value sequences obtained through a linear transformation of each input row, and A stands for the attention map, which measures non-local interactions to capture slender structures in the global context. Considering that the lane line is a slender structure and this model only detects lane lines, multi-head attention is used in the encoding and decoding blocks for feature processing, which reduces model parameters and computational overhead and keeps the model lightweight.
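For reference, a minimal sketch of scaled dot-product attention as described above; the tensor shapes are assumptions for illustration, not the paper's exact implementation.

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, seq, c); v: (batch, seq, d_v)
    c = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(c)  # scoring function alpha
    attn = torch.softmax(scores, dim=-1)             # attention map A
    return attn @ v                                  # weighted values A V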

Transformer Decoder
The transformer decoder also consists of multiple identical layers. Residual connections and layer normalization are used in the layers, as shown in Figure 3a. S_q is an N × C matrix, which is used to learn the characteristics of lane lines during training. S_p and the initialized S_q are sent to the masked attention layer to detect the current feature. After normalization, the attention formula is computed; the processing then proceeds as in the encoding block, and the result is finally output through the fully connected layer. In the decoder attention, the query, key, and value all come from the output of the previous decoder layer. The key and value derived from an absolute position can only attend to positions before that position. This masked attention preserves the auto-regressive property, ensuring that each prediction depends only on the already generated output features.
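The masking described here can be realized with an upper-triangular mask that blocks attention to future positions; the sketch below is one common way to build and apply such a mask, and is our assumption rather than the paper's code.

import torch

def causal_mask(seq_len):
    # True above the diagonal marks future positions to be masked out,
    # preserving the auto-regressive property described above.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=1)

# Usage: set masked scores to -inf before the softmax.
scores = torch.randn(1, 5, 5)
scores = scores.masked_fill(causal_mask(5), float("-inf"))
attn = torch.softmax(scores, dim=-1)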

FFNs
In addition to attention sub-layers, each layer in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. The network contains two linear transformations that map the input S_d to a high-dimensional space, (2, 64) × (64, 1024) = (2, 1024), followed by a nonlinear function, and finally back to the original dimension. The softmax function outputs the label (lane or background); S_d is converted into an N × 4 matrix, where 4 is the number of prediction parameters for curve fitting, and the average is then taken over each dimension. The fully connected layer formula is

FFN(x) = w_2 σ(w_1 x + b_1) + b_2,

where w denotes the weight matrices of the fully connected layers, b the bias terms, and σ the nonlinear function.
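A minimal sketch of such a position-wise feed-forward network, matching the 64 → 1024 → 64 shape mentioned above; the choice of ReLU as the nonlinearity is our assumption.

import torch.nn as nn

# Position-wise FFN: expand to a high-dimensional space, apply a
# nonlinearity, then project back to the original dimension.
ffn = nn.Sequential(
    nn.Linear(64, 1024),  # first linear transformation
    nn.ReLU(),            # assumed nonlinear function
    nn.Linear(1024, 64),  # back to the original dimension
)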

Lane Line Fitting
Inspired by the geometric topology in [35], we designed a new road structure. We fit the detected points into a smooth curve; the prior model of the lane shape is defined as a polynomial, and we use the least squares method to perform cubic curve fitting on the lane line detection points. A single lane line on flat ground is

X = k Z³ + m Z² + n Z + b,

where (X, Z) is a coordinate point on the flat road, k, m, n are constant coefficients, and b is the compensation value. When the optical axis of the camera is parallel to the ground plane, the corresponding points on the image plane can be expressed as

u = k′/v² + m′/v + n′ + b′ v,

where k′, m′, n′, b′ absorb the camera intrinsic and extrinsic parameters, and (u, v) is a pixel coordinate in the image.
If the angle between the optical axis of the camera and the ground plane is θ, the curve keeps the same rational form in the pitch-corrected image coordinate, with v replaced by v′ cos θ − f sin θ, where f is the focal length of the camera and (u′, v′) is the corresponding geometrically transformed pixel coordinate. First, we take the current video frame and the previous three frames and perform a least-squares cubic polynomial fit on the coordinate points extracted by the model, keeping the fitted coefficient matrices and the parameters used in the calculation. Then the lane lines of frame t and the previous three frames are combined and the results averaged; the final result is the average of the fitting results of the current frame and the previous three frames.
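A small sketch of the frame-averaged least-squares cubic fit described above, using numpy; the function name, data layout, and coefficient averaging scheme are our assumptions.

import numpy as np

def fit_lane(points_per_frame):
    # points_per_frame: list of (u, v) arrays for the current frame and
    # the previous three frames; fit a cubic u = a v^3 + b v^2 + c v + d
    # per frame, then average the coefficients over the four frames.
    coeffs = [np.polyfit(v, u, deg=3) for u, v in points_per_frame]
    return np.mean(coeffs, axis=0)

# Usage: evaluate the averaged cubic at given image rows v.
pts = (np.array([10.0, 20, 30, 40]), np.array([100.0, 200, 300, 400]))
avg = fit_lane([pts] * 4)
u_pred = np.polyval(avg, np.array([150.0, 250.0]))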

Fitting Loss
The loss of lane line detection is divided into two parts: the classification loss over the lane line type (lane line vs. background) and the regression loss of the lane line position relative to the anchor. The classification loss uses the Cross-Entropy loss [36], and the regression loss is the Smooth L1 loss. During training, we use distance to decide whether an anchor is positive or negative, and the number of positive and negative samples N_{p&n} normalizes the multi-task loss, defined as

L = (1/N_{p&n}) Σ_i [ L_c(α_i, α̂_i) + λ L_r(β_i, β̂_i) ],

where L_c and L_r are the Cross-Entropy loss and Smooth L1 loss, respectively, α_i and α̂_i are the classification output and target values of the i-th point, and β_i and β̂_i are the regression output and target values of the i-th point. The regression loss is measured by the distance between the estimated and true values in common coordinates. If an anchor is considered negative, its L_r is set to 0. The factor λ balances the loss components; we set the hyperparameter λ = 2.5.
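For illustration, a minimal version of this two-part loss in PyTorch; the masking of negative anchors and the normalization follow the description above, while the tensor shapes and function name are assumptions.

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, reg_preds, reg_targets,
                   pos_mask, lam=2.5):
    # Classification: cross-entropy over lane / background labels.
    l_c = F.cross_entropy(cls_logits, cls_targets, reduction="sum")
    # Regression: Smooth L1, counted only for positive anchors
    # (L_r is 0 for negative anchors).
    l_r = F.smooth_l1_loss(reg_preds[pos_mask], reg_targets[pos_mask],
                           reduction="sum")
    n = cls_targets.numel()  # number of positive and negative samples
    return (l_c + lam * l_r) / n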

Datasets
We use the TuSimple dataset to test our method. The TuSimple dataset contains 6408 annotated pictures, which are high-definition images (720 × 1280) extracted from video recorded by a front-view camera. It includes day and night photos of different road conditions and weather on American highways. The dataset is divided into 2704 test images, 3521 training images, and 345 validation images. We also use CULane to evaluate the adaptive capability of our method to new scenes. CULane is a large-scale and challenging dataset for academic research on lane detection. It was collected by cameras installed on six vehicles driven in Beijing, gathering more than 55 h of video from which 133,235 frames were extracted. The dataset is divided into 88,880 training images, 9675 validation images, and 34,680 test images. The test set is divided into a normal category and 8 challenging categories.

Evaluation Indicators
To show the performance of the model and compare it with previous methods, we follow TuSimple's accuracy metric. To judge whether a lane marking is successfully detected, we view the lane markings as lines with a width of 25 pixels and calculate the intersection-over-union (IoU) between the ground truth and the prediction. Predictions whose IoU is larger than a threshold are counted as true positives (TP). The prediction accuracy is computed as

accuracy = Σ_clip C_clip / Σ_clip S_clip,

where C_clip is the number of correctly predicted points in the last frame of a video clip, and S_clip is the number of ground truth points in the last frame of the clip. A predicted point is correct if its difference from the ground truth is smaller than a threshold. On the CULane dataset, we use the F1-measure as the evaluation indicator:

F1 = 2 × Precision × Recall / (Precision + Recall), Precision = TP/(TP + FP), Recall = TP/(TP + FN).

Here, TP is the number of correctly detected positive examples, FP is the number of negatives incorrectly classified as positive, FN is the number of positives incorrectly classified as negative, and TN is the number of correctly classified negatives.
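A sketch of how these two metrics can be computed once per-clip point counts and detection counts are available; the function names and data layout are illustrative.

def tusimple_accuracy(correct_points, gt_points):
    # correct_points / gt_points: per-clip counts C_clip and S_clip.
    return sum(correct_points) / sum(gt_points)

def f1_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)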

Experimental Parameters
For the TuSimple dataset, we set the input resolution to 360 × 640, the learning rate to 5e−4, the batch size to 20, the number of prediction curves N to 6, and the number of training iterations to 400k. For data augmentation, we apply scaling, rotation, image channel perturbation, and cropping to the raw data. For the CULane dataset, the learning rate is set to 1e−5, and the other parameters remain unchanged. Except for the ablation experiments, the hyperparameter settings of all experiments are the same. All results are obtained on a platform with dual 2080Ti graphics cards.
To illustrate the performance of our model, we compare it with excellent lane line detection models: VPGNet [16], LaneNet [20], Ultra-Fast LaneNet [25], SAD [14], FastDraw [30], SCNN [22], and LSTM [31]. We test on the TuSimple data, comparing frame rate, MACs, Para, PP, ACC, FP, and FN, and place the comparison results in Table 1. Table 1 shows the performance comparison between our method and current excellent lane line detection methods; the evaluation indices are based on the TuSimple dataset. Compared with the LSTM network, which has the same transformer structure, our method is slightly slower but substantially more accurate, while its parameter count is lower than LSTM's. Compared with CNN-based lane line detection frameworks (VPGNet, LaneNet, SAD, FastDraw, SCNN), our speed is 5-50 frames faster than theirs, and our accuracy is equal to or even slightly higher than the above methods.


Comparison to State-of-the-Art Methods
Figure 4 shows the visualization results of our lane line detection. We compare against the LSTM model based on the transformer network and the LaneNet model based on a convolutional neural network. It can be seen that our model fits farther lane lines in picture (a). In the second picture (b), a multi-lane curve scene, our method fits more accurately and there is no drift at the far end. Our method also fits more accurately in the third picture (c), a large-curvature curve scene. The ResNet32 network captures enough lane line features, and the attention mechanism supplements the lane line's slender structure and context information, so our model performs well on bends and in scenes with strong light.

Position Encoding
The attention mechanism abandons sequential operations in favor of parallel computing. To use the order information of the sequence, we inject absolute or relative position information by adding position encoding to the input representation. The position code can be learned or directly fixed. To keep the model light and fast, we use a fixed position encoding based on sine and cosine functions, given by Equation (14):

P_{i,2j} = sin(i / 10000^{2j/d}), P_{i,2j+1} = cos(i / 10000^{2j/d}),

where i and j index the rows and columns of pixels in the image coordinate system, and d is the dimension of the position embedding matrix, which is encoded after normalization.
For comparison, we run the model with and without position coding; the results are shown in Table 2a. Without position coding, the AP (Average Precision) of the model reaches 32.4%, and with position coding the AP reaches 35.5%. The second experiment (with position coding) is about 3 points higher than the first (without position coding), which verifies the necessity of position coding in the model input. The reason is that position information establishes a relationship between input and output in the process of sequence supervision.
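A compact sketch of this fixed sinusoidal encoding; the 10000 base follows the standard transformer formulation, which we assume the formula above intends, and d is taken to be even.

import numpy as np

def sinusoidal_position_encoding(n_positions, d):
    # P[i, 2j] = sin(i / 10000^(2j/d)), P[i, 2j+1] = cos(i / 10000^(2j/d))
    pos = np.arange(n_positions)[:, None]
    j = np.arange(0, d, 2)[None, :]
    angle = pos / np.power(10000.0, j / d)
    enc = np.zeros((n_positions, d))
    enc[:, 0::2] = np.sin(angle)  # even dimensions
    enc[:, 1::2] = np.cos(angle)  # odd dimensions
    return enc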

Backbone Selection
The backbone used in our model is a modified ResNeXt50 network. We compare different backbones; Table 2b shows the model performance with ResNet50 and InceptionV3 as the backbone, respectively. With the other parameters kept constant, the AP value of ResNeXt50 is the highest at 35.3%, 5.3 percentage points higher than the second-best value. This indicates that the deeper ResNeXt network has a stronger ability to extract feature points; at the same time, its residual structure can also fit the data better.


Transformer Encoder Module
We verify model performance with different numbers of encoding modules on the TuSimple dataset. The heat maps are shown in Figure 6. The color depth of the heat map indicates the confidence of the features, and positions in the heat map correspond to positions in the original image. The confidence and position of the extracted feature points for different numbers of encoding modules (2, 3, 4, and 5) are displayed through the heat maps. It can be seen that when the number of encoding modules is 4, the feature classification and regression performance is best; therefore, we fix the number of encoding modules to 4.

Transformer Decoder Module
Our encoder number is set to 4. We verify the model's performance by changing the number of layers in the decoder module. It can be seen from Table 3 that performance improves as decoder layers are added, but once the number of layers exceeds four, performance drops; we attribute this degradation to over-fitting of the model.
We also use several different curves to fit the lane feature points; the results are listed in Table 4. We find that cubic curve fitting based on least squares works best for occlusions and bends, and our lane shape model works well for straight lines, too. This agrees with previous work's conclusion that lane line fitting generally uses cubic curves. The experiment also shows that quadratic curve fitting cannot fit all the points, while quartic curve fitting picks up noise points; both reduce accuracy.

Conclusions
In this paper, we have proposed an end-to-end lane line detection network: LaneFormer, a network based on an attention mechanism that can directly visualize lane lines. We use the modified ResNet32 network as the backbone to extract shallow features and adopt multiple down-sampling and up-sampling stages with fusion to improve the ability to extract information on slender lane lines. When the extracted features no longer increase significantly, as monitored by the feature correction module, the features are sent to the attention module. The attention module enhances the contextual information of the extracted features, and the lane line model then fits the detected feature points. Our network fully captures contextual information during training and fits the detected lane points well. Our method achieves state-of-the-art detection performance in terms of parameter count and running time; moreover, our model is more stable and reliable. In future work, we will explore the encoding of location information to find a better encoding matrix that can improve the model's accuracy or speed. At the same time, we will study the reasoning ability of the model so that it can infer occluded lane lines from the small visible part of a curve on urban roads.

Figure 1.
Figure 1. The overall architecture. δ is a CNN network for further feature extraction, ⊗ is feature concatenation. F is the final feature, F′ is the result of conv processing F, and F″ is the result of conv processing F′. K, Q, V, P represent the key, query, value, and position encoding of the transformer network, respectively.


Figure 2 .
Figure 2. The shared extractor is δ, which further extracts and fine-tunes the features of F; F′ is the refined feature. The MLP consists of two layers, which adjust the dimension of the feature map of F′ so that it corresponds to the dimension of F″.


Figure 3 .
Figure 3. Transformer encoder and transformer decoder modules. S_ϕ, S_p, and S_q represent the sequences of feature, position, and query, respectively. The ⊗ represents concatenation of the sequences. (a) Encoder module; (b) decoder module.

Figure 4.
Figure 4. Pictures on the CULane dataset verifying the performance of our model. From top to bottom: the visualization results of LSTM, LaneNet, and our method. (a-c) are the model inference results for a tunnel exit, a curve, and a curve with large curvature, respectively.

Figure 5
Figure 5 shows visualizations of our model in different scenes from the TuSimple and CULane datasets. (a,b) are scenes from the CULane dataset; it can be seen that curves of different curvatures and vehicle-occlusion conditions are fitted well. Meanwhile, (c,d) are scenes from the TuSimple dataset; dashed lines and curves of different colors are also identified well.



Figure 5 .
Figure 5. Visualizations of our method in different scenarios on the CULane dataset (a,b) and the TuSimple dataset (c,d).


Figure 6 .
Figure 6. Heat maps of the encoding modules. N is the number of encoding modules (N = 2, 3, 4, 5). The encoder modules can capture contextual feature information and the slender structures of lane lines.

Table 1 .
Comparisons of accuracy (%) on the TuSimple testing set. The number of multiply-accumulate (MAC) operations is given in G. The number of parameters (Para) is given in M (millions). PP indicates whether post-processing is required.


Table 2 .
Model metrics comparison for position encoding and backbone contrast. AP is average precision (%), AR is average recall (%).

Table 3 .
Quantitative evaluation of different numbers of transformer decoder modules on the TuSimple validation set (%). The horizontal coordinate is the number of transformer decoder modules, the vertical coordinate is the number of transformer encoders, and the evaluation metric is mAP (mean Average Precision).

Table 4 .
Quantitative evaluation of different lane shape models on the TuSimple validation set (%).