Lane Mark Detection with Pre-Aligned Spatial-Temporal Attention

Lane mark detection plays an important role in autonomous driving under structured environments. Many deep-learning-based lane mark detection methods have been put forward in recent years. However, most current methods confine their solutions to a single image and do not exploit the inherently successive image input of driving scenes, which may lead to inferior performance in challenging scenarios such as occlusion, shadows, and lane mark degradation. To address this issue, we propose a novel lane mark detection network that takes pre-aligned multiple successive frames as inputs to produce more stable predictions. A Spatial-Temporal Attention Module (STAM) is designed in the network to adaptively aggregate the feature information of history frames into the current frame. Various structures of the STAM are also studied to determine the best-performing configuration. Experiments on the Tusimple and ApolloScape datasets show that our method effectively improves lane mark detection and achieves state-of-the-art performance.


Introduction
With the rapid development of autonomous driving technology, lane mark detection has made great progress in recent years. Accurate and robust lane mark detection is necessary to ensure the safety of autonomous navigation, as it provides reliable route guidance and proper positioning for the vehicle. However, lane mark detection under complex scenes and various lighting conditions still remains a challenge.
Traditional methods for lane mark detection usually involve several basic procedures, including image pre-processing, feature extraction, and detection by fitting [1][2][3]. They heavily rely on highly specialized and hand-crafted feature extraction [4][5][6]. Thanks to the emergence of deep neural networks and large-scale datasets, deep learning methods have significantly improved the performance of lane mark detection. Liu et al. [7] proposed a style-transfer-based data enhancement method, using Generative Adversarial Networks (GANs) to solve the problem of lane detection in low-light conditions. RESA [8] shifted sliced feature maps recurrently in vertical and horizontal directions to aggregate global information, which helps to infer lane marks with weak appearance coherence. To better infer lane mark positions under occlusion, LaneATT [9] utilized an effective anchor-based attention mechanism to aggregate global information. However, most of these methods focus on detecting lane marks in a single image. Under complex environments, the appearance of lane marks can be frequently degraded by severe stains, heavy shadows, or serious occlusion, which can result in incomplete or even incorrect predictions for these single-image-based methods. In practice, the image sequences acquired by the vehicle are continuous and there are large overlaps between adjacent frames; therefore, the positions of lane marks in neighboring frames are highly correlated. In other words, lane marks that cannot be precisely detected in the current frame can be inferred from the information of former frames. This motivates us to investigate lane mark detection with multiple frames as input and explore the inherent spatial-temporal information within the sequence.
In this work, a novel method using multiple frames for improving lane mark detection is proposed. To maximize the enhancement of the features of the current key frame, we first perform multi-frame pre-alignment. While the camera calibration in [10] establishes a one-to-one correspondence between the image plane and the ground, we project each history frame to the current key frame with the road areas aligned in the image plane. Moreover, to further aggregate spatial-temporal information, we propose an effective Spatial-Temporal Attention Module (STAM) and insert it into an encoder-decoder-based instance segmentation network. Taking multiple continuous images as inputs, the network extracts sequential features of all input frames with shared CNN encoders and then feeds them into the STAM. A two-branch decoder is adopted to reconstruct the aggregated information and predict lane marks of the current key frame. With richer information from continuous images, the proposed method is able to greatly improve lane mark predictions in challenging scenarios and achieve state-of-the-art performance.
The main contributions of this paper can be summarized as follows:
• We regard lane mark detection as a time-series problem and propose to detect lane marks from successive pre-aligned multiple images. The frames are pre-aligned according to the ground plane before being fed to the network. By exploring the spatial-temporal information hidden in the multiple frames, the negative influence of complex scenarios such as shadows, lane mark degradation, and vehicle occlusion can be largely mitigated;
• A novel Spatial-Temporal Attention Module (STAM) is proposed and embedded in the encoder-decoder backbone. The module enhances the features of the current frame by attentively aggregating spatial-temporal information from history frames. Various structures of the STAM and their performance are also studied;
• Our network is implemented end-to-end and evaluated on two large-scale datasets: Tusimple and ApolloScape. Comprehensive experiments and ablation studies verify that the proposed model is effective and achieves state-of-the-art performance.

Related Work
Lane mark detection has been intensively researched in recent years. These methods can be roughly classified into traditional and deep learning approaches.
Traditional methods. Before the advent of deep learning, conventional solutions for lane mark detection often depended on hand-crafted features such as edge, color, and texture to identify lane segments [4][5][6]. Then, the Hough transform [11] or curve fitting [12] is often adopted to eliminate outliers and form the final lane marks. Apart from geometric modeling, some methods formulate lane mark detection with energy minimization algorithms [13]. By defining unary/dual potentials and building an optimal association of multiple lane marks, a Conditional Random Field (CRF) can be used to detect lane marks. For lane mark detection in successive frames, the particle or Kalman filter is widely used [14][15][16]. The particle filter is able to track multiple lanes, while the Kalman filter helps to locate positions and estimate lane curvature with state vectors. However, the performance of the above methods is easily degraded by complex environments and illumination variance.
Deep-learning-based methods. In recent years, many deep-learning-based methods for lane mark detection have been proposed. According to the representation of lanes, the existing methods can be divided into four categories: segmentation-based [8,17-20], anchor-based [9,21,22], row-wise detection-based [23-25], and parametric regression methods [26,27]. Segmentation-based methods are the most popular and achieve impressive performance. SCNN [18] employed slice-wise convolution in a segmentation module, passing messages from different directions to capture spatial continuity. EL-GAN [19] and SAD [20] respectively adopted GAN and knowledge distillation to improve lane mark segmentation. Despite their advantages, most segmentation-based methods are limited to detecting a pre-defined number of lane marks. Anchor-based methods focus on specifying the lane mark shape by regressing position offsets relative to predefined anchors. PointLaneNet [21] used point anchors to directly obtain the coordinates of lane mark points. Line-CNN [22] put forward a novel Line Proposal Unit (LPU) in terms of discrete direction classification and relative coordinate regression. LaneATT [9] extracted anchor-based features and utilized an attention mechanism. However, a fixed anchor shape is inflexible for describing lane marks with high degrees of freedom. Row-wise detection methods predict the most probable location of lane marks row by row. FastDraw [23] introduced a learning-based approach to decode the lane mark structure without post-processing. UFSA [24] proposed a lightweight row-based selecting scheme over global image features, resulting in a high-speed algorithm. E2E-LMD [25] predicted lane mark vertexes in an end-to-end manner. Parametric regression methods directly output parametric representations of lane marks. PolyLaneNet [26] learned to regress the lane mark polynomial curve equation. LSTR [27] formulated the lane mark shape model based on road structures and camera pose, using a transformer to capture a richer context.
In contrast to the above single-frame-based methods, a few approaches consider lane mark detection as a time-series problem. Zou et al. [28] proposed a hybrid architecture that seamlessly integrates a CNN (Convolutional Neural Network) [29] and an RNN (Recurrent Neural Network) [30] to detect lane marks. Zhang et al. [31] added double Convolutional Gated Recurrent Units (ConvGRUs) into an encoder-decoder CNN. However, they only considered lane detection as a two-class segmentation problem and did not provide instance segmentation for each lane. Moreover, in complex scenes such as lane occlusion by dynamic vehicles, they are also prone to produce false positive predictions. Our method takes instance-level discrimination into account and performs multi-frame pre-alignment before feeding the frames into the network. Instead of using an RNN or any of its variants, we propose the STAM to aggregate the spatial-temporal information and better deal with challenging scenarios.

Proposed Methods
As detecting lane marks from individual images suffers in challenging situations such as heavy shadow, serious occlusion, and severe lane mark damage, we focus on lane mark detection under continuous driving scenes. Among consecutive images, lane marks in adjacent frames are inherently correlated. An overview of our proposed method is illustrated in Figure 1. The encoder-decoder network takes multiple pre-aligned consecutive frames as inputs and predicts lane marks on the current key frame F_t in an instance segmentation manner. Sequential encoded features are aggregated by the proposed Spatial-Temporal Attention Module (STAM), followed by a decoder that reconstructs the fused feature.

Figure 1. Overview of the proposed method. Multiple pre-aligned consecutive frames are firstly sent to the shared encoder. Then, the features of the current key frame F_t are enhanced by attentively aggregating spatial-temporal information from the history frames F_{t−i}. After that, the two-branch decoder produces a binary lane mask and an N-dimensional embedding per lane pixel. Finally, post-processing is applied to obtain the final predictions.

Multi-Frame Pre-Alignment
To adequately enhance the features of the current key frame and avoid introducing confusion among different images, alignment of the multiple frames is necessary. This section explains the procedure of multi-frame pre-alignment. The lane marks we are interested in all lie on the ground plane. Assuming the ground area ahead of the vehicle is locally planar, a 2D homographic transformation can be set up for the ground area between neighboring frames. We assume the image rows below the predefined vanishing line belong to the ground area and compute the homographic transformation by feature point matching. However, in practice the ground is often a weakly textured area, which means insufficient feature points can be extracted, as shown in Figure 2a. We solve this problem by extracting evenly distributed ORB (Oriented FAST and Rotated BRIEF) [32] feature points. Specifically, we divide the area into 30 × 30 grids and detect FAST (Features from Accelerated Segment Test) [33] corners with Non-Maximum Suppression (NMS). If insufficient corners are found in a grid, the detector threshold is adjusted adaptively. After a certain number of FAST corners are extracted, the corresponding rotated BRIEF (Binary Robust Independent Elementary Features) [34] descriptors are computed. Then, we employ a quadtree to manage the features, making them evenly distributed while meeting the quantity requirement. As shown in Figure 2, our method for feature point extraction works better than simply using the OpenCV library. After feature extraction, we conduct feature point matching for each pair of images. RANSAC (RANdom SAmple Consensus) [35] is performed to compute the homographic matrix between the previous frame and the current frame. Then we can warp the previous frames to the current frame, realizing the multi-frame pre-alignment. Visualization examples of the feature point matching and inter-frame warping procedures are presented in Figure 3, where we can observe that the lane marks of the two frames are exactly aligned with each other. Note that all the aligned images should be padded to the same resolution before being input to the network.
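To make the alignment procedure concrete, the following is a minimal OpenCV sketch of aligning one previous frame to the current frame. The vanishing-line row, feature budget, and RANSAC threshold are illustrative assumptions, and the adaptive-threshold grid detection with quadtree balancing described above is simplified to OpenCV's default ORB detector.

```python
import cv2
import numpy as np

def align_previous_to_current(prev_img, curr_img, vanish_row=360, n_features=2000):
    """Warp prev_img onto curr_img via a ground-plane homography (sketch)."""
    orb = cv2.ORB_create(nfeatures=n_features)

    # Restrict detection to the assumed ground area below the vanishing line,
    # since the planar-ground assumption only holds there.
    mask = np.zeros(prev_img.shape[:2], dtype=np.uint8)
    mask[vanish_row:, :] = 255

    kp1, des1 = orb.detectAndCompute(prev_img, mask)
    kp2, des2 = orb.detectAndCompute(curr_img, mask)

    # Hamming-distance brute-force matching suits binary BRIEF descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects matches inconsistent with a single ground-plane homography.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = curr_img.shape[:2]
    return cv2.warpPerspective(prev_img, H, (w, h))
```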

Instance Segmentation Network
For instance segmentation of lanes, an encoder-decoder architecture is employed, which uses a VGG16-based FCN [38] as the backbone. The encoder CNN extracts the sequential features of all input frames. The decoder CNN consists of a binary segmentation branch and a pixel embedding branch. The binary segmentation branch decides the class of background or lane mark, while the embedding branch further disentangles the segmented lane mark pixels into different lane instances. The binary segmentation branch is trained with the standard cross-entropy loss function, using bounded inverse class weighting [39] to handle the class (lane/background) imbalance.
The instance embedding branch is trained to assign a lane ID to each lane pixel so that the pixel embeddings belonging to the same lane are pulled closer, whereas those belonging to different lanes are pushed away. In this way, the pixel embeddings of the same lane will cluster together to generate a unique instance. The clustering loss function [40] for the instance embedding branch is

$L = \alpha L_{var} + \beta L_{dist} + \gamma L_{reg}$, (1)

where $\alpha$, $\beta$, and $\gamma$ are weighting coefficients, and the three loss terms are

$L_{var} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \left[ \lVert \mu_c - x_i \rVert - \delta_v \right]_+^2$,

$L_{dist} = \frac{1}{C(C-1)} \sum_{c_A=1}^{C} \sum_{c_B=1, c_B \neq c_A}^{C} \left[ \delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert \right]_+^2$,

$L_{reg} = \frac{1}{C} \sum_{c=1}^{C} \lVert \mu_c \rVert$.

In Equation (1), $C$ represents the number of lane mark clusters, $N_c$ denotes the number of elements in cluster $c$, $x_i$ is a pixel embedding, $\mu_c$ is the mean embedding of cluster $c$, $\delta_v$ and $\delta_d$ are thresholds, $\lVert \cdot \rVert$ indicates the $L_2$ distance, and $[x]_+ = \max(0, x)$. The variance term ($L_{var}$) applies a pull force on each pixel embedding towards the mean embedding of its cluster, which is only active when the embedding is farther than $\delta_v$ from the cluster center. The distance term ($L_{dist}$) serves to push the cluster centers away from each other; the push force is only effective when the distance between the centers is closer than $\delta_d$.
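For reference, a minimal TensorFlow sketch of this clustering loss for a single image follows; the tensor shapes and function name are our own, and the hinge margins mirror the $\delta_v$ and $\delta_d$ thresholds defined in Equation (1).

```python
import tensorflow as tf

def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=3.0,
                        alpha=1.0, beta=1.0, gamma=0.001):
    """Clustering loss of [40] for one image (sketch).

    embeddings: (P, D) float tensor, one D-dim embedding per lane pixel.
    labels:     (P,)  int tensor, ground-truth lane instance ID per pixel.
    """
    cluster_ids, idx = tf.unique(labels)
    C = tf.size(cluster_ids)
    mu = tf.math.unsorted_segment_mean(embeddings, idx, C)  # mean embedding per cluster

    # L_var: pull each embedding towards its cluster center, active beyond delta_v.
    dist_to_center = tf.norm(embeddings - tf.gather(mu, idx), axis=1)
    hinge = tf.nn.relu(dist_to_center - delta_v) ** 2
    l_var = tf.reduce_mean(tf.math.unsorted_segment_mean(hinge, idx, C))

    # L_dist: push cluster centers apart, active when closer than delta_d.
    diff = tf.expand_dims(mu, 0) - tf.expand_dims(mu, 1)     # (C, C, D)
    # Small epsilon avoids a NaN gradient of the norm at zero distance.
    center_dist = tf.sqrt(tf.reduce_sum(tf.square(diff), axis=2) + 1e-12)
    margin = tf.nn.relu(delta_d - center_dist) ** 2
    margin = tf.linalg.set_diag(margin, tf.zeros([C]))       # ignore self-pairs
    Cf = tf.cast(C, tf.float32)
    l_dist = tf.reduce_sum(margin) / tf.maximum(Cf * (Cf - 1.0), 1.0)

    # L_reg: keep cluster centers close to the origin.
    l_reg = tf.reduce_mean(tf.norm(mu, axis=1))

    return alpha * l_var + beta * l_dist + gamma * l_reg
```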

Spatial-Temporal Attention Module
To effectively fuse the encoded features from multiple frames, we propose a Spatial-Temporal Attention Module (STAM) and insert it between the encoder and decoder. The module extracts Channel Attention (CA) and Spatial Attention (SA) from previous frames and applies them to the current frame for feature aggregation. Depending on how the two attentions are connected and which frames they act on, the STAM can be constructed in three modes, i.e., parallel, serial, and mixed, as shown in Figure 4. We assume that the size of the input tensor is C × H × W, where C, H, and W are the number of elements along the channel, height, and width dimensions, respectively.

In parallel mode, CA and SA respectively take the feature of a previous frame F_{t−i} as input to generate channel and spatial attention maps in a parallel manner. Then, the two attention maps are multiplied with the feature of the current frame F_t, followed by element-wise addition, to produce the temporary fused feature F_{t−i,t}. The temporary fused features generated by all of the previous frames are then further aggregated by

$\hat{F}_t = \sum_{i=1}^{n-1} F_{t-i,t}$,

where n indicates the number of input frames. The second fusion strategy successively aggregates the history frames in a serial mode. As shown in Figure 4b, the feature of history frame F_{t−i} is first fed to CA; after applying the resulting attention to F_{t−i+1}, the intermediate result is further input to SA to generate a two-frame aggregated feature F_{t−i,t−i+1}. This result is then taken as the input of the CA of the next frame, and the aggregation proceeds until the current frame is processed. Note that the order of CA and SA is exchangeable. The third way is the mixed mode, where attention is applied between each pair of F_{t−i} and F_t serially, while the final aggregation is implemented by summation just like in the parallel mode. Detailed experimental studies of the different modes are presented in Section 4.2.
The specific architectures of the CA and SA in the STAM are illustrated in Figure 5. As shown in Figure 5a, the CA employs global average-pooling and global max-pooling to integrate the spatial information of the input features. After being processed by a shared Multi-Layer Perceptron (MLP), the feature vectors are aggregated by element-wise summation to generate a channel attention map $M_C(i)$:

$M_C(i) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F_{t-i})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{t-i}))\right)$, $i = 1, \dots, n-1$,

where $\sigma$ indicates the sigmoid function and $n$ is the number of continuous frames. For the SA, average-pooling and max-pooling operations are applied along the channel axis. The pooled features are concatenated and passed to a standard convolution layer, producing a spatial attention map $M_S(i)$ as

$M_S(i) = \sigma\left(f^{7 \times 7}([\mathrm{AvgPool}(F_{t-i}); \mathrm{MaxPool}(F_{t-i})])\right)$,

where $f^{7 \times 7}$ denotes a convolution operation with a 7 × 7 filter size.
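For illustration, below is a minimal TensorFlow/Keras sketch of the CA and SA blocks together with the parallel-mode fusion. The reduction ratio and the exact way the two attention maps are combined with F_t are our assumptions where the figures leave details implicit.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttention(layers.Layer):
    """Channel attention: shared MLP over global average- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = tf.keras.Sequential([
            layers.Dense(channels // reduction, activation="relu"),
            layers.Dense(channels),
        ])

    def call(self, x):                                   # x: (B, H, W, C)
        avg = tf.reduce_mean(x, axis=[1, 2])             # global average pooling
        mx = tf.reduce_max(x, axis=[1, 2])               # global max pooling
        attn = tf.sigmoid(self.mlp(avg) + self.mlp(mx))  # M_C(i)
        return tf.reshape(attn, [-1, 1, 1, attn.shape[-1]])

class SpatialAttention(layers.Layer):
    """Spatial attention: 7x7 convolution over channel-wise pooled features."""
    def __init__(self):
        super().__init__()
        self.conv = layers.Conv2D(1, 7, padding="same", activation="sigmoid")

    def call(self, x):                                   # x: (B, H, W, C)
        avg = tf.reduce_mean(x, axis=-1, keepdims=True)
        mx = tf.reduce_max(x, axis=-1, keepdims=True)
        return self.conv(tf.concat([avg, mx], axis=-1))  # M_S(i): (B, H, W, 1)

def stam_parallel(hist_feats, f_t, ca, sa):
    """Parallel-mode STAM: each history frame modulates F_t; the temporary
    fused features F_{t-i,t} are aggregated by summation."""
    return tf.add_n([ca(f) * f_t + sa(f) * f_t for f in hist_feats])
```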

Post-Processing
As we regard lane mark detection as an instance segmentation problem, an arbitrary number of lane marks can be inferred and lane changes can be handled. Since the network assigns similar embeddings to pixels of the same lane mark, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [41] algorithm is applied to determine the cluster assignments and form unique lane mark instances. To obtain the final detection result, precise coordinates of each lane mark have to be distilled from the candidate areas. Here, we first sample lane points every 10 pixels along the y axis, then perform curve fitting to obtain a simpler description of the lane marks and filter out outliers.
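A compact sketch of this post-processing with scikit-learn's DBSCAN is shown below; the eps/min_samples values and the quadratic curve model are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def lanes_from_predictions(binary_mask, embeddings, eps=0.5, min_samples=100):
    """Cluster lane-pixel embeddings into instances and fit a curve per lane (sketch)."""
    ys, xs = np.nonzero(binary_mask)
    feats = embeddings[ys, xs]                   # (P, D) embeddings of lane pixels
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

    lanes = []
    for lane_id in set(labels) - {-1}:           # -1 is DBSCAN's noise label
        sel = labels == lane_id
        ly, lx = ys[sel], xs[sel]
        # Sample one point per 10-pixel band along the y axis.
        pts = []
        for y0 in range(ly.min(), ly.max(), 10):
            band = (ly >= y0) & (ly < y0 + 10)
            if band.any():
                pts.append((lx[band].mean(), y0 + 5))
        if len(pts) < 3:
            continue
        pts = np.asarray(pts)
        # Curve fitting (x as a function of y) yields a compact lane
        # description and suppresses outlying samples.
        coeffs = np.polyfit(pts[:, 1], pts[:, 0], deg=2)
        lanes.append((coeffs, (ly.min(), ly.max())))
    return lanes
```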

Datasets
To extensively evaluate the proposed method, we conduct experiments on two datasets: Tusimple and ApolloScape. Both of the datasets provide image sequences for training and testing.
Tusimple. TuSimple [36] is widely used in existing works on lane mark detection. It was collected on highway roads under nice weather conditions at different times of day. The images have a resolution of 1280 × 720 and contain 2-5 lanes for detection. The dataset consists of 3626 and 2782 image sequences for training and testing, respectively. Each sequence comprises 20 continuous frames, with only the last frame annotated by sampling points. To construct the ground-truth binary and instance segmentation maps for training, we connect all of the annotated points together to form an intact curve per lane.
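To illustrate, a minimal sketch of this ground-truth rasterization with OpenCV follows; the line thickness is an assumption, as the paper does not state it.

```python
import cv2
import numpy as np

def build_gt_maps(lanes, height=720, width=1280, thickness=5):
    """Rasterize Tusimple point annotations into training targets (sketch).

    lanes: list of lanes, each an iterable of (x, y) sample points.
    """
    binary = np.zeros((height, width), dtype=np.uint8)
    instance = np.zeros((height, width), dtype=np.uint8)
    for lane_id, pts in enumerate(lanes, start=1):
        pts = np.int32(pts).reshape(-1, 1, 2)
        # Connect all annotated points into one intact curve per lane.
        cv2.polylines(binary, [pts], isClosed=False, color=1, thickness=thickness)
        cv2.polylines(instance, [pts], isClosed=False, color=lane_id, thickness=thickness)
    return binary, instance
```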
ApolloScape. ApolloScape [37] is a large-scale dataset provided by the Baidu corporation. It contains seven different tasks for autonomous driving, including lane segmentation. For this task, a diverse set of stereo video sequences was recorded in urban traffic scenarios with high-quality pixel-level annotations. The resolution of images in ApolloScape is 3384 × 2710. Since the ApolloScape lane dataset only provides pixel-level semantic annotations without instance-level discrimination, and we focus on detecting lane marks rather than recognizing all 35 categories in the dataset, we selected 5519 frames and annotated them with sampling points interpolated by cubic splines. For each training image, the previous 4 frames are provided as input without labeling. The dataset is split into 3317 frames for training, 608 for validation, and 1595 for testing.

Implementation Details
Our model is implemented in TensorFlow [42] and trained on a GTX 1080Ti GPU. The network is trained with an embedding dimension of 4 and $\delta_v = 0.5$, $\delta_d = 3$, $\alpha = 1$, $\beta = 1$, $\gamma = 0.001$. All images are rescaled to 512 × 256 with nearest-neighbor interpolation. During the training process, we employ an SGD (Stochastic Gradient Descent) [43] optimizer with a base learning rate of $5 \times 10^{-3}$, a momentum of 0.9, and a batch size of 4. A poly learning rate policy is used with power 0.9 and a maximum of 100 K iterations. We also apply data augmentation, including random cropping, random horizontal flipping, and color augmentations.
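As a reference, the optimizer setup described above can be written in current TensorFlow/Keras roughly as follows (the exact API of the TensorFlow version used in the paper may differ):

```python
import tensorflow as tf

# Poly learning rate policy: lr = base_lr * (1 - step / max_iter) ** power.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-3,
    decay_steps=100_000,        # maximal iteration 100 K
    end_learning_rate=0.0,
    power=0.9,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
```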

Evaluation Criteria
For ablation studies and comparisons with other lane mark detection methods, different metrics are adopted to evaluate the results on each particular dataset.
Tusimple. Here, we follow the official evaluation criteria [36]. The predicted lanes are sampled by points at fixed intervals along the y axis. Predicted points whose distance to the ground truth is less than 20 pixels are regarded as correct points. The accuracy is calculated as

$Accuracy = \frac{\sum_{im} C_{im}}{\sum_{im} S_{im}}$,

where $C_{im}$ is the number of correct points and $S_{im}$ is the total number of lane points in the image. Lane marks with an accuracy greater than 85% are considered True Positives (TP); otherwise they are counted as False Positives (FP) or False Negatives (FN). The F1-measure is taken as the primary evaluation metric, which is computed as

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$,

where $Precision = \frac{TP}{TP + FP}$ and $Recall = \frac{TP}{TP + FN}$.

ApolloScape. While Tusimple uses a distance metric, evaluation on ApolloScape refers to the area metric used in the CULane dataset [18]. Each lane mark is viewed as a 30-pixel-wide line connecting the sampled lane points. We calculate the IoU (Intersection-over-Union) [44] between the ground truth and the prediction. In a lane-wise fashion, a predicted lane instance is counted as a True Positive (TP) when its IoU is higher than a certain threshold. We consider thresholds of 0.3 and 0.5, corresponding to loose and strict evaluations, for the experiments on ApolloScape. The F1 score is also treated as the major evaluation metric, defined as mentioned earlier.
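The following sketch shows how these metrics can be computed; it simplifies the official scripts (for instance, matching of predicted to ground-truth lanes is assumed to be done already).

```python
import numpy as np

def lane_accuracy(pred_xs, gt_xs, thresh=20):
    """Fraction of sampled points within 20 px of ground truth at shared y positions."""
    diffs = np.abs(np.asarray(pred_xs) - np.asarray(gt_xs))
    return np.mean(diffs < thresh)        # C_im / S_im for one lane

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A lane counts as TP when its accuracy exceeds 0.85, per the criteria above.
```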

Ablation Study
To verify our method, comprehensive ablation studies are carried out on the Tusimple dataset in this section.
Effects of multi-frames. Firstly, we investigate the effectiveness of aggregating information from multiple frames. As shown in Table 1, compared with the single-frame baseline, using multiple frames does help to increase the accuracy and reduce wrong predictions. This can be explained by the fact that multi-frame fusion brings richer information and enhances the features of the current frame, which helps to improve the performance. Note that in the second row of Table 1, we also list the results of the baseline equipped with the proposed STAM for comparison. The results show that employing 4 frames obtains the best performance, with an F1 score 3.51% higher than the original single-frame baseline. Although adopting 5 frames achieves an accuracy comparable to 4 frames, we empirically use 4 frames in our method, considering the trade-off between computing cost and performance.

Effectiveness of each component. Here, we study the advantages of multi-frame pre-alignment and the proposed STAM. The performance of each component is summarized in Table 2. For the baseline, we take 4 frames as input without pre-alignment and directly fuse the extracted multi-frame features by an element-wise sum. For comparison, we perform multi-frame alignment and then replace the element-wise sum with the STAM step by step. As the results show, both of the proposed modules enhance the F1 metric, which demonstrates their capabilities.

Different modes of STAM. We further try the STAM with different modes. As introduced in Section 3.3, the STAM has three modes, i.e., parallel, serial, and mixed. Depending on whether CA is placed in front of SA, the serial and mixed modes each have two configurations: "C-S" and "S-C". The results of these modes are compared in Table 3. As we can see, the mixed mode with the C-S order achieves the highest F1 score for the proposed STAM.

Comparison with other aggregation strategies. To further verify the effectiveness of the STAM, we compare it with other aggregation strategies. The results are presented in Table 4. In the first three rows, the features of multiple frames are aggregated respectively by simple element-wise summation, a double-layer ConvLSTM (Convolutional Long Short-Term Memory) [45], and a cosine-similarity-based weighted sum. The bottom three rows use attention aggregation mechanisms. ST-DANet (Spatial-Temporal Dual Attention Network) is based on DANet [46], using a pure matrix operation with softmax and two learnable weighting coefficients. ST-PSA (Spatial-Temporal Polarized Self-Attention) refers to the PSA block [47], which employs convolution, pooling, and normalization operations to further enhance the representation capacity along the channel and spatial dimensions. It can be seen that the attention mechanisms achieve higher F1 scores than the other methods, among which the proposed STAM works best.

In summary, we have verified the effectiveness of using multiple frames, pre-alignment, and the STAM. The ablation results also show that an input of 4 frames and the mixed mode with the C-S order for the STAM achieve the best performance. This setting is therefore kept for the later evaluation on the Tusimple dataset.

Experiments on Tusimple
We compare our method with other existing lane mark detection methods on the Tusimple dataset, and the results are shown in Table 5. The highest rank is in bold and the second one is underlined. Our method achieves competitive performance in terms of a high F1 value, which is very close to first place. Note that the proposed network is trained from scratch without any pre-trained models or extra training datasets.

Figure 6 presents visual comparisons with the lane instance segmentation methods on the Tusimple dataset. It can be observed that our method has fewer wrong or missing detections, reaching a better consistency with the ground truth. Compared with single-frame-based methods such as ENet [39] and DenseNet [48], as well as our single-frame baseline, our segmentation results have a higher localization accuracy, with thinner lane contours centralized on the true lane areas. This reduces the possibility of wrongly predicting background pixels near the ground truth as lane mark pixels, and reduces the fuzzy adhesive regions between adjacent lane marks. Besides, our method robustly segments the entire instance of lane marks even when they are occluded by vehicles.

Figure 6. The visualization results of lane mark detection on the Tusimple dataset. We compare the proposed method with ENet [39], DenseNet [48], and our single-frame baseline. The colors of the lane marks are random and only serve to distinguish different lane mark instances.
When compared with the best RNN-based multi-frame method [28], our method is able to outperform it in some challenging scenarios, such as lane marks occluded by vehicles, as shown in Figure 7. To further quantitatively compare the robustness of our method and [28] under such cases, we selected 583 testing images with occlusion or shadow in the Tusimple dataset for evaluation. Since the public source code of [28] does not provide instance segmentation among lanes, we added post-processing for instance segmentation on top of it. As shown in Table 6, although the resulting performance on the total testing images is not as good as that published by the authors of [28], we only pay attention to the performance degradation caused by the challenging occlusion or shadow situations. When encountering challenging scenes, the performance of [28] decreases more than that of our method. The results indicate that our method has high robustness under occlusion situations, thanks to the special design of the spatial-temporal fusion of multiple frames.

Figure 7. Comparison with [28] in the Tusimple benchmark under occlusion situations.

Table 6. Robustness comparison with the best method [28] in the Tusimple benchmark. The lower the ∆F1, the higher the robustness of the method. Results marked as reproduced were obtained by using its source code.

Experiment on ApolloScape
To verify the effectiveness of the proposed method under urban environments, we further test our method on the ApolloScape dataset. As far as we know, few results have been publicly reported on the ApolloScape lane segmentation dataset. Therefore, we only demonstrate the ablation results of our own method.
Firstly, we investigate the effect of fusing different numbers of frames. As shown in Table 7, no matter how many frames are used, aggregating multiple frames works better than detecting lane marks in a single frame. For the ApolloScape dataset, adopting two frames achieves the optimal performance, with gains of 3.87% and 5.02% in F1 score for IoU thresholds of 0.5 and 0.3, respectively. As the number of frames increases, the results tend to become worse. Compared with the TuSimple dataset, a larger movement exists between neighboring images acquired in ApolloScape, which may weaken the correlation among the images.

For ApolloScape, we also evaluate the impact of each proposed component (one at a time): alignment of multiple frames and the STAM. The ablation study results are shown in Table 8. For the baseline, unaligned frames are taken as input, whose features are simply aggregated by element-wise sum. To verify the effects of the proposed modules step by step, we first align the multiple frames and then insert the STAM. As we can see, no matter which IoU threshold is adopted, both multi-frame alignment and the STAM are beneficial to performance.
The visualization results on ApolloScape are demonstrated in Figure 8. Compared with the single-frame baseline, using multiple frames better preserves the integrity and continuity of lane marks. Besides, by integrating richer information from multiple frames, our method shows strong robustness in challenging scenarios such as low illumination, vehicle occlusion, heavy shadow, and curved lanes.

Conclusions
In this work, we performed lane mark detection using multiple frames of continuous driving scenes rather than detecting the lane marks from one single image. With richer information extracted from multiple continuous images, the proposed method achieves accurate and robust detection despite serious vehicle occlusion, heavy shadows, and severe lane mark abrasion in difficult conditions.
To better utilize the spatial and temporal information from multiple frames, the history frames were pre-aligned with the current key frame before entering the encoder-decoder instance segmentation network. The sequential encoded features were attentively aggregated using the proposed STAM, followed by the two-branch decoder and post-processing to obtain the final lane mark predictions. In ablation studies, we verified the advantage of using multiple frames and the effectiveness of each proposed component. We also tried different modes of the STAM and compared the STAM with other aggregation methods.
The evaluation results demonstrated that our method could achieve state-of-the-art performance, with higher F1 scores and fewer incorrect predictions than most single-frame methods. Furthermore, the proposed method also worked better than other multi-frame methods in some challenging scenarios, demonstrating stronger robustness.