1. Introduction
Autonomous driving has attracted increasing attention in recent years, and one of its most important and challenging perception tasks is automatic lane detection [1]. Lane detection underpins advanced driving functions such as lane departure warning, Advanced Driver Assistance Systems (ADAS), lane keeping, and path planning [2]. The methods for lane detection can be roughly divided into two categories: traditional machine vision and deep learning.
Traditional machine vision relies on manual feature extraction [3], typically comprising image preprocessing, edge detection, edge enhancement, the Hough transform, and lane-line fitting [4]. For example, Borkar et al. [5] applied an inverse perspective transform to the image, removed outliers with random sample consensus (RANSAC), and finally predicted the lane with a Kalman filter. In [6], a Gaussian filter, an improved Hough transform, and the K-means clustering algorithm were combined for lane detection. These methods can reliably identify lanes in simple scenes, but they are vulnerable to shadows on the road and lose accuracy in complex and variable driving scenes [1].
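To make the traditional pipeline concrete, the sketch below implements the voting step of a standard Hough transform in plain NumPy on a synthetic edge map. This is a minimal illustration of the classic technique, not any cited author's implementation; the function name and the toy edge map are our own.

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for each edge pixel -- the classic step
    that follows edge detection in traditional lane-detection pipelines."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))          # largest possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        # rho = x cos(theta) + y sin(theta), one vote per theta bin
        rhos = (x * np.cos(thetas) + y * np.sin(thetas)).round().astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return acc, thetas, diag

# Synthetic "edge map" containing a single diagonal line y = x.
edges = np.zeros((50, 50), dtype=np.uint8)
for i in range(50):
    edges[i, i] = 1

acc, thetas, diag = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
best_theta = thetas[theta_idx]                   # ~3*pi/4 for the line y = x
best_rho = rho_idx - diag                        # 0 for a line through origin
```

All 50 edge pixels of the line y = x vote into the single accumulator cell (rho = 0, theta = 3&pi;/4), which is exactly why the Hough transform is robust to isolated noise pixels but, as noted above, degrades when shadows produce many spurious edges.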
A review of the literature shows that neural networks have been widely applied in many fields because of their strong representation-learning ability, e.g., module recognition [7], ad click prediction [8], accident detection [9], and medical image segmentation [10]. Deep-learning-based methods have drawn increasing interest because of their robustness and their accuracy and speed in feature extraction. In [11], a Spatial Convolutional Neural Network (SCNN) was proposed for lane detection. SCNN slices the feature map by rows and columns, and the output is obtained by convolution, nonlinear activation, and summation operations carried out in four directions (up, down, left, and right), which strengthens the propagation of spatial information. The structure is also flexible enough to be embedded in other off-the-shelf networks. However, it fixes the number of detectable lane lines and can be misled by features that resemble lane lines. In [2], lane detection was achieved by end-to-end neural networks: two Convolutional Neural Network (CNN) modules in series performed instance segmentation, after which a CNN classifier detected lanes from the indexed segmentation results and projected the final result back onto the original image. Neven et al. [12] applied a multi-branch classification network, while a second network was trained to estimate the parameters of the inverse perspective transformation, improving robustness to changes in the ground plane.
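The slice-by-slice message passing that SCNN uses can be sketched as follows. For brevity this NumPy sketch shows only the top-to-bottom direction (the full scheme repeats it in all four directions) and uses a single shared 1-D kernel; the function and variable names are our own illustration, not the authors' code.

```python
import numpy as np

def scnn_downward(x, w):
    """SCNN-style slice message passing, top-to-bottom direction only.
    x: feature map of shape (C, H, W); w: 1-D kernel shared across rows.
    Each row receives the ReLU of the convolved previous row, so activations
    propagate spatially down the image."""
    c, h, width = x.shape
    out = x.copy()
    pad = len(w) // 2
    for i in range(1, h):
        prev = np.pad(out[:, i - 1], ((0, 0), (pad, pad)), mode="edge")
        msg = np.stack([np.convolve(prev[ch], w, mode="valid")
                        for ch in range(c)])
        out[:, i] += np.maximum(msg, 0.0)   # ReLU before adding to next slice
    return out

x = np.zeros((1, 4, 5))
x[0, 0, 2] = 1.0                            # single activation in the top row
w = np.array([0.25, 0.5, 0.25])
y = scnn_downward(x, w)                     # activation spreads to lower rows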
Some scholars have also used Generative Adversarial Network (GAN) methods to provide a smooth tradeoff between accuracy and running speed. Zhang et al. [3] proposed Ripple-GAN, a network without post-processing that integrates multi-semantic segmentation and Wasserstein generative adversarial training. In the first part of the network, white Gaussian noise is added to the source image as input, and two discriminators are designed to enhance the extracted lane-line features. The second part takes the gradient map and the output of the first part as input to a network that combines residual learning [13] with an end-to-end architecture to produce the final result. Eliminating some feature-reuse modules makes the network simpler and more efficient. Many methods that combine traditional machine vision with deep learning are also available. Reference [1] proposed an adaptive, optimized machine-vision method for lane segmentation, including color-space transformation, perspective transformation, the Hough transformation, lane fitting, and lane filling. Weak labels were then generated by qualitative evaluation and used to train advanced neural networks such as ResNet and SegNet. In [14], a combination of image preprocessing and a neural network was applied to complete lane detection.
In practical applications, lane detection is a sequential process, and some scholars have studied video lane detection from this perspective. In [15], Zou et al. inserted a Convolutional Long Short-Term Memory (ConvLSTM) module between the encoder and decoder to fuse information from consecutive frames and enhance the extraction of contextual information from the feature map. ConvLSTM is a variant of LSTM [16] that uses convolutional operations, enabling LSTM to process data with spatial structure. Reference [17] embedded Convolutional Gated Recurrent Units (ConvGRU) into the encoder to memorize and learn low-level features; the output of the encoder was then fed into several ConvGRU modules to better process these spatial-temporal signals. In [18], features are extracted from keyframes and then propagated to non-key frames by spatial warping, which greatly improves computing speed. A keyframe selection mechanism was introduced in [19] to obtain further gains in computational efficiency. In [20], Zhu et al. presented an interleaved model framework in which multiple feature extractors can be used simultaneously or independently; ConvLSTM is then used for aggregation and optimization, and reinforcement learning determines the order of the feature extractors to achieve a trade-off between running speed and accuracy.
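The idea behind ConvLSTM, replacing the dense matrix multiplications of an LSTM cell with convolutions so the recurrent state keeps its spatial layout, can be sketched in a few lines. This is a minimal single-channel NumPy sketch under our own simplifications (one kernel per gate, naive "same" correlation), not the implementation used in the cited works.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'same'-size 2-D correlation of a single-channel map."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, kx, kh):
    """One ConvLSTM step: the four LSTM gates (i, f, o, g) are produced by
    convolutions over input x and hidden state h instead of dense matmuls,
    so the cell state c stays a spatial map."""
    gates = [conv2d(x, kx[g]) + conv2d(h, kh[g]) for g in range(4)]
    i, f, o = (sigmoid(g) for g in gates[:3])
    g = np.tanh(gates[3])
    c_new = f * c + i * g                   # standard LSTM cell update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))             # one video-frame feature map
h = np.zeros((6, 6)); c = np.zeros((6, 6))  # zero initial state
kx = rng.standard_normal((4, 3, 3)) * 0.1   # input kernels, one per gate
kh = rng.standard_normal((4, 3, 3)) * 0.1   # hidden-state kernels
h, c = convlstm_step(x, h, c, kx, kh)
```

Feeding consecutive frame feature maps through `convlstm_step` is what lets the module in [15] accumulate temporal context while preserving where in the image each lane response occurred.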
Beyond the research perspectives above, some scholars have studied the problem from a perspective closer to people's driving habits. In this regard, Zhou et al. [17] argued that lane detection should be combined with curb detection. They used a typical encoder-decoder network, followed by a series of convolution layers and a softmax operation, optimized with a custom class-weighting scheme for class segmentation and pixel classification.
Although deep learning has been studied in great depth, little of the literature addresses real-time performance alongside the pursuit of accuracy [21]. Deep-learning methods are mostly based on pixel-level classification, which imposes a heavy computational burden. Ref. [22] proposed selecting lane positions in predefined rows of the image, instead of classifying each lane pixel based on its local receptive field; specifically, the correct lane position is selected from each row of the gridded image. However, the shape loss introduced in the loss function tends to predict straight lines within the constraining grid, so its recognition of curved lanes is not ideal. In [23], Ye et al. improved this approach by finding the correct lane position both horizontally and vertically, which boosted the original network's ability to identify curves.
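The row-anchor decoding step described above can be sketched as follows: for each predefined row the network scores a small set of grid cells plus one "no-lane" class, and the lane position is simply the highest-scoring cell. The function name and the toy logits are our own illustration of this row-selection idea, not the cited authors' code.

```python
import numpy as np

def decode_row_anchors(logits):
    """Row-anchor decoding: logits has shape (num_rows, num_cells + 1),
    where the last column is the 'lane absent' class. Returns one grid-cell
    index per row, or None where no lane is present."""
    num_rows, num_classes = logits.shape
    choice = logits.argmax(axis=1)
    no_lane = num_classes - 1
    return [None if c == no_lane else int(c) for c in choice]

# Toy logits: 4 row anchors, 5 grid cells + 1 absence class.
logits = np.array([
    [0.1, 2.0, 0.3, 0.1, 0.0, 0.2],   # lane in cell 1
    [0.1, 0.2, 3.0, 0.1, 0.0, 0.2],   # lane in cell 2
    [0.1, 0.2, 0.3, 2.5, 0.0, 0.2],   # lane drifts right: cell 3
    [0.1, 0.2, 0.3, 0.1, 0.0, 4.0],   # no lane in this row
])
positions = decode_row_anchors(logits)    # [1, 2, 3, None]
```

Because only `num_rows * (num_cells + 1)` scores are produced instead of a full-resolution segmentation mask, this formulation is dramatically cheaper than per-pixel classification, which is the source of its speed advantage.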
In [24], a Self-Attention Distillation (SAD) module is added to the decoder of ENet, and the trained network's output is fitted with a cubic spline curve to obtain the final result. The SAD module makes it possible to reduce the number of layers while maintaining the network's accuracy. Ref. [25] proposed slicing the image and using dilated convolutions to reduce the number of parameters and improve running speed. Liu et al. [26] designed a network architecture comprising a feature-exchanging module and a feature-fusing module: the feature-fusing module uses multiple small convolution kernels to reduce parameter computation, while the feature-exchanging module uses spatial convolution and dilated convolution to make full use of the information in the network.
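The core of attention distillation as used by SAD is an activation-based attention map, the channel-wise sum of squared activations, normalised over the spatial map, with a loss that makes a shallow layer mimic the map of a deeper one. The NumPy sketch below illustrates that loss under our own simplifications (single pair of layers, equal spatial sizes, mean-squared distance); it is not the authors' implementation.

```python
import numpy as np

def attention_map(feat, eps=1e-8):
    """Activation-based attention map: sum of squared channel activations,
    L2-normalised so maps from layers of different width are comparable."""
    amap = (feat ** 2).sum(axis=0)            # (C, H, W) -> (H, W)
    return amap / (np.linalg.norm(amap) + eps)

def sad_loss(shallow_feat, deep_feat):
    """Self-distillation target: the shallow layer's attention map is pulled
    toward the deeper layer's map (spatial sizes assumed equal here)."""
    a, b = attention_map(shallow_feat), attention_map(deep_feat)
    return float(((a - b) ** 2).mean())

rng = np.random.default_rng(1)
shallow = rng.standard_normal((8, 16, 16))    # early, narrow layer
deep = rng.standard_normal((16, 16, 16))      # later, wider layer
loss = sad_loss(shallow, deep)                # positive: maps disagree
identical = sad_loss(deep, deep)              # zero: a layer matches itself
```

Because the target maps come from the network's own deeper layers, no external teacher network is needed, which is what allows SAD to shed layers without a matching drop in accuracy.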
Inspired by the above work, the purpose of this study is to design a more lightweight, real-time network for lane detection. Combining the SAD module with an attention mechanism, a lightweight neural network architecture based on Deeplabv3plus [27] is designed to detect lanes. The main contributions of this paper are as follows:
(a) A lightweight, end-to-end network based on Deeplabv3plus handles the problem of lane detection and achieves better real-time performance than several existing methods;
(b) The attention mechanism and attention distillation are applied in the proposed method to remedy the loss incurred by the reduction in convolution layers;
(c) The effectiveness of the proposed method is verified by several types of comparison among different networks, and the usefulness of the individual modules is shown in ablation experiments.
The remainder of this paper is organized as follows:
Section 2 introduces the work related to the proposed method.
Section 3 presents the whole structure of the proposed network.
Section 4 reports the results of the experiments.
Section 5 concludes the work of this paper and briefly analyses the limitations.