Article

The Improved Deeplabv3plus Based Fast Lane Detection Method

Zhong Wang, Yin Zhao, Yang Tian, Yahui Zhang and Landa Gao

1 School of Vehicle and Energy, Yanshan University, Qinhuangdao 066004, China
2 School of Mechanical Engineering, Yanshan University, Qinhuangdao 066004, China
3 Hebei Innovation Center for Equipment Lightweight Design and Manufacturing, Qinhuangdao 066004, China
4 Research Institute of Highway, Ministry of Transport, Beijing 100088, China
* Author to whom correspondence should be addressed.
Actuators 2022, 11(7), 197; https://doi.org/10.3390/act11070197
Submission received: 17 June 2022 / Revised: 7 July 2022 / Accepted: 11 July 2022 / Published: 18 July 2022

Abstract

Lane detection is one of the most basic and essential tasks for autonomous vehicles, so fast and accurate lane recognition has become a hot topic in both industry and academia. Deep learning with neural networks is a common approach to lane detection. However, because of the heavy computational burden of such networks, their real-time performance often falls short of the requirements of fast-changing real driving scenes. This paper proposes a lightweight network that combines the Squeeze-and-Excitation block and the Self-Attention Distillation module, based on the existing deeplabv3plus network, to specifically improve its real-time performance. In experimental verification, the proposed network achieved 97.49% accuracy and 60.0% MIOU at a runtime of 8.7 ms, so the network structure achieves a good trade-off between real-time performance and accuracy.

1. Introduction

With the improvement of living standards and growing demands on quality of life, autonomous driving has attracted more and more attention. In the field of autonomous driving, an important problem is automatic lane detection, which is one of the most challenging perceptual tasks at present [1]. Lane detection is a subtask of advanced autonomous driving functions such as lane departure warning, Advanced Driving Assistance Systems (ADAS), lane keeping, and path planning [2]. In recent years, the methods for lane detection can be roughly divided into two categories: traditional machine vision and deep learning.
Manual feature extraction is a prerequisite of traditional machine vision [3], including image preprocessing, edge detection, edge enhancement, the Hough transform, and lane line fitting [4]. For example, Borkar et al. [5] used the inverse perspective transform to transform the image, then applied random sample consensus (RANSAC) to remove outliers, and finally used a Kalman filter to predict the lane. In [6], a Gaussian filter, an improved Hough transform, and a K-means clustering algorithm were combined for lane detection. These methods can reliably identify lanes in simple scenes, but they are vulnerable to shadows on the road and have poor accuracy in complex and variable driving scenes [1].
A review of the literature shows that neural networks have been widely applied in many fields because of their strong representation learning ability, e.g., modulation recognition [7], ad click prediction [8], accident detection [9], and medical image segmentation [10]. There has been increasing interest in deep learning methods because of their strong robustness and reliability in the accuracy and speed of feature extraction. In [11], a Spatial Convolutional Neural Network (SCNN) was proposed to detect lanes. The feature map in SCNN is sliced into rows and columns, and the output is obtained by convolution, nonlinear activation, and summation operations applied from four directions (up, down, left, and right), which strengthens the propagation of spatial information. This structure is also flexible enough to be embedded in other off-the-shelf networks. However, it fixes the number of detected lane lines and is disturbed by features that are similar to lane lines. In [2], lane detection was achieved by a series of end-to-end neural networks: two Convolutional Neural Network (CNN) modules in series performed instance segmentation, and a CNN classifier was then guided to detect the lanes based on the indices of the previous instance segmentation results and to map the final results back to the original image. Neven et al. [12] applied a multi-class network with two branches, while another network was trained to obtain the parameters of the inverse perspective transformation to improve robustness against changes of the ground plane.
Some scholars also use Generative Adversarial Network (GAN) methods to provide a smooth tradeoff between accuracy and running speed. Zhang et al. [3] proposed a Ripple-GAN network without post-processing, which integrates multi-semantic segmentation and Wasserstein generative adversarial training. In the first part of the network, white Gaussian noise is added to the source image as input, and two discriminators are designed to enhance the extracted lane line features. The second part uses the gradient map and the output of the first part as input to a network that combines residual learning [13] and an end-to-end architecture to output the final result. Eliminating some feature-reuse modules makes the network simpler and more efficient. Many methods that combine traditional machine vision with deep learning are also available. Reference [1] proposed an adaptive and optimized machine vision method for lane segmentation, including color space transformation, perspective transformation, Hough transformation, lane fitting, and lane filling. Weak labels were then generated by a qualitative evaluation and used to train other advanced neural networks such as RESNET and SEGNET. In [14], a combination of image preprocessing and a neural network is applied to complete lane detection.
In practical applications, lane detection is a sequential process, and some scholars have studied video lane detection from this perspective. In [15], Zou et al. inserted a Convolutional Long Short-Term Memory (CONVLSTM) module between the encoder and decoder to fuse information from consecutive frames and enhance the ability to extract contextual information from the feature map. CONVLSTM is a kind of LSTM [16] that uses convolutional operations, which enables the LSTM to process data with spatial structure. Reference [17] embedded Convolutional Gated Recurrent Units (CONVGRU) into the encoder to memorize and learn low-level features; the output of the encoder was then fed into several CONVGRU modules to better process these spatial-temporal signals. In [18], feature extraction is carried out on keyframes and then propagated to non-key frames by spatial warping, which greatly improves the computing speed. A keyframe selection mechanism was added in [19] to obtain further gains in computational efficiency. In [20], Liu et al. presented an interleaved model framework in which multiple feature extractors can be used simultaneously or independently; a CONVLSTM is used for aggregation and optimization, and reinforcement learning determines the order of these feature extractors to achieve a trade-off between running speed and accuracy.
In addition to the above research directions, some scholars study the problem from a perspective closer to human driving habits. In this regard, Zhou et al. [17] argued that lane detection should be combined with curb detection. They used a typical encoder-decoder network, followed by a series of convolution layers and a SOFTMAX operation, optimized with a custom class-weighting scheme for class segmentation and pixel classification.
Although deep learning has been studied in great depth, little of the literature considers improving real-time performance rather than only pursuing accuracy [21]. Deep learning methods are mostly based on pixel-level processing, which produces a great computational burden. Ref. [22] showed a way of selecting lane positions in predefined rows of the image instead of classifying each pixel of the lane based on the local receptive field. Specifically, the correct lane position is selected from each row of the gridded image. However, the shape loss that is introduced into the loss function tends to predict straight lines within the constraint grid, so the method is not ideal for identifying curved lanes. In [23], Ye et al. improved this approach by finding the correct lane position both horizontally and vertically, which boosted the ability of the original network to identify curves.
In [24], a Self-Attention Distillation (SAD) module is added to the decoder of E-NET, and the output of the trained network is fitted with a cubic spline curve to obtain the final result. The SAD module can reduce the number of layers while maintaining the accuracy of the network. Ref. [25] used sliced images and dilated convolution to reduce the number of parameters and improve the running speed. Liu et al. [26] designed a network architecture that includes a feature exchanging module and a feature fusing module. The feature fusing module uses multiple small convolution kernels to reduce the number of parameters, while the feature exchanging module uses spatial convolution and dilated convolution to make full use of the information in the network.
Inspired by the above work, the purpose of this study was to design a more lightweight and real-time network for lane detection. Consequently, combining the SAD module and attention mechanism, a lightweight neural network architecture is designed based on Deeplabv3plus [27] to detect lanes. The main contributions of this paper are as follows:
(a) A lightweight and end-to-end network based on deeplabv3plus handles the problem of lane detection, which can enable better real-time performance than several existing methods;
(b) The attention mechanism and attention distillation are applied in the proposed method to remedy the accuracy loss incurred by the reduction in convolution layers;
(c) The effectiveness of the proposed method is verified by comparisons among several different networks, and the usefulness of the individual modules is shown in the ablation experiments.
The remainder of this paper is organized as follows: Section 2 introduces the work related to the proposed method. Section 3 presents the overall structure of the proposed network. Section 4 reports the results of the experiments. Section 5 concludes the paper and briefly analyses its limitations.

2. Related Work

2.1. Deeplabv3plus

Deeplab is a series of classical deep-learning semantic segmentation networks developed by Google with superior performance. Deeplabv3plus is an improved version of Deeplabv3 [28] and is the state of the art in this series. Its ability to restore image details makes its output excellent, and it is applied in practice by many autonomous driving companies for lane detection and environment perception. The network architecture is large and complex, but its semantic segmentation results have almost no obvious shortcomings.
In the encoder of this network, the backbone uses atrous (dilated) convolution to enlarge the receptive field and improve feature extraction. The low-level features of the backbone are passed to the decoder, while the high-level semantic features are processed by an Atrous Spatial Pyramid Pooling (ASPP) module to better capture the global semantic information of the image. The ASPP applies convolution kernels with different dilation rates to the feature map, which enhances global feature learning. In the decoder, the output of the ASPP is fused with the low-level features by the concat operation and then restored to the same resolution as the original image through a series of convolution layers and upsampling. The original deeplabv3plus architecture is presented in Figure 1.
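To make the ASPP idea concrete, the following PyTorch sketch builds parallel atrous convolution branches with different dilation rates plus an image-level pooling branch and fuses them with a 1 × 1 convolution. The channel counts and dilation rates (6, 12, 18) are illustrative assumptions, not the exact configuration of deeplabv3plus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPSketch(nn.Module):
    """Minimal ASPP sketch: parallel atrous convolutions with different
    dilation rates plus image-level pooling, fused by a 1x1 convolution.
    Channel counts and dilation rates are illustrative assumptions."""
    def __init__(self, in_ch=512, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates]
        )
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

# Example: a 512-channel encoder feature map at 1/16 of an 800 x 288 input.
aspp = ASPPSketch()
y = aspp(torch.randn(1, 512, 18, 50))  # -> torch.Size([1, 256, 18, 50])
```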

2.2. Attentional Mechanism

The attention mechanism can be understood as a kind of weighting mechanism, which mainly comprises channel attention and spatial attention. Attention makes the network focus more on the channels or spatial pixels that are relevant to the goal. To name a few, Hu et al. [29] demonstrated a Squeeze-and-Excitation (SE) block that can conveniently be embedded into other networks and benefits channel feature learning. The SE block uses global average pooling and a gating mechanism to obtain weights for the different channels, and the weights are finally multiplied by the original feature map. This block is lightweight and only slightly increases the computational burden of the model. Woo et al. [30] showed a Convolutional Block Attention Module (CBAM) that combines channel attention and spatial attention and outperforms the SE block. In [31], a new variant of RESNET combining the SE block was reported, dubbed ResNeSt. Grouped convolution and split attention are applied in ResNeSt, which improves performance without increasing the computational burden and makes it more convenient to use as the backbone for other machine vision tasks.

2.3. Attention Distillation

Attention distillation is an extension of knowledge distillation. There are two types of spatial attention maps, i.e., activation-based and gradient-based [32]. It has been found that gradient-based attention distillation is somewhat more error-prone than activation-based attention distillation. Instead of using attention maps generated by a teacher network to train a student network as in [32], Ref. [24] revealed the possibility of distilling attention layer by layer within the same network. The SAD module requires no extra labels or supervision and trains the low-level attention maps to mimic the high-level attention maps, which helps lightweight networks achieve better results.

2.4. Depthwise Separable Convolution

Depthwise separable convolution [33] is a powerful operation to reduce the number of parameters and alleviate the computational cost. Firstly, a depthwise convolution is applied to adjust the resolution of the output feature map, and then a pointwise convolution is used to obtain the expected number of channels. Mobilenetv2 [34] adopted this module and has become a widely used and efficient lightweight network.
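A minimal PyTorch sketch of this two-step operation is given below, together with an illustrative parameter comparison; the channel sizes (64 input, 128 output) are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one 3x3 filter per input channel, groups=C)
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.pointwise(self.relu(self.depthwise(x)))

# Parameter comparison for C=64 -> N=128 with a 3x3 kernel:
# standard conv:  N*C*h*w     = 128*64*3*3      = 73,728
# separable conv: C*h*w + N*C = 64*3*3 + 128*64 = 8,768
standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
print(sum(p.numel() for p in standard.parameters()))   # 73728
print(sum(p.numel() for p in separable.parameters()))  # 8768
```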

3. Methodology

The part of a neural network that is mainly responsible for feature extraction is called the backbone. Generally, RESNET-101 or Xception is adopted as the backbone in the original Deeplabv3plus architecture, but such large multi-layer networks have a considerable number of parameters, which reduces speed. The deeplabv3plus architecture was therefore modified into a more lightweight version, while the proposed overall network still uses the encoder-decoder structure to better extract the features of the target. The decoder of the proposed method is the same as that of deeplabv3plus: the low-level features and the high-level features are fused by the concat operation to extract global information. The more lightweight RESNET-18 is applied as the backbone of the encoder, and the ASPP module is eliminated. Following [24], attention distillation is added only in the later training stage and takes up no memory during network operation. The supplementary ablation experiments on the determination of the network architecture are detailed in Section 4. The overall architecture is illustrated in Figure 2. Finally, the output of the trained network is a two-channel feature map: one channel is the binary map, and the other is its One-Hot coding. In this paper, the white area of the binary map represents the identified lanes, and the black area is the background. The backbone and the SAD module of the proposed method are described below.
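To illustrate the decoder behaviour described above, the sketch below concatenates a low-level feature map with the upsampled high-level feature map, refines the result with convolutions, and upsamples to the input resolution with a two-channel output. All channel counts and feature-map sizes are assumptions for illustration and are not the exact values of the proposed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Deeplabv3plus-style decoder: fuse low-level and high-level features,
    then upsample to the input resolution with a 2-channel output
    (lane vs. background). Channel counts are illustrative assumptions."""
    def __init__(self, low_ch=64, high_ch=256, num_classes=2):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, 48, 1, bias=False)  # compress low-level features
        self.refine = nn.Sequential(
            nn.Conv2d(48 + high_ch, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, low, high, out_size):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([self.reduce_low(low), high], dim=1)       # concat fusion
        return F.interpolate(self.refine(x), size=out_size,
                             mode="bilinear", align_corners=False)

# Low-level features at 1/4 resolution, high-level features at 1/16 resolution.
decoder = DecoderSketch()
low = torch.randn(1, 64, 72, 200)    # 288/4, 800/4
high = torch.randn(1, 256, 18, 50)   # 288/16, 800/16
out = decoder(low, high, out_size=(288, 800))  # -> torch.Size([1, 2, 288, 800])
```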

3.1. Backbone

The depthwise separable convolution is also used in the RESNET-18 to reduce the computational cost. The SE block follows the proposed residual module to remedy the accuracy loss that is caused by the reduction in the convolution layers. The details of the backbone are described below.
The conventional convolution operation could be simply expressed as:
$u_{std} = w * X,$
where $*$ refers to the convolution operation, $w$ denotes the convolution kernel, $w \in \mathbb{R}^{N \times C \times h \times w}$, $X$ represents the input feature map, $X \in \mathbb{R}^{C \times H \times W}$, and $u_{std} \in \mathbb{R}^{N \times H \times W}$ is the output feature map. The parameter number of the conventional convolution is $P_1 = N \times C \times h \times w$.
The depthwise separable convolution operation can be written as follows:
$u = w_1 * X,$
$u_{sep} = w_2 * \delta(u),$
where $w_1$ and $w_2$ are the convolution kernels of the depthwise and pointwise convolutions and $\delta$ is the RELU function, with $w_1 \in \mathbb{R}^{C \times 1 \times h \times w}$, $w_2 \in \mathbb{R}^{N \times C \times 1 \times 1}$, $u \in \mathbb{R}^{C \times H \times W}$, and $u_{sep} \in \mathbb{R}^{N \times H \times W}$. The parameter number of the depthwise separable convolution is $P_2 = C \times 1 \times h \times w + N \times C \times 1 \times 1$. Since the number of channels $N$ in the output feature map is generally large, $P_2$ is considerably smaller than $P_1$. In this paper, the convolution operations in the standard residual block are replaced by depthwise separable convolutions, as illustrated in Figure 3. The residual connection was proposed by [13] and may, to a certain extent, address the performance degradation caused by increasing the number of network layers. The comparison of the parameter numbers of the different networks is shown in Table 1.
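The replacement described above (Figure 3) can be sketched as a residual block whose 3 × 3 convolutions are swapped for depthwise separable convolutions; the batch normalization placement and the channel sizes in the example are assumptions for illustration, not the exact block configuration of the proposed backbone.

```python
import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch, stride=1):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class SepResidualBlock(nn.Module):
    """Residual block with the standard 3x3 convolutions replaced by
    depthwise separable convolutions (cf. Figure 3, right)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = sep_conv(in_ch, out_ch, stride)
        self.conv2 = sep_conv(out_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when the identity path changes shape
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.shortcut(x))

block = SepResidualBlock(64, 128, stride=2)
y = block(torch.randn(1, 64, 72, 200))  # -> torch.Size([1, 128, 36, 100])
```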
Comparing the SE block and the CBAM block quantitatively, it is found that the former achieves better performance than the latter. Specific experimental details and results are given in the ablation experiments in Section 4. Consequently, the SE block is introduced in the attention module, and its structure is demonstrated in Figure 4. Specifically, global average pooling is performed on the output of the depthwise separable convolution layer for input feature map $X$, $u_{sep} = f_{sep}(X)$, and this operation can be expressed as:
$u_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{sep}(i,j), \quad u_c \in \mathbb{R}^{C \times 1 \times 1}.$
Then, the weight coefficient of each channel is obtained by two fully connected layers and a sigmoid function:
$u_s = \sigma(w_{f2} * \delta(w_{f1} * u_c)),$
where $\sigma$ refers to the sigmoid function, $w_{f1}$ and $w_{f2}$ are the fully connected layers, $w_{f1} \in \mathbb{R}^{\frac{C}{r} \times C}$, $w_{f2} \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the reduction ratio. Finally, the input $u_{sep}$ of the attention module is channel-wise multiplied by the learned channel weights $u_s$:
$\hat{X} = u_s \otimes u_{sep}.$
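The attention module above maps directly to a short PyTorch sketch: global average pooling produces $u_c$, two fully connected layers and a sigmoid produce the channel weights $u_s$, and the weights rescale the input channel-wise. The reduction ratio $r = 16$ is the common default of the SE paper and is an assumption here, since the text does not state the value used.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze by global average pooling,
    excite with two FC layers and a sigmoid, then rescale the input
    channel-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                           # u_c
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),   # w_f1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),   # w_f2
            nn.Sigmoid(),                                             # u_s
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                        # channel-wise rescaling

se = SEBlock(128)
y = se(torch.randn(1, 128, 36, 100))  # same shape, channels re-weighted
```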

3.2. SAD

The SAD module is mainly active in the training stage; it helps the whole model converge quickly and strengthens the feature representation ability of the network. Following [24], activation-based attention distillation is added to our network architecture. The final output feature map of each layer in the backbone is denoted $\hat{X}_m$, where $m$ is the layer index, $m = 1, 2, 3, 4$. The attention map generated from $\hat{X}_m$ is:
$A_m = f_{attention}(\hat{X}_m) = \Phi\left(B\left(\sum_{c=1}^{C_m} |\hat{X}_m|^p\right)\right),$
where $C_m$ represents the number of channels at layer $m$, $B(\cdot)$ refers to the bilinear upsampling operation, $\Phi(\cdot)$ refers to the SOFTMAX operation, $p$ is a positive number, $\hat{X}_m \in \mathbb{R}^{C_m \times H_m \times W_m}$, and $A_m \in \mathbb{R}^{H \times W}$. Firstly, the absolute values of each channel of $\hat{X}_m$, raised to the power $p$, are summed to reduce the 3D feature map to a 2D one (reference [24] reports better performance gains with $p = 2$); then bilinear upsampling is used to give the attention maps of different layers the same resolution. Eventually, the SOFTMAX operation scales these values into proportions. The distillation loss between two adjacent layers is:
$Loss_{distil} = \sum_{m=1}^{M-1} L_2(A_m, A_{m+1}),$
where $L_2$ is the least-squares-error loss function and $M$ is the largest layer index, $M = 4$.
In addition to the distillation loss, a cross-entropy loss is applied in the loss function. Only two kinds of objects, background and lanes, need to be identified by the network. The cross-entropy loss is formulated as:
$Loss_{seg} = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}),$
where $y$ denotes the distribution of the ground-truth labels and $\hat{y}$ denotes the distribution predicted by the model. The total loss comprises the two terms above:
$Loss = a \cdot Loss_{seg} + b \cdot Loss_{distil}.$
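Under the assumption of four backbone stages with feature maps of different sizes, the attention maps and the two loss terms can be sketched as follows. The loss weights a = 0.4 and b = 0.1 are the values reported in Section 4, while the feature-map shapes and the use of the deeper layer's map as a detached distillation target are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_map(feat, size, p=2):
    """Collapse channels by the sum of |x|^p, upsample to a common
    resolution, then apply a spatial softmax."""
    a = feat.abs().pow(p).sum(dim=1, keepdim=True)                 # B x 1 x h x w
    a = F.interpolate(a, size=size, mode="bilinear", align_corners=False)
    b, _, h, w = a.shape
    return F.softmax(a.view(b, -1), dim=1).view(b, 1, h, w)

def total_loss(stage_feats, lane_prob, target, size=(36, 100), a=0.4, b=0.1):
    """Weighted sum of the segmentation loss and the distillation loss."""
    maps = [attention_map(f, size) for f in stage_feats]           # A_1 ... A_M
    distil = sum(F.mse_loss(maps[m], maps[m + 1].detach())         # L2 between adjacent layers
                 for m in range(len(maps) - 1))
    seg = F.binary_cross_entropy(lane_prob, target)                # cross-entropy term
    return a * seg + b * distil

# Toy example: four stage outputs with decreasing resolution, plus a lane-probability map.
feats = [torch.randn(1, c, h, w) for c, h, w in
         [(64, 72, 200), (128, 36, 100), (256, 18, 50), (512, 9, 25)]]
lane_prob = torch.rand(1, 1, 288, 800)
target = (torch.rand(1, 1, 288, 800) > 0.5).float()
loss = total_loss(feats, lane_prob, target)
```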

4. Experiments

To verify the effectiveness of the proposed method, extensive experiments on the CULane dataset are reported in this section. Ablation experiments are also performed to find the optimal combination of the introduced modules.

4.1. Implementation Details

During the training stage, an Adaptive Moment Estimation (ADAM) [35] optimizer is used. Specifically, the update formulas are as follows:
$g_t = \nabla_\theta f_t(\theta_{t-1}),$
$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t,$
$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2,$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t},$
$\hat{v}_t = \frac{v_t}{1 - \beta_2^t},$
$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$
in which $g_t$ is the gradient of the objective function at time step $t$, $m_t$ is the exponentially weighted first moment of the historical gradients, $v_t$ is the exponentially weighted second moment of the historical gradients, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected versions of $m_t$ and $v_t$, $\theta_t$ are the updated parameters, $\epsilon$ is an extremely small positive number, and $\alpha$ is the learning rate. Sharp oscillations during the gradient update are attenuated by $m_t$, while $v_t$ makes the update process smoother. The bias corrections $\hat{m}_t$ and $\hat{v}_t$ compensate for the initialization of $m_t$ and $v_t$ at 0. In short, ADAM uses first- and second-moment estimates of the gradient to dynamically adjust the learning rate of each parameter. In the process of training the neural network, most matrix operations can be processed by a GPU with strong performance. All of our models are trained and verified with PyTorch and an NVIDIA RTX 4000 GPU. All input images are resized to 800 × 288 to save GPU memory. Parameters a and b of the loss function are set to 0.4 and 0.1, respectively.
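The update above can be written as a single-step function. The learning rate and the decay factors β1 = 0.9 and β2 = 0.999 below are the common ADAM defaults and are assumptions, since the exact hyperparameters are not reported here; in practice the same update is provided by torch.optim.Adam.

```python
import torch

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update following the formulas above: exponential moving
    averages of the gradient and squared gradient, bias correction, then
    a parameter step scaled by the corrected moments."""
    m = beta1 * m + (1 - beta1) * grad                 # first moment m_t
    v = beta2 * v + (1 - beta2) * grad ** 2            # second moment v_t
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)                       # bias-corrected v_t
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)  # parameter update
    return theta, m, v

# In practice the built-in optimizer performs the same update:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
theta = torch.tensor([0.5, -0.3])
m = torch.zeros_like(theta)
v = torch.zeros_like(theta)
grad = torch.tensor([0.2, -0.1])
theta, m, v = adam_step(theta, grad, m, v, t=1)
```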
Two commonly used evaluation metrics, accuracy and Mean-Intersection-Over-Union (MIOU), are applied to evaluate the performance of the neural networks. The accuracy is calculated as:
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$
MIOU is the mean, over the classes, of the ratio of the intersection to the union between the ground truth and the prediction. Since only the background and the target lanes are considered in these experiments, the MIOU is computed as follows:
$MIOU = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right),$
where $TP$ is the number of true positives, $FP$ the false positives, $TN$ the true negatives, and $FN$ the false negatives. The training set and validation set of the CULane dataset are used in the verification experiments in this paper. The CULane dataset was first published by [11] and has since been widely used in the literature. A description of the dataset is given in Table 2. Most of the CULane validation set used in our paper consists of simple urban traffic scenes with few obstacles and shadows on the road.
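As a small worked example of the two metrics, the following sketch computes accuracy and the two-class MIOU from an assumed pixel-level confusion matrix (the counts are illustrative, not measured values).

```python
def accuracy_and_miou(tp, tn, fp, fn):
    """Pixel accuracy and the two-class MIOU as defined above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    iou_lane = tp / (tp + fp + fn)          # IOU of the lane class
    iou_background = tn / (tn + fp + fn)    # IOU of the background class
    return acc, (iou_lane + iou_background) / 2

# Assumed pixel counts for one image (illustrative only):
acc, miou = accuracy_and_miou(tp=9_000, tn=210_000, fp=4_000, fn=7_000)
print(f"accuracy={acc:.4f}, MIOU={miou:.4f}")  # accuracy=0.9522, MIOU=0.7001
```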

4.2. Ablation Experiments

In order to determine whether the SE module or the CBAM module is more appropriate for this study, a comparison experiment between them is conducted under the same conditions: only the SE block is replaced by the CBAM block in the proposed attention block, and all other conditions remain unchanged. The structure of the CBAM block is shown in Figure 5 and the result is presented in Table 3.
The effectiveness of the introduced modules (depthwise separable convolution, ASPP, SE, CBAM, and SAD) is then investigated in the ablation experiments in order to find the optimal combination. All the networks with different combinations are trained for the same number of epochs on the training set under the same working environment. The results of the experiments are shown in Table 4, which also demonstrates the effectiveness of the proposed architecture.

4.3. Results

The runtimes of several different methods are shown in Table 5. The runtime of the proposed method is computed as the average of 100 runs. It is worth noting that our approach achieves the shortest runtime among the compared methods.
For the same dataset, the comparison of the MIOU between the proposed method and the standard deeplabv3plus with RESNET-18 is shown in Table 6. An image from the dataset recognized by the proposed method is depicted in Figure 6, where the improvement brought by the proposed method can be seen intuitively.
As can be seen in Figure 7, although both the proposed method and deeplabv3plus with RESNET18 converge after a similar number of iterations, the former reaches smaller loss values, indicating better network performance. The comparison of the feature maps generated by the first convolution layer of the proposed method and of deeplabv3plus with RESNET18 is shown in Figure 8. According to the visualization results, there are more bright regions in the feature maps of the proposed method, which suggests that the method improves the feature expression ability of the network.

5. Conclusions

According to the results of the verification experiments, the network structure proposed in this paper is feasible to a certain extent, and the performance improvement provided by the attention mechanism and attention distillation is demonstrated in the ablation experiments. As can be seen from Table 1, Table 4 and Table 5, although the proposed method reduces the number of parameters, it still achieves good performance (97.49% accuracy and an 8.7 ms runtime), which may be due to the gains that the attention mechanism and attention distillation bring to the network. From Figure 6, it can also be seen intuitively that the proposed method detects lane lines that deeplabv3plus with RESNET18 cannot detect. Although the proposed network uses few backbone layers, the attention mechanism and attention distillation improve its accuracy with only a small amount of additional computation. Therefore, we argue that attention-related algorithms may be a feasible way to maintain accuracy and improve real-time performance when the number of network parameters is reduced. From the perspective of economy and efficiency, lightweight neural networks combined with attention mechanisms may become a future development direction for lane detection. However, due to space constraints, further tests on other evaluation metrics (such as F1-measure, Mean Average Precision, etc.) are not carried out, and the network output is only a simple binary map. In the future, we plan to add a lane line fitting method to project the detected lanes onto the input image. Using attention mechanisms to improve the accuracy of the network in extreme cases, such as heavy shadow, weak light, and night scenes, will also become our focus.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W. and Y.Z. (Yin Zhao); software, Z.W.; validation, Z.W. and Y.Z. (Yin Zhao); formal analysis, Z.W.; writing—original draft preparation, Y.Z. (Yahui Zhang); writing—review and editing, Y.T. and L.G.; visualization, Y.Z. (Yahui Zhang); supervision, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Yousri, R.; Elattar, M.A.; Saeed Darweesh, M. A deep learning-based benchmarking framework for lane segmentation in the complex and dynamic road scenes. IEEE Access 2021, 9, 117565–117580.
2. Haixia, L.; Xizhou, L. Flexible lane detection using CNNs. In Proceedings of the 2021 International Conference on Computer Technology and Media Convergence Design (CTMCD 2021), Sanya, China, 23–25 April 2021; pp. 235–238.
3. Zhang, Y.; Lu, Z.; Ma, D.; Xue, J.H.; Liao, Q. Ripple-GAN: Lane line detection with ripple lane line detection network and Wasserstein GAN. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1532–1542.
4. Narote, S.P.; Bhujbal, P.N.; Narote, A.S.; Dhane, D.M. A review of recent advances in lane detection and departure warning system. Pattern Recognit. 2018, 73, 216–234.
5. Borkar, A.; Hayes, M.; Smith, M.T. A novel lane detection system with efficient ground truth generation. IEEE Trans. Intell. Transp. Syst. 2012, 13, 365–374.
6. Chen, J.; Ruan, Y.; Chen, Q. A precise information extraction algorithm for lane lines. China Commun. 2018, 15, 210–219.
7. Liu, F.; Zhang, Z.; Zhou, R. Automatic modulation recognition based on CNN and GRU. Tsinghua Sci. Technol. 2022, 27, 422–431.
8. Cai, W.; Wang, Y.; Ma, J.; Jin, Q. CAN: Effective cross features by global attention mechanism and neural network for ad click prediction. Tsinghua Sci. Technol. 2022, 27, 186–195.
9. Le, T.N.; Ono, S.; Sugimoto, A.; Kawasaki, H. Attention R-CNN for accident detection. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 313–320.
10. Deng, H.; Zhang, Y.; Li, R.; Hu, C.; Feng, Z.; Li, H. Combining residual attention mechanisms and generative adversarial networks for hippocampus segmentation. Tsinghua Sci. Technol. 2022, 27, 68–78.
11. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as deep: Spatial CNN for traffic scene understanding. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), New Orleans, LA, USA, 2–7 February 2018; pp. 7276–7283.
12. Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 286–291.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
14. Oğuz, E.; Küçükmanisa, A.; Duvar, R.; Urhan, O. A deep learning based fast lane detection approach. Chaos Solitons Fractals 2022, 155, 111722.
15. Zou, Q.; Jiang, H.; Dai, Q.; Yue, Y.; Chen, L.; Wang, Q. Robust lane detection from continuous driving scenes using deep neural networks. IEEE Trans. Veh. Technol. 2020, 69, 41–54.
16. Sak, H.; Senior, A.; Beaufays, F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv 2014, arXiv:1402.1128.
17. Zhou, H.; Wang, H.; Zhang, H.; Hasith, K. LaCNet: Real-time end-to-end arbitrary-shaped lane and curb detection with instance segmentation network. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020; pp. 184–189.
18. Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; Wei, Y. Deep feature flow for video recognition. In Proceedings of the 2017 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4141–4150.
19. Zhu, X.; Dai, J.; Yuan, L.; Wei, Y. Towards high performance video object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7210–7218.
20. Liu, M.; Zhu, M.; White, M.; Li, Y.; Kalenichenko, D. Looking fast and slow: Memory-guided mobile video object detection. arXiv 2019, arXiv:1903.10172.
21. Li, X.; Huang, Z.; Sun, X.; Liu, T. A fast detection method for polynomial fitting lane with self-attention module added. In Proceedings of the 2021 10th International Conference on Control, Automation and Information Sciences (ICCAIS), Xi'an, China, 14–17 October 2021; pp. 46–51.
22. Qin, Z.; Wang, H.; Li, X. Ultra fast structure-aware deep lane detection. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 276–291.
23. Ye, Z. Fast multi-direction lane detection. In Proceedings of the 2021 2nd International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 28–29 January 2021; pp. 301–304.
24. Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning lightweight lane detection CNNs by self attention distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1013–1021.
25. Guo, J.-M.; Markoni, H. Deep learning based lane line detection and segmentation using slice image feature. In Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Hualien City, Taiwan, 16–19 November 2021; pp. 1–2.
26. Liu, W.; Yan, F.; Tang, K.; Zhang, J.; Deng, T. Lane detection in complex scenes based on end-to-end neural network. In Proceedings of the 2020 Chinese Automation Congress (CAC 2020), Shanghai, China, 6–8 November 2020; pp. 4300–4305.
27. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
28. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
29. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
31. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; et al. ResNeSt: Split-attention networks. arXiv 2020, arXiv:2004.08955.
32. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–13.
33. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
34. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
35. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
Figure 1. The original deeplabv3plus architecture.
Figure 2. The overall architecture of the proposed method.
Figure 3. The standard residual block of the raw RESNET-18 is on the left and the basic residual block of the proposed RESNET-18 is on the right.
Figure 4. The attention module.
Figure 5. The CBAM block (FC represents the fully connected layers). Note the omission of some activation functions.
Figure 6. The original input image and the recognition result. (a) input image; (b) deeplabv3plus with RESNET18; (c) the proposed method.
Figure 7. The loss values of different networks on the CULane train set. "Loss" denotes the $Loss_{seg}$.
Figure 8. The comparison of the feature maps generated from the first convolution layer: (a) the proposed method; (b) deeplabv3plus with RESNET18.
Table 1. The comparison of the parameters number among the different networks. The depthwise separable convolution is used in the RESNET-18_sep. The depthwise separable convolution and the SE are applied in the RESNET-18_sep (SE).

Name             RESNET-101   RESNET-18   RESNET-18_sep   RESNET-18_sep (SE)
Parameters (m)   13.22        11.18       1.46            1.55
Table 2. Dataset detail.

Frame     Train    Validation   Test     Resolution
133,235   88,880   9,675        34,680   1640 × 590
Table 3. Comparison between the SE block and the CBAM block on the CULane validation set.

Type           SE      CBAM
Accuracy (%)   97.49   97.33
Table 4. Comparison among the different combinations of the introduced modules on the CULane validation set. The baseline means the RESNET-18.

Baseline   Depthwise Separable Conv   SE   SAD   ASPP   Accuracy (%)
95.83
96.86 (+1.03)
97.26 (+1.43)
97.31 (+1.48)
97.35 (+1.52)
97.49 (+1.66)
Table 5. The comparison of the runtime.

Type              Run Time (ms)
RESNET-101        34.3
SCNN [11]         133.5
SAD [24]          13.4
Proposed method   8.7
Table 6. The comparison of the MIOU between the standard deeplabv3plus with RESNET-18 and the proposed method.

Type              MIOU (%)
Deeplabv3plus     46.8
Proposed method   60.0
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
