Article

Enhancing Suburban Lane Detection Through Improved DeepLabV3+ Semantic Segmentation

1 School of Mechanical and Automotive Engineering, Guangxi University of Science and Technology, Liuzhou 545006, China
2 School of Mechanics and Vehicles, Beijing Institute of Technology, Beijing 100081, China
3 Liuzhou Wuling New Energy Automobile Co., Ltd., Liuzhou 545006, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2865; https://doi.org/10.3390/electronics14142865
Submission received: 16 June 2025 / Revised: 11 July 2025 / Accepted: 15 July 2025 / Published: 17 July 2025

Abstract

Lane detection is a key technology in automatic driving environment perception, and its accuracy directly affects vehicle positioning, path planning, and driving safety. In this study, an enhanced real-time model for lane detection based on an improved DeepLabV3+ architecture is proposed to address the challenges posed by complex dynamic backgrounds and blurred road boundaries in suburban road scenarios. To address the lack of feature correlation in the traditional Atrous Spatial Pyramid Pooling (ASPP) module of the DeepLabV3+ model, we propose an improved LC-DenseASPP module. First, inspired by DenseASPP, the number of dilated convolution layers is reduced from six to three by adopting a dense connection to enhance feature reuse, significantly reducing computational complexity. Second, the convolutional block attention module (CBAM) attention mechanism is embedded after the LC-DenseASPP dilated convolution operation. This effectively improves the model’s ability to focus on key features through the adaptive refinement of channel and spatial attention features. Finally, an image-pooling operation is introduced in the last layer of the LC-DenseASPP to further enhance the ability to capture global context information. DySample is introduced to replace bilinear upsampling in the decoder, ensuring model performance while reducing computational resource consumption. The experimental results show that the model achieves a good balance between segmentation accuracy and computational efficiency, with a mean intersection over union (mIoU) of 95.48% and an inference speed of 128 frames per second (FPS). Additionally, a new lane-detection dataset, SubLane, is constructed to fill the gap in the research field of lane detection in suburban road scenarios.

1. Introduction

Advanced driver-assistance systems (ADAS) utilize cameras to monitor the road environment in real time, providing functions such as lane keeping, emergency braking, and adaptive cruise control, which significantly reduce the occurrence of traffic accidents [1]. In terms of driving experience, lane-detection technology also plays an indispensable role. Lane changing is a cause of passenger discomfort, and precise lane-detection technology can optimize the logic of lane changing, reduce unnecessary or sudden lane-changing behaviors, make vehicle behaviors conform to the expected logic of human drivers, and thus reduce the confusion and discomfort brought to passengers by the vehicle’s “non-traditional operations” [2]. Lane semantic segmentation technology, as a critical component in the evolution from assisted driving to autonomous driving, can segment lane information in real time, helping vehicles plan paths more accurately and make safer driving decisions [3,4]. This high-precision road-recognition capability not only enhances the adaptability and decision-making efficiency of vehicles in complex road environments but also provides a solid foundation for the reliability of autonomous driving systems in the future [5]. Vision-based lane-detection technologies primarily target highly structured road environments, such as highways and urban arterial roads. These roads have standardized road markings, uniform background features, and clear geometric constraints. However, current technologies still face numerous theoretical and practical challenges in unstructured road scenarios [6]. Unstructured roads dominate suburban and rural road networks and are characterized by missing or faded road markings, diminished boundary features, and severe dynamic background interference. The environmental composition of suburban roads is complex and fundamentally different from that of urban roads. Traditional lane-detection methods, which are based on the assumption of structured roads, are difficult to apply directly to suburban scenarios [7]. Therefore, it is necessary to develop more robust and adaptive detection algorithms to address the unique challenges posed by suburban roads. Currently, lane-detection technologies for suburban roads face the following challenges:
  • Similarity of background features: dynamic background elements, such as building exterior walls whose colors closely resemble the road surface, distant mountains or rock formations with textures similar to the pavement, and roadside farmland crops with road-like appearance, create significant semantic confusion with the road area in the visual features, seriously diverting the model’s attention away from the target lane.
  • Blurring of geometric features: Blurred road boundaries, complex pavement textures, and lanes of various shapes make it difficult for the feature extraction network to establish effective spatial context associations, significantly reducing the reliability of the model’s environmental perception.
Traditional road detection algorithms typically rely on predefined feature assumptions, using fixed thresholds or rules to identify road areas, which makes it difficult to ensure stability and reliability under varying environmental conditions. In contrast, deep-learning-based road detection technologies have achieved revolutionary breakthroughs using convolutional neural networks (CNNs) [8]. CNNs can automatically learn the nonlinear topological features and high-level semantic information of road surfaces through a multilevel feature extraction mechanism without relying on manually preset geometric constraints or color assumptions. This powerful learning capability and scene adaptability feature make deep-learning-based road detection methods significantly superior to traditional methods in terms of accuracy, robustness, and generalization [9,10]. However, deep-learning methods require substantial computational resources during training and inference, making real-time deployment on mobile or embedded devices with limited resources challenging [11]. Therefore, reducing the computational complexity of the model while maintaining detection accuracy and enabling efficient operation in resource-constrained environments has become an important research direction. In the model deployment process, lightweight backbone network architectures, such as EfficientNet-B0 [12], MobileNetV2 [13], and ResNet-18 [14], can be adopted to address the challenges of limited computational resources and model performance requirements. These networks significantly reduce the computational complexity of the model while maintaining its strong feature extraction capabilities. To address the issue of background interference in complex scenes, advanced attention mechanisms can be introduced, including the focusing attention mechanism (FAM) [15], external attention mechanism (EAM) [16], and convolutional block attention module (CBAM) [17]. These mechanisms can effectively enhance the model’s feature response to the target areas while suppressing noise and irrelevant information [18,19,20,21]. In handling boundary blur scenarios, deep-learning-based semantic segmentation networks, such as PSP-Net [22], SegFormer [23], and DeepLabV3+ [24], have demonstrated superior performance. However, it is important to note that these methods typically require a large amount of annotated data for training and have high computational resource demands. Therefore, a balance between performance and resource consumption must be considered for practical applications [25,26].
In summary, this study proposes a lane and road detection method based on an improved DeepLabV3+ architecture, utilizing the lightweight MobileNetV2 network as the backbone. The innovative LC-DenseASPP module replaces the traditional ASPP [27] module, enhancing the network’s multi-scale feature fusion capability by optimizing the architecture of dilated convolution layers and incorporating the CBAM. Additionally, a dynamic upsampling module, DySample [28], is introduced, which leverages dynamic convolution kernels to optimize the upsampling process and improve the boundary segmentation accuracy. The main contributions of this study are as follows:
  • An improved DeepLabV3+ lane-detection model is proposed. In the encoder, the traditional ASPP module is replaced with an innovative LC-DenseASPP module, and the DySample module is introduced in the decoder. Experimental results demonstrate that the proposed model exhibits excellent performance in both detection accuracy and real-time capability.
  • Owing to the lack of publicly available datasets for suburban roads, we collected images of road scenes from suburban and rural areas in Liuzhou, Guangxi, and created a suburban road lane segmentation dataset named SubLane. These scenes were captured under clear weather conditions and primarily included roads with similar background features and roads with blurry geometric characteristics.
The remainder of this paper is organized as follows: Section 2 briefly reviews the related work. Section 3 introduces the SubLane dataset. Section 4 describes the proposed method in detail. Section 5 presents the experimental setup and results for the datasets. Finally, Section 6 concludes the study.

2. Related Work

Currently, lane-detection algorithms can be primarily categorized into two major classes: traditional methods and deep-learning-based methods. Among the mainstream techniques in deep-learning-based approaches, four main categories can be identified: keypoint-based methods, polynomial regression-based methods, object detection-based methods, and image segmentation-based methods. Each of these methods possesses unique characteristics, demonstrating distinct advantages in terms of detection accuracy, real-time performance, and robustness. This section will focus on these four mainstream techniques, systematically reviewing and analyzing the developmental status and innovative progress of related research work.

2.1. Keypoint-Based Methods

In studies on lane-detection methods based on keypoint detection, researchers have achieved precise detection and reconstruction of lane lines by locating the keypoints of lane lines. Early works, such as PINet [29], introduced geometric constraints between keypoints and further improved the detection robustness in complex scenes by designing an adaptive mechanism for selecting keypoints. In recent years, some studies have explored the combination of keypoint detection and graph neural networks (GNNs) [30]. For example, LaneATT [31] proposed a lane-detection method based on an attention mechanism and graph structure model, which constructs a graph structure for lane lines and utilizes graph convolutional networks to capture dependencies between keypoints, significantly enhancing the prediction capability of the lane topology. Additionally, to address the issue of scale variation in keypoint detection, the UFLD [32] proposed a multiscale feature fusion strategy that aggregates feature maps from different levels to improve the accuracy of keypoint localization. Although keypoint detection-based methods have advantages in terms of precision and flexibility, they still face challenges, such as high annotation costs for keypoints and inaccurate keypoint localization in complex scenes.

2.2. Polynomial Regression-Based Methods

For lane detection based on polynomial regression, researchers directly fit the polynomial coefficients of the lane to achieve accurate lane detection and reconstruction. PolyLaneNet [33] optimizes the polynomial regression fitting process by incorporating geometric prior knowledge of lane lines, thereby enhancing the robustness of the model in complex scenarios. Building on this, the PGA-Net [34] proposes a novel global attention mechanism combined with a mean curvature loss function, further improving the accuracy of polynomial regression. This method captures the global contextual information of lane lines through a global attention mechanism and optimizes the geometric shape of lane lines using a curvature loss function. PolyLaneNet++ [35], which is based on PolyLaneNet, introduces a spatiotemporal fusion mechanism by integrating lane-line information from multiple frames, thereby enhancing the adaptability of the model to dynamic scenes. This approach leverages the temporal changes in lane lines to improve detection stability and accuracy. Additionally, PRNet [36] proposes a polynomial regression network suitable for variable lane number detection, addressing the issue of uncertain lane numbers in lane detection by dynamically adjusting the polynomial order and number of lanes. This method adaptively adjusts the model output based on the actual number of lanes in the scene, thereby enabling more flexible lane detection under complex road conditions. Although polynomial regression-based methods are very flexible and efficient for lane detection, challenges such as polynomial order selection and inaccurate fitting in complex scenes remain.

2.3. Detection-Based Methods

In research on lane-detection methods based on object detection, researchers have achieved precise localization and reconstruction of lane lines by directly detecting keypoints or segments of the lanes. Early work, such as Line-CNN [37], proposed an end-to-end deep-learning network architecture that extracts lane features using convolutional neural networks (CNN) and directly outputs lane segments through a line proposal unit (LPU), avoiding complex post-processing steps and significantly improving detection efficiency. Subsequently, CondLaneNet [38] further optimized the lane-detection process by introducing conditional convolution, implementing a coarse-to-fine lane-detection framework, and enhancing the model’s robustness in complex scenarios. Some studies have explored the integration of multitask learning with object detection. For example, CLRNet [39] proposed a cross-locality relation network that captures the local relationships of lane lines across different video frames, effectively addressing the temporal consistency issue in video-based lane detection and further improving the detection accuracy. These methods combine object detection with geometric constraints and propose a lane-detection approach suitable for autonomous driving scenarios, which can achieve high-precision lane localization under complex road conditions. Although object detection-based methods demonstrate high performance in lane detection, they still face challenges, such as inaccurate detection in complex scenes and occlusion issues.

2.4. Segmentation-Based Methods

In studies on lane-detection methods based on semantic segmentation, researchers have achieved accurate detection and segmentation of lane lines through pixel-level classification. LaneNet [40] proposes a two-branch network structure based on instance segmentation, where one branch is used for pixel-level semantic segmentation and the other is used to distinguish different lane instances. By transforming the lane-detection problem into an instance segmentation task, this method can effectively deal with multi-lane scenes and perform well in real time. Subsequently, SCNN [41] proposed a spatial convolutional neural network that enhanced the continuity of the lane-line structure by performing horizontal and vertical information transfer on the feature map and significantly improved the detection accuracy in complex scenes. In recent years, several studies have explored the combination of semantic segmentation and geometric fitting techniques. For example, Wouter Van Gansbeke et al. [42] proposed an end-to-end lane-detection method that combines semantic segmentation with a differentiable least squares fitting method to optimize the geometric fitting effect of lane lines directly in a deep-learning framework. This method not only improves the accuracy of lane detection but also enhances the robustness of the model against noise and occlusion. In addition, Donghoon Chang et al. [43] combined instance segmentation and an attention mechanism to aggregate the global information of lane lines using a voting mechanism, which significantly improved the accuracy and stability of multi-lane detection. Future research may further explore multitask learning, self-supervised learning, and more efficient network architecture design to address these challenges and drive the further development of lane-detection technology.
Existing studies on lane detection have made significant progress in structured road scenarios, but two key shortcomings remain in the unstructured scenario of suburban roads: first, the lack of datasets, as mainstream publicly available datasets mainly focus on urban roads and highways; and second, the balance between model lightweighting and high precision, as high-precision models in existing deep-learning methods usually have a large number of parameters and slow inference, making them difficult to deploy on in-vehicle embedded devices.

3. Dataset and Evaluation Criteria

3.1. SubLane Dataset

At present, TuSimple, CULane, BDD-100k, CurveLanes, ApolloScape, Cityscapes, and other public and authoritative lane-detection datasets are widely used in research on urban road and highway scenes. These datasets provide important support for the development of lane-detection algorithms and autonomous driving technology and have achieved remarkable results on highly structured urban roads and expressways [44]. However, there remains a significant data gap in lane-detection research for suburban road scenes. The suburban road environment has unique characteristics, such as different road widths, fuzzy and irregular lane markings, and dynamic diversity of the background environment. These characteristics make it difficult for existing datasets to fully cover the complex scenes of suburban roads. Publicly available datasets for lane detection on suburban roads are insufficiently reliable and representative. This lack of data seriously restricts the application and development of autonomous driving technology in suburban road environments, particularly because of the higher requirements for algorithm robustness and adaptability owing to the complexity and variability of suburban roads. To fill this research gap and promote the development of autonomous driving technology in suburban road scenarios, we constructed a lane-detection dataset, SubLane, specifically for suburban road scenarios. The construction process of SubLane fully considered the particularity and complexity of suburban road environments with the goal of providing a high-quality and diverse data resource for the research community to use.
Specifically, the data were collected in Yufeng District, Liuzhou, Guangxi, China. We designed and deployed a professional data-acquisition vehicle system for real-vehicle data acquisition. As shown in Figure 1, the system is equipped with a high-resolution camera capable of comprehensively capturing road-feature information. By optimizing the mounting position and angle of the camera, we ensured that the captured images clearly and completely reflect the actual situation of the road. In the process of building the dataset, we adopted an incremental strategy: we first built a basic version of the dataset to ensure that it had a reasonable structure and basic functionality. Subsequently, through iterative optimization and expansion, the content and dimensions of the dataset were gradually enriched to meet the in-depth needs of subsequent research. This phased approach not only ensures the controllability of the initial work but also reserves sufficient space for future expansion. During data collection, we strictly controlled the environmental conditions to ensure the consistency and reliability of the dataset. All image data were collected under clear weather and good lighting conditions to avoid fluctuations in image quality caused by changes in weather or illumination. This design not only improves the utility of the dataset but also ensures that the data can truly reflect the typical scenes of suburban roads. To further enrich the diversity of the dataset, relevant suburban road images were obtained from public network resources and fused with the original dataset after screening. This mixed-data strategy effectively improves the scene coverage of the dataset. Figure 2 shows a partial image of the dataset. In the future, we plan to further expand the scale and coverage of the dataset to include more complex scenes (such as night and rainy days) to improve the universality and practicability of the dataset and provide stronger support for the full implementation of autonomous driving technology.
In the data annotation stage, we used the Labelme tool to manually annotate the collected images in fine detail to ensure the accuracy and reliability of the annotation results. Specifically, we classified the elements in each image into two categories: lane regions (labeled red) and background regions (labeled black). The dataset format used in this study was the VOC dataset format [45], a widely recognized standard in the field of computer vision that is used extensively in tasks such as object detection, image classification, and semantic segmentation. The uniformity of the VOC format makes it easy to compare and reproduce results across different studies, and its clear directory structure and standardized annotation format facilitate model training and evaluation. Based on the above annotations and format specifications, our final constructed SubLane dataset comprised 2301 high-quality suburban road images, each with a resolution of 1920 × 1080. To ensure scientific model training and evaluation, we split the dataset into training and test sets at a ratio of 8:2. Of these, 1840 images were used for model training and 461 images were used for testing.
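For concreteness, the following minimal Python sketch reproduces the 8:2 split described above (2301 images into 1840 training and 461 test images); the VOC-style directory layout, file names, and seed are illustrative assumptions rather than the exact scripts used for SubLane.

```python
import random
from pathlib import Path

random.seed(11)  # illustrative seed for a reproducible split
images = sorted(Path("SubLane/JPEGImages").glob("*.jpg"))  # hypothetical VOC-style layout
random.shuffle(images)

split = int(len(images) * 0.8)          # 2301 images -> 1840 train / 461 test
train_set, test_set = images[:split], images[split:]

Path("SubLane/ImageSets/Segmentation").mkdir(parents=True, exist_ok=True)
Path("SubLane/ImageSets/Segmentation/train.txt").write_text(
    "\n".join(p.stem for p in train_set))
Path("SubLane/ImageSets/Segmentation/val.txt").write_text(
    "\n".join(p.stem for p in test_set))
```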

3.2. Evaluation Criteria

In this study, we used Accuracy, F1, and mIoU as evaluation metrics to comprehensively measure the performance of the model on the SubLane dataset. Because this study deals with a binary classification problem, that is, classifying pixels in an image into lanes (red) and background (black), the evaluation metrics are calculated based on the following basic definitions: true positives (TP) represent the number of pixels correctly identified as lanes by the model, true negatives (TN) represent the number of pixels that the model correctly identifies as background, false positives (FP) are the number of background pixels incorrectly identified as lane, and false negatives (FN) are the number of lane pixels incorrectly identified as background.
Precision reflects the fraction of pixels that the model predicts to be positive (lanes) that are true positives. It is calculated as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
Recall measures the fraction of all true lane pixels that the model correctly identifies. It is calculated as follows:
\mathrm{Recall} = \frac{TP}{TP + FN}
F1 is the harmonic mean of precision and recall, which is used to comprehensively evaluate the performance of the model. It is calculated as follows:
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
mIoU is a commonly used evaluation metric in semantic segmentation tasks that measures the degree of overlap between the segmentation results predicted by the model and the ground-truth labels. It is calculated as follows:
\mathrm{mIoU} = \frac{TP}{TP + FP + FN}
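As a concrete illustration, the short sketch below computes these quantities from binary lane masks (1 = lane, 0 = background); the function and variable names are ours, not part of any released code.

```python
import numpy as np

def lane_metrics(pred, gt):
    """Compute Precision, Recall, F1, and IoU for binary lane masks.

    pred, gt: numpy arrays of 0/1 values with the same shape
    (1 = lane, 0 = background).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # lane predicted as lane
    fp = np.logical_and(pred, ~gt).sum()      # background predicted as lane
    fn = np.logical_and(~pred, gt).sum()      # lane predicted as background
    eps = 1e-9                                # avoid division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou

# Toy example: a 4-pixel strip where one lane pixel is missed
pred = np.array([[1, 1, 0, 0]])
gt   = np.array([[1, 1, 1, 0]])
print(lane_metrics(pred, gt))  # approximately (1.0, 0.667, 0.8, 0.667)
```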

4. Design of Segmentation Module

4.1. Improved DeepLabV3+ Network Structure

DeepLab is a series of deep-learning models proposed by Google for semantic segmentation that is widely used in image segmentation [46]. Among them, DeepLabV3+ effectively solves the deficiency in the spatial detail recovery of DeepLabV3 by introducing a decoder structure, and it significantly improves the accuracy of boundary segmentation. However, existing models still have some limitations when dealing with the objects studied in this paper. Therefore, this study proposes an improved DeepLabV3+ model, the structure of which is illustrated in Figure 3.
First, Xception [47] of DeepLabV3+ has a complex structure and a large number of parameters, making it difficult to deploy in embedded systems. The lightweight network MobileNetV2 was used as the backbone network in this study to reduce the computational complexity and number of parameters and improve the deployment efficiency of the model. Second, the Atrous Spatial Pyramid Pooling (ASPP) module was improved. ASPP is a multi-scale context-capture structure proposed in DeepLabV2, which encodes multi-scale information through the parallel use of dilated convolutions with different dilation rates. However, ASPP requires a large dilation rate to obtain a sufficiently large receptive field when processing high-resolution images; as the dilation rate increases (e.g., dilation rate > 24), the effect of dilated convolution gradually weakens, resulting in a decline in modeling ability. To solve this problem, a new structure, the lightweight LC-DenseASPP module, was proposed in this study. The structure transmits the output of each dilated convolutional layer to subsequent layers via dense connections, ensuring that subsequent layers can acquire progressively larger receptive fields while avoiding the kernel degeneration issue of the ASPP module. To further reduce the computational complexity, the original six-layer densely connected structure was reduced to three layers, and the CBAM was introduced after each layer of dilated convolution to adaptively refine features and enhance the model’s attention to important features. In addition, to mitigate the gridding effect that may occur when the dilation rate is large, this study replaced the last dilated convolution layer (dilation rate of 24) with an image-pooling operation, which generates smoother feature maps by aggregating local areas, thereby effectively avoiding the gridding effect. Finally, the upsampling method of DeepLabV3+ was optimized. The original model performs upsampling using bilinear interpolation and transposed convolution. Although the performance is good, there are still some shortcomings in detail recovery. Therefore, this study replaced it with DySample. DySample can adaptively adjust the upsampling process according to the content of the input feature map through dynamically generated convolution kernels to better recover detailed information. Compared with bilinear interpolation, DySample can significantly reduce blurriness in the upsampling process, particularly at object edges and in small structural regions. In addition, the dynamic kernel generation process of DySample requires little computation, and its sampling efficiency is better than that of transposed convolution. In summary, the improved DeepLabV3+ model proposed in this study significantly improves computational efficiency and segmentation accuracy by using a lightweight backbone network, optimizing the ASPP structure, and introducing the DySample upsampling method, making it particularly suitable for embedded system deployment and high-resolution image processing tasks. The code and dataset are available at: https://github.com/boyoung617/Improved_DeepLabV3-_Model.git (accessed on 14 July 2025).

4.2. Selection of Backbone Network MobileNetV2

MobileNetV2 is a lightweight convolutional neural network architecture designed for mobile and embedded devices that reduces computation and parameter count while maintaining high accuracy. The core idea is to optimize network performance through depthwise separable convolution and inverted residual structures. Depthwise separable convolution requires significantly less computation and fewer parameters than standard convolution while maintaining feature extraction capability.
The inverted residual structure of MobileNetV2 first increases the dimensions and then reduces them. This structure enables nonlinear transformation in a high-dimensional space, enhances feature representation ability, and reduces information loss. In the inverted residual structure, the ReLU activation function is not used after the last 1 × 1 convolution; linear activation is used instead. This is because using ReLU in low-dimensional spaces results in information loss, whereas linear activation preserves more information. The structure of MobileNetV2 is shown in Figure 4.
The inverted residual structure uses ReLU6 as the activation function. ReLU6 is a rectified linear unit activation function that limits the output of the ReLU to between zero and six. It is mathematically defined as follows:
\mathrm{ReLU6}(x) = \min(\max(0, x), 6)
First, a 1 × 1 convolution is used to expand the channel dimension of the input feature map, and ReLU6 is then applied for nonlinear processing of the features. Subsequently, a 3 × 3 depthwise separable convolution with the ReLU6 activation function is applied to fully extract and nonlinearly transform the features. Finally, a 1 × 1 convolution is used for feature fusion and dimension reduction. The overall structure of MobileNetV2 is stacked with multiple inverted residual modules. Table 1 shows the network structure of MobileNetV2 when the input resolution is 320 × 320.
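A minimal PyTorch sketch of one inverted residual block as described above (1 × 1 expansion with ReLU6, 3 × 3 depthwise convolution with ReLU6, then a linear 1 × 1 projection) is given below; it illustrates the MobileNetV2 design rather than the exact configuration in Table 1.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """One MobileNetV2-style inverted residual block (sketch)."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion to a higher-dimensional space, then ReLU6
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution (groups = channels), then ReLU6
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back down (no ReLU, to limit information loss)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 32, 80, 80)
print(InvertedResidual(32, 32).forward(x).shape)  # torch.Size([1, 32, 80, 80])
```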

4.3. LC-DenseASPP Network Structure

4.3.1. Dilated Convolution

Dilated convolution is a powerful tool that can extend the receptive field while maintaining the feature map resolution. As shown in Figure 5, the dilated convolution controls the size of the receptive field by introducing dilation rates.
As the dilation rate increases, the receptive field also expands, allowing the neural network to capture more comprehensive contextual information. However, a low dilation rate may not make full use of the context information, whereas a high dilation rate may cause training difficulties and negatively affect model performance. Therefore, optimizing the dilation rate can improve network performance without significantly increasing computing costs. The mathematical expression for dilated convolution is as follows:
y_i = \sum_{k=1}^{K} x_{i + r \cdot k} \cdot w_k
where x is the input feature map; y is the output feature map; w is the weight of the convolution kernel; r is the dilation rate, which controls the sparsity of the convolution kernel; K is the size of the convolution kernel, with k indexing its elements; and i is the spatial position index of the output feature map.
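The short sketch below illustrates how the dilation rate enlarges the receptive field of a single 3 × 3 convolution while keeping the output resolution fixed; the padding choice (padding = r) is one common way to preserve the spatial size and is ours, not prescribed by the paper.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 40, 40)
for r in (1, 6, 12, 18):
    # A 3x3 kernel with dilation r covers a (2r + 1) x (2r + 1) region;
    # padding = r keeps the output the same spatial size as the input.
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=r, padding=r)
    y = conv(x)
    print(f"dilation={r:2d}, receptive field={2 * r + 1:2d}x{2 * r + 1:2d}, output={tuple(y.shape)}")
```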

4.3.2. CBAM Module

CBAM improves model performance by adaptively learning the importance of different locations and channels in the feature mapping. It captures more discriminative information without adding complexity to the network. The architecture of the CBAM is shown in Figure 6.
The formula for channel attention is as follows:
M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right)
where σ is the sigmoid activation function, M_c(F) is the channel attention weight, and F is the input feature map. The channel attention weight M_c(F) is multiplied by the input feature map F channel-by-channel to obtain the feature map weighted by channel attention:
F' = M_c(F) \otimes F
where ⊗ denotes channel-by-channel multiplication. In the channel attention module, average and maximum pooling operations are performed in parallel. The generated feature vectors are then processed using a shared multilayer perceptron (MLP), which typically consists of two fully connected layers. The first layer reduces the feature dimension to reduce the computational complexity, and the second layer restores the feature map to its original dimension by matching the number of input channels. The learned channel weights are multiplied by each channel of the original input feature map to realize weighted processing of the feature map.
The formula for spatial attention is as follows:
M_s(F') = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')\right]\right)\right)
where f^{7×7} is a convolution operation with a 7 × 7 kernel. The spatial attention weight M_s(F') is multiplied position-by-position with F' to obtain the final output feature map:
F'' = M_s(F') \otimes F'
where ⊗ denotes position-by-position multiplication. In the spatial attention module, the two feature maps obtained from average and maximum pooling are fused. The fused feature maps are sent to the convolution layer for learning, and the weights of each spatial position are determined. The obtained weight value for each spatial position is then multiplied by the original feature map to perform spatial weighting, emphasizing the key regions and suppressing the non-key regions to enhance feature expression.
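The following minimal PyTorch sketch follows the channel and spatial attention formulas above; the reduction ratio and 7 × 7 kernel are commonly used CBAM defaults and are stated here as assumptions rather than the exact settings of this work.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM sketch following the channel/spatial formulas above."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP for channel attention: reduce then restore the channel dim
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7 convolution over the pooled 2-channel map for spatial attention
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f):
        # Channel attention: M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)
        # Spatial attention: M_s(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')]))
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(pooled))

x = torch.randn(1, 256, 40, 40)
print(CBAM(256).forward(x).shape)  # torch.Size([1, 256, 40, 40])
```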

4.3.3. LC-DenseASPP Module

By combining multi-scale feature extraction with an attention mechanism to improve the traditional DenseASPP module [48], we obtained the LC-DenseASPP module, which performs better in semantic segmentation tasks. Figure 7 shows the structure of the LC-DenseASPP module. Specifically, the input feature map first undergoes three layers of parallel dilated convolution operations, with dilation rates of 6, 12, and 18, to capture multi-scale context information from local to global. The CBAM is then introduced after the dilated convolution operation at each layer to adaptively enhance important features and suppress redundant information through channel and spatial attention mechanisms, thus improving the feature expression ability. The fourth layer adopts an image-pooling operation to capture the global context information of the input feature map through global average pooling, followed by a 1 × 1 convolution for dimensionality reduction and upsampling to keep the spatial size consistent with the output of the first three layers. The outputs of the first three CBAM layers and the output of the fourth image-pooling layer are then concatenated in the channel dimension to form a multi-scale fused feature representation. Finally, a 1 × 1 convolution is used to reduce the dimension of the concatenated feature map, producing an output feature map with high resolution and rich multi-scale context information. This design not only effectively solves the problem of the fixed kernel and lack of global context information in traditional sampling methods but also significantly improves segmentation accuracy in complex scenes through a dynamic feature enhancement mechanism. Experiments demonstrate that the LC-DenseASPP module performs well in multi-scale image segmentation and can significantly improve the performance of semantic segmentation models.
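A simplified sketch of this layout is given below: three dilated 3 × 3 branches (rates 6, 12, and 18) whose inputs are densely connected, each followed by CBAM, plus an image-pooling branch, fused by a 1 × 1 convolution. The channel widths and the exact dense wiring (feeding the CBAM-refined output forward) are our reading of Figure 7, not the released configuration; the CBAM class from the sketch in Section 4.3.2 is assumed to be in scope.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# Assumes the CBAM class from the Section 4.3.2 sketch is defined above.

class LCDenseASPP(nn.Module):
    """Sketch of the LC-DenseASPP layout: densely connected dilated branches
    (rates 6/12/18) with CBAM, plus an image-pooling branch."""
    def __init__(self, in_ch=320, branch_ch=128, out_ch=256):
        super().__init__()
        rates = (6, 12, 18)
        self.branches, self.attn = nn.ModuleList(), nn.ModuleList()
        ch = in_ch
        for r in rates:
            # Each branch sees the input plus all previous branch outputs (dense connection)
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True)))
            self.attn.append(CBAM(branch_ch))
            ch += branch_ch
        # Image-pooling branch: global average pool + 1x1 conv, then upsample
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False), nn.ReLU(inplace=True))
        # 1x1 projection of the concatenated multi-scale features
        self.project = nn.Conv2d(branch_ch * 4, out_ch, 1, bias=False)

    def forward(self, x):
        feats, dense_in = [], x
        for branch, attn in zip(self.branches, self.attn):
            y = attn(branch(dense_in))                 # dilated conv + CBAM refinement
            feats.append(y)
            dense_in = torch.cat([dense_in, y], dim=1) # feed forward to later branches
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode='bilinear', align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))

x = torch.randn(1, 320, 20, 20)
print(LCDenseASPP().forward(x).shape)  # torch.Size([1, 256, 20, 20])
```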

4.4. DySample Network Structure

In semantic segmentation tasks, upsampling is a key step in restoring low-resolution feature maps to high resolution. Traditional sampling methods (such as bilinear interpolation and transposed convolution) have the following limitations: First, they use a fixed upsampling kernel, which cannot adapt to the scale and shape of different targets. Second, these methods rely only on local pixel information and lack global context information. Finally, in complex scenes, traditional methods may lead to the loss of detailed information, which affects segmentation accuracy. The core of DySample is the generation of a dynamic upsampling kernel using a lightweight convolutional network. The upsampling kernel is dynamically generated based on the content of the input feature maps; therefore, it can adapt to the scale and shape of various targets. Figure 8 shows the network structure of DySample, which was designed to overcome the limitations of traditional methods and improve segmentation accuracy.
Suppose the input feature map is F ∈ R^{H × W × C}, where H and W are the height and width of the feature map, respectively, and C is the number of channels. The process of dynamic upsampling kernel generation can be expressed as follows:
K = f_{\mathrm{gen}}(F)
where f_{gen} is a lightweight convolutional network (usually 1 × 1 convolutions) that is used to generate the dynamic upsampling kernel K ∈ R^{H × W × k_h × k_w}, where k_h and k_w are the height and width of the upsampling kernel.
After generating the dynamic upsampling kernels, DySample uses them to perform a weighted summation over the input feature map to produce the high-resolution output feature map. The calculation of the output feature map O ∈ R^{sH × sW × C} after upsampling (where s is the upsampling factor) can be expressed as follows:
O(i, j) = \sum_{m} \sum_{n} F(m, n) \cdot K(i, j, m, n)
where O(i, j) is the value of the output feature map at position (i, j), F(m, n) is the value of the input feature map at position (m, n), and K(i, j, m, n) is the upsampling-kernel weight linking output position (i, j) to input position (m, n).
The upsampled feature map usually needs to be fused with a high-resolution feature map (usually from the encoder) to recover the detailed information. The fusion can be either concatenation or element-wise addition. Let the high-resolution feature map be F_high ∈ R^{sH × sW × C_high}. The process of feature fusion can be expressed as follows:
F_{\mathrm{out}} = g(O, F_{\mathrm{high}})
where g is the fusion function, which can be concatenation or element-wise addition.
If concatenation is used, then F_out ∈ R^{sH × sW × (C + C_high)}:
F_{\mathrm{out}} = \mathrm{Concat}(O, F_{\mathrm{high}})
If element-wise addition is used (which requires C_high = C), then F_out ∈ R^{sH × sW × C}:
F_{\mathrm{out}} = O + F_{\mathrm{high}}
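The sketch below implements the kernel-weighted formulation above (a lightweight 1 × 1 convolution predicts a per-position kernel, which is then applied as a normalized weighted sum over a local neighbourhood). It is an illustrative approximation of these equations, not the official DySample implementation; the kernel size, softmax normalization, and sub-pixel arrangement are our choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Content-aware upsampling sketch following the K = f_gen(F) formulation."""
    def __init__(self, channels, scale=2, k=3):
        super().__init__()
        self.scale, self.k = scale, k
        # f_gen: a lightweight 1x1 convolution predicting one k*k kernel
        # for each of the scale*scale sub-positions of every input pixel
        self.kernel_gen = nn.Conv2d(channels, scale * scale * k * k, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k
        kernels = self.kernel_gen(x).view(b, s * s, k * k, h, w)
        kernels = F.softmax(kernels, dim=2)              # normalise each kernel
        patches = F.unfold(x, k, padding=k // 2)         # k*k neighbourhood of every pixel
        patches = patches.view(b, c, k * k, h, w)
        # Weighted sum of the neighbourhood with the predicted kernels
        out = torch.einsum('bcphw,bsphw->bcshw', patches, kernels)
        out = out.reshape(b, c * s * s, h, w)
        return F.pixel_shuffle(out, s)                   # rearrange to (B, C, sH, sW)

low = torch.randn(1, 256, 20, 20)
print(DynamicUpsample(256).forward(low).shape)  # torch.Size([1, 256, 40, 40])
```

The upsampled output can then be fused with the high-resolution encoder feature by concatenation or element-wise addition, as in the fusion equations above.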

5. Experiment

5.1. Experiment Setup

The experimental setting of this study included the environment configuration, parameter settings, and training strategy. The experimental environment was based on the Python 3.8.19 and PyTorch 1.12.1 frameworks to ensure the generalization ability of the model. Training and testing were performed on a high-performance PC with an Intel Core i9-14900HX processor and an NVIDIA GeForce RTX 4070 Ti SUPER graphics card. The training process was divided into freezing and unfreezing stages. In the freezing stage, the weights of the backbone network were fixed, and only some parameters were fine-tuned; in the unfreezing stage, the backbone network was unfrozen, and the entire model was trained. By fixing the random seed (seed = 11), the complete reproducibility of the experimental results was ensured. The resolution of the input images was set to 320 × 320 to balance computational efficiency with feature extraction capability. The optimizer was stochastic gradient descent (SGD) with momentum set to 0.9 and weight decay set to 1 × 10−4 to enhance the convergence stability of the model and prevent overfitting. The detailed configurations are presented in Table 2. In the lane-detection task, if the inference speed of the model reached 30 frames per second, the model was considered to meet the real-time requirement.
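For reference, a minimal sketch of this configuration is shown below (fixed seed of 11, 320 × 320 inputs, SGD with momentum 0.9 and weight decay 1 × 10−4); the learning rate, batch size, and the stand-in model are placeholders, since the full settings are given in Table 2.

```python
import random
import numpy as np
import torch

def set_seed(seed=11):
    """Fix all random sources so runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(11)

model = torch.nn.Conv2d(3, 2, 1)       # stands in for the segmentation model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=7e-3,                           # placeholder; see Table 2 for the actual value
    momentum=0.9,
    weight_decay=1e-4,
)
dummy_input = torch.randn(4, 3, 320, 320)   # images are resized to 320 x 320
```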

5.2. Training and Results

To speed up the training process and make the training loss converge quickly, the network was first initialized with pre-trained weights and then fine-tuned via transfer learning. The transfer learning process was divided into two stages. In the first stage, the weights of the network model were initialized using the Pascal VOC 2007 dataset. The pre-training weights were obtained by freezing the batch-normalization layers and the decoder part of the LC-DenseASPP module and were continuously updated during feature transfer. In the second stage, the model was trained using the SubLane dataset. The features extracted by MobileNetV2 are common to the entire model. By freezing the backbone network at the beginning of training, the training of the model can be accelerated, and the risk of weight damage during training can be minimized. Later in training, the backbone network is unfrozen and actively participates in the training of the entire model. The change in the loss value during model training and validation over 100 training epochs is shown in Figure 9. As shown in the figure, as training proceeds, the training and validation loss curves converge rapidly and remain stable. The mIoU of lane-image segmentation using the improved DeepLabV3+ model reached 95.48%.
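The two-stage freeze/unfreeze strategy can be expressed as toggling requires_grad on the backbone parameters, as in the hypothetical helper below; the attribute name "backbone" and the stage boundaries are assumptions for illustration.

```python
import torch.nn as nn

def set_backbone_frozen(model: nn.Module, frozen: bool, backbone_name: str = "backbone"):
    """Freeze or unfreeze the backbone sub-module by toggling requires_grad."""
    backbone = getattr(model, backbone_name)
    for p in backbone.parameters():
        p.requires_grad = not frozen
    # Keep BatchNorm statistics fixed while the backbone is frozen
    backbone.eval() if frozen else backbone.train()

# Stage 1: train only the decoder / LC-DenseASPP head on top of frozen features
# set_backbone_frozen(model, frozen=True)
# ... train for the freezing-stage epochs ...
# Stage 2: unfreeze the backbone and fine-tune the whole network
# set_backbone_frozen(model, frozen=False)
```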
In this study, the improved DeepLabV3+ model and other mainstream semantic segmentation models were used to test, compare, and analyze lane images. Figure 10 shows a comparison of the segmentation effects of each model on the lane image, including the original input image, ground-truth label, and segmentation results of different models. The experimental results show that the improved DeepLabV3+ model performs well in lane boundary recognition and drivable area prediction, and its segmentation results are highly consistent with the real labels, which is significantly better than the other comparison models.

5.3. Experiment and Analysis

To ensure that the experiments are rigorous and comparable, this study introduces a corresponding baseline model as the reference standard, whose core value is to provide a benchmark for the subsequent comparative analysis; its structure is shown in Figure 11. By comparing performance against the baseline model, we can clearly quantify the specific enhancement achieved at each stage as the modules are gradually improved.
In this study, the DeepLabV3+ model was improved in multiple stages, and its performance increased significantly. The mIoU of the baseline model was 95.10%, the inference speed was 158 FPS, and the number of parameters was 11.782 M. By introducing the LC-DenseASPP module, the mIoU of the model increased to 95.41%, the inference speed was maintained at 157 FPS, and the number of parameters was reduced to 10.332 M, which indicates that the module can improve the segmentation accuracy and effectively reduce the complexity of the model. After incorporating the CBAM attention mechanism, the mIoU of the model was further increased to 95.44%, but the inference speed decreased slightly to 123 FPS, and the number of parameters was almost unchanged (10.330 M), indicating that the CBAM has a certain impact on computational efficiency while enhancing the feature expression ability. Finally, after the introduction of the DySample upsampling method, the mIoU of the model reached 95.48%, the inference speed increased to 128 FPS, and the number of parameters increased slightly to 10.416 M, indicating that DySample can optimize computational efficiency while improving accuracy. Overall, the improved model significantly improved the segmentation accuracy and reduced the number of parameters while maintaining a high inference speed, demonstrating its superior performance in real-time semantic segmentation tasks. Table 3 shows the ablation experiments.
The experimental results show that the proposed model performs well in semantic segmentation tasks. As shown in Table 4, the mPA value of our model reached 97.72%, which was significantly better than the comparison models, such as U-Net, PSPNet, HRNet, and DeepLabV3+. Specifically, our model improves the F1 score and accuracy by 0.65% compared with DeepLabV3+, which fully reflects its significant advantages in segmentation accuracy. In addition, the number of parameters of the proposed model was 10.416 M, and the inference speed reached 128 FPS. Although the number of parameters is approximately twice that of DeepLabV3+, the inference speed is approximately twice that of DeepLabV3+, which is of great value for the real-time performance of lane-detection tasks. Although SegFormer is slightly better than our model in terms of performance, our model achieves a better balance between inference speed and parameter count, showing stronger competitiveness. In general, our model effectively improves segmentation accuracy, robustness, and real-time performance through its improvement strategies, which verifies its practicability and advancement in semantic segmentation tasks.
We note that on the SubLane dataset, the mPA of SegFormer (97.82%) is slightly higher than that of our model (97.72%). However, we need to emphasize the following points in order to fully and objectively present the differences between the two:
  • Real-time performance advantage: Our model achieves an inference speed of 128 FPS, which is more than twice that of SegFormer (59 FPS). This significant speedup is critical for practical autonomous driving applications, where real-time responsiveness (typically requiring ≥30 FPS) directly impacts vehicle safety and decision-making latency.
  • Parameter efficiency: While SegFormer has 13.678 M parameters and our model has 10.416 M parameters, the key advantage lies in the balance between accuracy and computational complexity. For embedded systems in vehicles with limited hardware resources, our model’s lower parameter count and higher speed make it more deployable without significantly compromising accuracy (a difference of only 0.1% in mIoU).
  • Scenario adaptability: SegFormer, as a transformer-based model, excels in general semantic segmentation but may not be specifically optimized for lane detection in suburban scenes. Our model’s improvements (LC-DenseASPP, CBAM, and DySample) are tailored to address suburban road challenges (e.g., blurred boundaries, dynamic backgrounds), which is reflected in its robust performance on lane-specific segmentation.
As shown in Figure 12, this study compares the IoU values of our model with those of other mainstream models in road and background segmentation tasks. The experimental results demonstrate that our model can capture the details of the road region more accurately in the task of road segmentation, and the segmentation results are more precise and have higher robustness. Simultaneously, our model also exhibits excellent generalization ability in background segmentation tasks and can effectively cope with semantic segmentation challenges in complex scenes. By optimizing the network structure and introducing efficient modules, our model significantly improves the segmentation performance, provides reliable technical support for practical application scenarios such as lane detection, and fully verifies the effectiveness and practicability of its improvement strategy.

5.4. Study Limitations

Despite the promising results achieved in this study, several limitations should be acknowledged to guide future research directions.
First, the proposed model’s performance was primarily validated under clear weather conditions. The SubLane dataset, constructed to focus on suburban road scenarios, only includes images captured in good lighting and clear weather. This limits the generalizability of the model to adverse environmental conditions such as rain, fog, snow, or low-light (dawn/dusk) situations, where lane boundaries and background features may be further obscured or distorted. Such conditions could significantly degrade the model’s segmentation accuracy, as the LC-DenseASPP module and DySample upsampling may struggle to distinguish lane features from weather-induced noise.
Second, the current model architecture, while optimized for real-time performance (128 FPS), still has room for improvement in terms of extreme lightweight deployment. Although MobileNetV2 reduces computational complexity compared to heavier backbones, the integration of the LC-DenseASPP module and CBAM attention mechanism results in a parameter count (10.416 M) that may be challenging for resource-constrained edge devices (e.g., low-power automotive-embedded systems). Balancing accuracy and efficiency for such platforms remains an unresolved challenge.
Third, the SubLane dataset, while addressing a critical gap in suburban lane-detection research, has a relatively small scale (2301 images) compared to large-scale urban datasets like BDD-100k or Cityscapes. This limited size may restrict the model’s ability to capture the full diversity of suburban road variations, such as rare but critical scenarios (e.g., temporary construction zones, rural-urban transition areas, roads with non-standard materials like gravel). Expanding the dataset to include more diverse samples and edge cases is essential for enhancing the model’s robustness.
Fourth, the model’s performance relies heavily on the effectiveness of the LC-DenseASPP module in capturing multi-scale context, but it may still struggle with highly ambiguous lane patterns, such as lanes with intermittent markings or those adjacent to visually similar off-road regions (e.g., dirt shoulders with texture similar to the road surface). The current attention mechanism (CBAM) primarily refines channel and spatial features but lacks explicit modeling of long-range dependencies between distant lane segments, which could be critical for maintaining continuity in such ambiguous cases.
These limitations highlight the need for future work, including expanding the dataset to cover adverse weather and diverse suburban scenarios, exploring more lightweight network designs for edge deployment, and integrating advanced context-modeling mechanisms to handle highly ambiguous lane patterns.

6. Conclusions

In this study, we propose an improved DeepLabV3+ model tailored for lane detection in suburban road environments, where road markings are often scarce and boundary information is complex. By addressing the limitations of the traditional ASPP module in DeepLabV3+, we introduced a series of enhancements, including the integration of DenseASPP with the CBAM attention mechanism and the adoption of DySample for dynamic upsampling. These improvements significantly enhanced the model’s ability to capture multi-scale context information, refine feature representations, and recover detailed boundary information, leading to superior segmentation accuracy and computational efficiency. The experimental results demonstrate that our proposed model achieved an mIoU of 95.48% and a processing speed of 128 frames per second, striking an effective balance between segmentation accuracy and real-time performance. Furthermore, the introduction of the SubLane dataset, which was specifically designed for suburban road scenarios, fills a critical gap in the existing research and provides a valuable resource for future studies in this domain.
Compared with other state-of-the-art models, our improved DeepLabV3+ exhibits significant advantages in terms of segmentation accuracy, robustness, and real-time performance. The model’s ability to handle complex suburban road environments with blurred boundaries and dynamic backgrounds makes it a promising solution for autonomous driving applications. Future work will focus on further optimizing the model for deployment on resource-constrained devices and expanding the SubLane dataset to include more diverse road conditions and weather scenarios than those currently included.

Author Contributions

S.C.: Responsible for funding support and experimental equipment support. B.Y.: Responsible for experimental design, conducting experiments, collecting necessary data, and writing the manuscript. H.G. and H.L.: Responsible for collecting original experimental data images. H.X.: Responsible for experimental code writing. S.C. and Z.W.: Responsible for the revision and review of manuscripts. B.Y. and Y.Z.: Responsible for the debugging of experimental programs. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by [Guangxi Science and Technology Program] grant number [AD25069026] and [Guangxi Science and Technology Major Program] grant number [AA23062024]. This work was also supported by the Guangxi Bagui Young Scholars Project.

Data Availability Statement

The data that support the findings of this study are available from the Second Author, [Yang Bo], upon reasonable request. The email address is 20230103077@stdmail.gxust.edu.cn.

Conflicts of Interest

Authors Hui Gao and Haijun Xu were employed by the company Liuzhou Wuling New Energy Automobile Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that there are no conflicts of interest, either financial or non-financial, related to the submitted work. All co-authors have agreed to the publication of this manuscript.

References

  1. Waykole, S.; Shiwakoti, N.; Stasinopoulos, P. Review on Lane Detection and Tracking Algorithms of Advanced Driver Assistance System. Sustainability 2021, 13, 11417. [Google Scholar] [CrossRef]
  2. Mims, L.K.; Gangadharaiah, R.; Brooks, J.; Su, H.; Jia, Y.; Jacobs, J.; Mensch, S. What Makes Passengers Uncomfortable in Vehicles Today? An Exploratory Study of Current Factors that May Influence Acceptance of Future Autonomous Vehicles. In Proceedings of the WCX SAE World Congress Experience, Detroit, MI, USA, 18–20 April 2023. [Google Scholar]
  3. Chen, W.; Wang, W.; Wang, K.; Li, Z.; Li, H.; Liu, S. Lane departure warning systems and lane line detection methods based on image processing and semantic segmentation: A review. J. Traffic Transp. Eng. Engl. Ed. 2020, 7, 748–774. [Google Scholar] [CrossRef]
  4. Yao, S.; Guan, R.; Huang, X.; Li, Z.; Sha, X.; Yue, Y.; Lim, E.G.; Seo, H.; Man, K.L.; Zhu, X.; et al. Radar-Camera Fusion for Object Detection and Semantic Segmentation in Autonomous Driving: A Comprehensive Review. IEEE Trans. Intell. Veh. 2024, 9, 2094–2128. [Google Scholar] [CrossRef]
  5. Pavel, M.I.; Tan, S.Y.; Abdullah, A. Vision-Based Autonomous Vehicle Systems Based on Deep Learning: A Systematic Literature Review. Appl. Sci. 2022, 12, 6831. [Google Scholar] [CrossRef]
  6. Tang, J.; Li, S.; Liu, P. A review of lane detection methods based on deep learning. Pattern Recognit. 2021, 111, 107623. [Google Scholar] [CrossRef]
  7. Rasib, M.; Butt, M.A.; Riaz, F.; Sulaiman, A.; Akram, M. Pixel Level Segmentation Based Drivable Road Region Detection and Steering Angle Estimation Method for Autonomous Driving on Unstructured Roads. IEEE Access 2021, 9, 167855–167867. [Google Scholar] [CrossRef]
  8. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  9. Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
  10. Zou, Q.; Jiang, H.; Dai, Q.; Yue, Y.; Chen, L.; Wang, Q. Robust Lane Detection from Continuous Driving Scenes Using Deep Neural Networks. IEEE Trans. Veh. Technol. 2020, 69, 41–54. [Google Scholar] [CrossRef]
  11. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  12. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  13. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  15. Cheng, Z.; Bai, F.; Xu, Y.; Zheng, G.; Pu, S.; Zhou, S. Focusing Attention: Towards Accurate Text Recognition in Natural Images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  16. Guo, M.-H.; Liu, Z.-N.; Mu, T.-J.; Hu, S.-M. Beyond Self-Attention: External Attention Using Two Linear Layers for Visual Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5436–5447. [Google Scholar] [CrossRef]
  17. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  18. An, S.; Liao, Q.; Lu, Z.; Xue, J.-H. Efficient Semantic Segmentation via Self-Attention and Self-Distillation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15256–15266. [Google Scholar] [CrossRef]
  19. Hu, K.; Xu, K.; Xia, Q.; Li, M.; Song, Z.; Song, L.; Sun, N. An overview: Attention mechanisms in multi-agent reinforcement learning. Neurocomputing 2024, 598, 128015. [Google Scholar] [CrossRef]
  20. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  21. Brauwers, G.; Frasincar, F. A General Survey on Attention Mechanisms in Deep Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 3279–3298. [Google Scholar] [CrossRef]
  22. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems, Proceedings of NeurIPS 2021, Online, 6–14 December 2021; Curran Associates Inc.: Red Hook, NY, USA, 2021; pp. 12077–12090. [Google Scholar]
  24. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  25. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  26. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93. [Google Scholar] [CrossRef]
  27. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  28. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
  29. Ko, Y.; Lee, Y.; Azam, S.; Munir, F.; Jeon, M.; Pedrycz, W. Key Points Estimation and Point Instance Segmentation Approach for Lane Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 8949–8958. [Google Scholar] [CrossRef]
  30. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  31. Tabelini, L.; Berriel, R.; Paixao, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. Keep Your Eyes on the Lane: Real-Time Attention-Guided Lane Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  32. Qin, Z.; Wang, H.; Li, X. Ultra Fast Structure-Aware Deep Lane Detection. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 276–291. [Google Scholar]
  33. Tabelini, L.; Berriel, R.; Paixão, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. PolyLaneNet: Lane Estimation via Deep Polynomial Regression. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6150–6156. [Google Scholar]
  34. Li, Q.; Yu, X.; Chen, J.; He, B.-G.; Wang, W.; Rawat, D.B.; Lyu, Z. PGA-Net: Polynomial Global Attention Network with Mean Curvature Loss for Lane Detection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 417–429. [Google Scholar] [CrossRef]
  35. Yang, C.; Tian, Z.; You, X.; Jia, K.; Liu, T.; Pan, Z.; John, V. Polylanenet++: Enhancing the polynomial regression lane detection based on spatio-temporal fusion. Signal Image Video Process. 2024, 18, 3021–3030. [Google Scholar] [CrossRef]
  36. Wang, B.; Wang, Z.; Zhang, Y. Polynomial Regression Network for Variable-Number Lane Detection. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 719–734. [Google Scholar]
  37. Li, X.; Li, J.; Hu, X.; Yang, J. Line-CNN: End-to-End Traffic Line Detection with Line Proposal Unit. IEEE Trans. Intell. Transp. Syst. 2020, 21, 248–258. [Google Scholar] [CrossRef]
  38. Liu, L.; Chen, X.; Zhu, S.; Tan, P. CondLaneNet: A Top-To-Down Lane Detection Framework Based on Conditional Convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  39. Dong, L.; Zhang, H.; Ma, J.; Xu, X.; Yang, Y.; Wu, Q.M.J. CLRNet: A Cross Locality Relation Network for Crowd Counting in Videos. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 6408–6422. [Google Scholar] [CrossRef] [PubMed]
  40. Wang, Z.; Ren, W.; Qiu, Q. LaneNet: Real-Time Lane Detection Networks for Autonomous Driving. arXiv 2018, arXiv:1807.01726. [Google Scholar] [CrossRef]
  41. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as Deep: Spatial CNN for Traffic Scene Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  42. Van Gansbeke, W.; De Brabandere, B.; Neven, D.; Proesmans, M.; Van Gool, L. End-to-end Lane Detection through Differentiable Least-Squares Fitting. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  43. Chang, D.; Chirakkal, V.; Goswami, S.; Hasan, M.; Jung, T.; Kang, J.; Kee, S.-C.; Lee, D.; Singh, A.P. Multi-lane Detection Using Instance Segmentation and Attentive Voting. In Proceedings of the 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 15–18 October 2019; pp. 1538–1542. [Google Scholar]
  44. Li, J. Lane Detection with Deep Learning: Methods and Datasets. Inf. Technol. Control 2023, 52, 297–308. [Google Scholar] [CrossRef]
  45. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  46. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  47. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  48. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Figure 1. Diagram of the data acquisition vehicle and the camera mounting position.
Figure 2. Sample images from the SubLane dataset. (a) Images captured on real roads; (b) images collected from the internet.
Figure 3. Improved DeepLabV3+ network architecture.
Figure 4. Inverted residual structure in MobileNetV2.
Figure 5. Ordinary convolution and dilated convolution. (a) 3 × 3 ordinary convolution; (b) 3 × 3 dilated convolution with dilation rate 1.
Figure 6. Structure of the CBAM.
Figure 7. Structure of LC-DenseASPP.
Figure 8. Sampling point generator in DySample.
Figure 9. The change in the loss value during model training and validation.
Figure 10. Comparison of the segmentation results of different semantic segmentation models on the SubLane dataset.
Figure 11. Baseline network architecture.
Figure 12. Comparison of the IoU values of the semantic segmentation models on the SubLane dataset.
Table 1. MobileNetV2 network structure when the input image size is 320 × 320 × 3.

Layer | Input | Operator | s | r | n | Output
Convolution | 320 × 320 × 3 | Conv2D 3 × 3 | 2 | – | 1 | 320 × 320 × 16
Bottleneck 1 | 320 × 320 × 16 | Inverted Residual Block | 1 | Yes | 2 | 160 × 160 × 24
Bottleneck 2 | 160 × 160 × 24 | Inverted Residual Block | 2 | No | 3 | 80 × 80 × 32
Bottleneck 3 | 80 × 80 × 32 | Inverted Residual Block | 1 | Yes | 4 | 40 × 40 × 64
Bottleneck 4 | 40 × 40 × 64 | Inverted Residual Block | 1 | Yes | 3 | 40 × 40 × 96
Bottleneck 5 | 40 × 40 × 96 | Inverted Residual Block | 2 | No | 3 | 20 × 20 × 160
Bottleneck 6 | 20 × 20 × 160 | Inverted Residual Block | 1 | Yes | 1 | 20 × 20 × 320
Convolution | 10 × 10 × 1024 | Conv2D 1 × 1 | 1 | – | 1 | 10 × 10 × 1280

s denotes the stride of the convolution operation and determines the size of the output feature map. r indicates whether a residual (shortcut) connection is used; unlike ResNet, a shortcut is added only when the stride is 1 and the input feature map has the same shape as the output feature map (see the sketch below). n indicates how many times the bottleneck module is repeated in the corresponding stage. The final layer outputs a 20 × 20 × 1280 feature map, which is fed to the LC-DenseASPP module of the improved DeepLabV3+.
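To make the shortcut rule described above concrete, the following is a minimal PyTorch sketch of an inverted residual block. It is not the authors' implementation; the expansion factor of 6 and the example tensor size are illustrative assumptions.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expansion (1x1) -> depthwise 3x3 -> linear projection (1x1)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # Shortcut only when stride = 1 and the input/output feature maps have the
        # same shape, which is what the "r" column in Table 1 records.
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),       # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),  # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_shortcut else out

# Example: a stride-1 block with matching channels keeps the residual shortcut.
x = torch.randn(1, 96, 40, 40)
print(InvertedResidual(96, 96, stride=1)(x).shape)  # torch.Size([1, 96, 40, 40])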
Table 2. Experimental parameter settings.

Keys | Values
Input shape | 320 × 320
Initial epoch | 0
Freeze epoch | 50
Unfreeze epoch | 100
Initial learning rate | 7 × 10⁻³
Minimum learning rate | 7 × 10⁻⁵
Optimizer type | SGD
Momentum | 0.9
Weight decay | 1 × 10⁻⁴
Learning rate decay strategy | cos
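For readers who wish to reproduce a comparable setup, the sketch below wires the values from Table 2 into PyTorch's SGD optimizer and a cosine learning-rate schedule. The `model.backbone` attribute and the loop skeleton are assumptions for illustration, not the authors' code.

import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer_and_scheduler(model: nn.Module, total_epochs: int = 100):
    # SGD with momentum 0.9 and weight decay 1e-4, as listed in Table 2.
    optimizer = SGD(model.parameters(), lr=7e-3, momentum=0.9, weight_decay=1e-4)
    # Cosine decay from the initial learning rate (7e-3) toward the minimum (7e-5).
    scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs, eta_min=7e-5)
    return optimizer, scheduler

def set_backbone_frozen(model: nn.Module, frozen: bool) -> None:
    # Freeze/unfreeze strategy: the backbone is frozen during the first 50 epochs
    # and released for the remaining epochs (50-100 in Table 2).
    for p in model.backbone.parameters():  # assumes the model exposes a `backbone` attribute
        p.requires_grad = not frozen

# Skeleton of the schedule (training and validation steps omitted):
# for epoch in range(100):
#     set_backbone_frozen(model, frozen=(epoch < 50))
#     ...train one epoch on 320 x 320 inputs...
#     scheduler.step()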
Table 3. Ablation experiments between modules.

Method | Background IoU/% | Road IoU/% | mIoU/% | FPS | Params/M
Baseline | 95.24 | 94.96 | 95.10 | 158 | 11.782
+DenseASPP | 95.52 | 95.31 | 95.41 | 157 | 10.332
+DenseASPP&CBAM | 95.52 | 95.35 | 95.44 | 123 | 10.330
+DenseASPP&CBAM&DySample | 95.56 | 95.39 | 95.48 | 128 | 10.416
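As a reference for how the accuracy figures in Table 3 are typically obtained, the NumPy sketch below computes per-class IoU and mIoU from a pixel-level confusion matrix for the two classes (background, road). The matrix values are illustrative placeholders, not the paper's data.

import numpy as np

def iou_from_confusion(cm: np.ndarray) -> np.ndarray:
    # IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class c.
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / (tp + fp + fn)

cm = np.array([[9_500_000, 250_000],    # rows: ground truth (background, road)
               [220_000, 4_800_000]])   # cols: prediction  (background, road)
iou = iou_from_confusion(cm)
print({"background IoU": iou[0], "road IoU": iou[1], "mIoU": iou.mean()})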
Table 4. Summary of the performance of the semantic segmentation models on the SubLane dataset.

Method | Backbone | Class (Background / Road) | Recall/% | Precision/% | F1/% | mPA/% | Accuracy/% | Params/M | FPS
U-Net | VGG | Background / Road | 97.81 / 95.96 | 96.26 / 97.63 | 96.90 | 96.88 | 96.91 | 24.891 | 37
PSP-Net | ResNet50 | Background / Road | 97.45 / 95.31 | 95.68 / 97.23 | 96.41 | 96.38 | 96.42 | 46.739 | 50
HR-Net | HRNet-W18 | Background / Road | 97.32 / 96.65 | 96.87 / 97.13 | 96.98 | 96.99 | 97.00 | 29.538 | 11
SegFormer | EfficientNet-B0 | Background / Road | 97.74 / 97.90 | 98.02 / 97.60 | 97.81 | 97.82 | 97.82 | 13.678 | 59
DeepLabV3+ | MobileNetV2 | Background / Road | 97.04 / 97.06 | 97.24 / 96.86 | 97.05 | 97.05 | 97.05 | 5.813 | 62
Ours | MobileNetV2 | Background / Road | 96.62 / 98.83 | 98.87 / 96.48 | 97.70 | 97.72 | 97.69 | 10.416 | 128
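The remaining metrics in Table 4 (recall, precision, F1, mPA, and overall accuracy) can be derived from the same kind of 2 × 2 confusion matrix under their usual definitions. The sketch below shows one way to do so; the numbers are illustrative, and treating the per-model F1 as an average over the two classes is an assumption.

import numpy as np

def per_class_metrics(cm: np.ndarray):
    # cm rows: ground truth (background, road); cm cols: prediction.
    tp = np.diag(cm).astype(float)
    recall = tp / cm.sum(axis=1)        # per-class recall (per-class pixel accuracy)
    precision = tp / cm.sum(axis=0)     # per-class precision
    f1 = 2 * precision * recall / (precision + recall)
    mpa = recall.mean()                 # mean pixel accuracy over classes
    accuracy = tp.sum() / cm.sum()      # overall pixel accuracy
    return recall, precision, f1, mpa, accuracy

cm = np.array([[9_500_000, 250_000],
               [220_000, 4_800_000]])
recall, precision, f1, mpa, acc = per_class_metrics(cm)
print(f"mPA = {100 * mpa:.2f}%, accuracy = {100 * acc:.2f}%, mean F1 = {100 * f1.mean():.2f}%")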