World Electric Vehicle Journal
Article • Open Access

23 April 2024

PortLaneNet: A Scene-Aware Model for Robust Lane Detection in Container Terminal Environments

1 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
2 Shanghai East Container Terminal Co., Ltd., Shanghai 200137, China
3 School of Intelligent Manufacturing and Control Engineering, Shanghai Polytechnic University, Shanghai 201209, China
* Author to whom correspondence should be addressed.

Abstract

In this paper, we introduce PortLaneNet, an optimized lane detection model specifically designed for the unique challenges of enclosed container terminal environments. Unlike conventional lane detection scenarios, this model addresses complexities such as intricate ground markings, tire crane lane lines, and various types of regional lines that significantly complicate detection tasks. Our approach includes the novel Scene Prior Perception Module, which leverages pre-training to provide essential prior information for more accurate lane detection. This module capitalizes on the enclosed nature of container terminals, where images from similar area scenes offer effective prior knowledge to enhance detection accuracy. Additionally, our model significantly improves understanding by integrating both high- and low-level image features through attention mechanisms, focusing on the critical components of lane detection. Through rigorous experimentation, PortLaneNet has demonstrated superior performance in port environments, outperforming traditional lane detection methods. The results confirm the effectiveness and superiority of our model in addressing the complex challenges of lane detection in such specific settings. Our work provides a valuable reference for solving lane detection issues in specialized environments and proposes new ideas and directions for future research.

1. Introduction

In recent years, the synergy between the burgeoning automotive industry and artificial intelligence technology has led to significant advancements in computer vision technologies, which have become a pivotal part of autonomous driving and advanced driver-assistance systems (ADAS). Lane detection, serving as a fundamental requirement in the sensory modules of driving systems, is primarily executed through vision cameras mounted on vehicles, utilizing these advanced technologies for assistive positioning and navigation. Currently, lane detection technology utilizes the powerful iterative learning capabilities of deep learning and extensive data resources, achieving significant progress in common scenarios such as urban roads, characterized by high detection accuracy and low-latency real-time functionality.
The allure of automated container terminals has expanded due to their attributes of high efficiency, energy savings, reliability, and economic benefits, leading to an increasing number of ports globally adopting automation upgrades and transformations. The introduction of Automated Container Trucks (ACTs) has been a critical factor in the transition of traditional ports from being geographically confined comprehensive logistics hubs to dynamic catalysts for regional growth [1]. The automation of container terminals relies not only on efficient logistics management systems but also on precise and reliable autonomous driving technologies, including advanced lane detection, obstacle recognition, and vehicle positioning capabilities. These technologies ensure that ACTs can operate safely and effectively in the complex environment of container terminals, thereby enhancing the operational efficiency and safety of the entire terminal. Therefore, optimizing lane detection technology for port environments to meet their unique operational conditions is not only a necessity for technological development but also a crucial step towards advancing port automation and enhancing regional logistics efficiency. This transformation is heavily reliant on the continuous innovation in autonomous driving technologies and assistance systems, which are essential for the efficient and safe operation of ACTs within the complex environment of container terminals.
However, challenges persist in special environments like enclosed terminal areas. On one hand, the frequent intense operations of horizontal transport vehicles and heavy machinery in terminals can lead to temporary obstructions and wear on lane markings, as shown in Figure 1a,b, presenting non-standardized lane layouts. This poses higher demands on the adaptability and error tolerance of the model. Additionally, the terminal environment’s susceptibility to complex and variable weather conditions like dense fog and rain or snow affects the visibility of lane markings, further complicating detection efforts. Therefore, facing the dual challenges of lane markings themselves and variable weather, lane detection models need to possess high robustness and anti-interference capabilities to ensure effective vehicle operation in various special environments, which is vital for ensuring the operational safety and improving the efficiency of terminal operations. On the other hand, terminal environments feature unique characteristics, such as the complex ground markings shown in Figure 1c–e, which may visually resemble lane lines but serve entirely different functions. This urgently necessitates the development of new lane detection methods capable of accurately identifying lanes and effectively distinguishing these different types of ground markings. Traditional lane detection methods, often based on handcrafted features, may perform well in some environments but face limitations in the complex conditions of terminals.
Figure 1. Challenges encountered in lane detection at container terminals.
In light of these considerations, this paper innovatively proposes a lane detection method named PortLaneNet, optimized for the special environment of enclosed terminals, breaking through the limitations of traditional lane detection models. At the core of PortLaneNet is the introduction of a Scene Prior Perception Module, which provides crucial a priori information for lane detection. Furthermore, by integrating an attention mechanism, the model can focus more effectively on key areas crucial for lane detection.
Our contributions are summarized as follows:
Novel Lane Detection Method: Addressing the special requirements of container port environments, this paper builds upon the LaneATT model to design the PortLaneNet architecture for the diverse and complex challenges of ports, enhancing the model’s adaptability and accuracy in such scenarios.
Scene Prior Perception Module: By incorporating feature fusion technology, the model is jointly trained with scene information and the lane detection task, enabling it to fully utilize high-level semantic information in the images, thereby further improving detection accuracy and adaptability.
Superior Performance: Through rigorous experiments, PortLaneNet has been demonstrated to outperform existing lane detection models in enclosed terminal environments, effectively addressing the detection challenges presented by such specialized settings.
The remainder of this paper will detail PortLaneNet’s research methods and experimental results. First, we review recent advancements in the field of lane detection. Then, we delve into the architectural design and methodology of the proposed model. Following that, we present a series of experiments conducted in enclosed container terminal environments and their outcomes, validating our model’s superior performance over existing methods in such specialized scenes. Finally, we discuss the research findings and propose potential directions for future work.

3. Methodology

In this section, we present PortLaneNet, a real-time lane detection model specifically designed for container terminal environments. PortLaneNet takes the RGB images $I \in \mathbb{R}^{3 \times H_i \times W_i}$ captured by a monocular camera mounted on a container truck as input and predicts the lanes $L = \{l_1, l_2, \ldots, l_N\}$. The overall architecture of the algorithm is shown in Figure 2. At the input layer, we utilize RGB images captured by the camera. Feature extraction is performed by a ResNet-based backbone, and a pyramid structure is employed to extract and integrate features from high to low levels, denoted as L0, L1, and L2, respectively. The Scene Prior Perception Module processes features from the L0 layer to perform a scene classification task. Learnable anchors are used for anchor-based feature pooling, further enhancing the model’s performance. The spatial attention module is responsible for feature fusion, combining the $F_{L0}$ and $F_{L2}$ feature maps through weighted summation and element-wise multiplication to generate the final fused feature map $F_{final}$. Finally, the model produces outputs for the lane detection classification task ($L_{cls}$) and the regression task ($L_{reg}$).
Figure 2. Overview of the proposed PortLaneNet framework.

3.1. Lane Representation

We adopt a point set representation for lane lines: each lane line is composed of the coordinates of N points, $l = \{P_1, P_2, \ldots, P_N\}$, where each point $P_i = (x_i, y_i)$ is a key point on the lane line. This representation regresses points directly, avoiding the bias introduced by parametric curve fitting, and since each point is independent, it is more robust to deformed lanes. Compared with other representations, such as parametric curves, it requires less computation and is easier to optimize. In the implementation, we uniformly sample the N points along the y-axis so that the point set has a fixed, regular structure. During training, we directly regress the offset of each point to bring the predicted point set as close as possible to the true lane line.
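As a concrete illustration of this representation, the short Python sketch below resamples an annotated lane polyline to N points uniformly spaced along the y-axis and computes the per-point horizontal offsets that the regression targets would be built from; the function names, the choice of N, and the interpolation scheme are our own assumptions rather than details taken from the paper.

```python
import numpy as np

def sample_lane_points(lane_xy, img_height, n_points=72):
    """Resample an annotated lane polyline to N points uniformly spaced in y.

    lane_xy: array of shape (M, 2) with (x, y) vertices of the annotated lane.
    Returns an (n_points, 2) array; x is interpolated at fixed y positions.
    (Extrapolation beyond the annotated span is clamped -- a simplification.)
    """
    ys = np.linspace(0, img_height - 1, n_points)
    # np.interp expects increasing sample positions, so sort vertices by y first.
    order = np.argsort(lane_xy[:, 1])
    xs = np.interp(ys, lane_xy[order, 1], lane_xy[order, 0])
    return np.stack([xs, ys], axis=1)

def point_offsets(pred_x, gt_x):
    """Per-point horizontal offsets that the regression head is trained to minimize."""
    return gt_x - pred_x
```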

3.2. Backbone

In this study, the backbone of our model relies on ResNet [26] for feature extraction, specifically employing variants such as ResNet-18 and ResNet-34. Known for their deep residual learning framework, these architectures incorporate residual blocks with skip connections that facilitate the flow of gradients, effectively addressing the challenge of vanishing gradients during the training of deep neural networks. The selection of ResNet-18 and ResNet-34 is motivated by their capability to offer robust feature extraction while maintaining a lower computational complexity, making them ideal for real-time image processing tasks.
Furthermore, we have designed a pyramid structure [27] to extract and integrate features of different levels. In this design, we have three levels of features from high level to low level, denoted as L0, L1, and L2, respectively. To enhance the model’s performance, we perform pooling on the features of the L0 and L2 layers and fuse these features, improving the model’s representational capability. Moreover, to augment the model’s understanding of the scene, we input the L0 layer’s features into a Scene Prior Perception Module for conducting a scene classification task. This approach allows the model to utilize low-level features to extract basic structural information of the image, and also understand the image’s more complex semantic information via high-level features, thereby improving the accuracy and robustness of our model.
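The following sketch shows one plausible way to obtain three feature levels from a ResNet-18 backbone with torchvision, in the spirit of the pyramid described above; which ResNet stages correspond to L0, L1, and L2, as well as the class and variable names, are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class PyramidBackbone(nn.Module):
    """Extract three feature levels (L2: low, L1: mid, L0: high) from ResNet-18."""

    def __init__(self, pretrained=True):
        super().__init__()
        weights = ResNet18_Weights.IMAGENET1K_V1 if pretrained else None
        net = resnet18(weights=weights)
        # Stem + stage 1 (stride 4)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
        self.stage2 = net.layer2   # stride 8  -> assumed L2 (low-level)
        self.stage3 = net.layer3   # stride 16 -> assumed L1
        self.stage4 = net.layer4   # stride 32 -> assumed L0 (high-level)

    def forward(self, x):
        x = self.stem(x)
        l2 = self.stage2(x)
        l1 = self.stage3(l2)
        l0 = self.stage4(l1)
        return l0, l1, l2

# Example: a 320x800 RGB input, as used in the experiments (pretrained=False avoids a download).
feats = PyramidBackbone(pretrained=False)(torch.randn(1, 3, 320, 800))
print([f.shape for f in feats])
```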

3.3. Feature Pooling Based on Learnable Anchors

In our design, we have adopted a crucial characteristic from the DILane model, which involves the use of a limited number of learnable anchor points as opposed to the conventional practice of using a large number of predefined anchor points. As the distribution of lane positions exhibits certain statistical regularities, this strategy allows our model to adapt more flexibly to the characteristics of lane lines in a confined container terminal environment. These anchor points are dynamically adjusted during the training process through the backpropagation algorithm, in order to align more accurately with the lane lines. We believe that this approach not only facilitates more accurate lane localization by our model and avoids misidentification of other ground markings as lane lines, but also enhances the computational efficiency of the model.
In the task of lane line detection, we define each anchor as a four-dimensional vector, $reg^i = (S_x^i, S_y^i, \theta^i, len^i)$, where $S_x^i, S_y^i$ are the normalized x and y coordinates of the starting point, $\theta^i$ represents the direction, and $len^i$ represents the length. For each fixed $y_k^i = \frac{k \cdot H_i}{K - 1}$, the predicted x-coordinate can be calculated through the following formula:
$$\hat{x}_k^i = \frac{1}{\tan\theta^i}\left(y_k^i - S_y^i\right) + S_x^i + \Delta x_k^i$$
where $\Delta x_k^i$ is the predicted offset relative to the anchor. This parameter allows our model to dynamically adjust the position of each anchor point corresponding to the lane line, thereby capturing the actual trajectory of the lane line more accurately. The learnable anchor parameters $reg^i$ are dynamically updated during the training process via the backpropagation algorithm. We have chosen a straightforward initialization strategy to allow for quicker convergence. As for the pooling process, we refer to the pooling method used in LaneATT, which realizes a single-stage detector by pooling features along the anchors themselves. This design enables our model to make more effective use of the global information in the feature map, not just the boundary information, thus further improving the model’s performance.
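The anchor decoding step can be made concrete with the short sketch below, which implements the formula above for a single learnable anchor; the function name, tensor shapes, and the de-normalization of the starting point by the image size are assumptions.

```python
import torch

def anchor_x_coords(reg, delta_x, img_h, img_w, K):
    """Decode predicted x-coordinates from one learnable anchor.

    reg:     tensor (4,) = (S_x, S_y, theta, length); S_x, S_y assumed normalized to [0, 1].
    delta_x: tensor (K,) of per-point offsets predicted by the regression head.
    Returns x-coordinates (in pixels) at K fixed, uniformly spaced y positions.
    """
    s_x, s_y, theta = reg[0] * img_w, reg[1] * img_h, reg[2]
    k = torch.arange(K, dtype=torch.float32)
    y_k = k * img_h / (K - 1)                          # fixed y positions
    x_k = (y_k - s_y) / torch.tan(theta) + s_x + delta_x
    return x_k
```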

3.4. Attention Module

In the architecture of PortLaneNet, we have integrated a feature fusion strategy predicated on a spatial attention mechanism [28]. This strategy significantly enhances the model’s expressive power and its ability to discern lane line features. The spatial attention mechanism plays a pivotal role in the feature fusion phase, where it autonomously learns and allocates weights to the different input feature maps in the final fusion outcome. The underlying premise of this approach is that high-level features carry rich semantic information, while low-level features encapsulate detail-oriented information. By fusing these two types of features, we obtain a representation that is both semantically rich and detail-preserving, which is of paramount importance for lane detection. The input to our spatial attention module consists of two sets of pooled feature maps: the upsampled features from L0 and the features from L2. These feature maps undergo an initial transformation through a fully connected (FC) layer and are subsequently reshaped to achieve spatial congruence. This process forms an initial fused feature map, $F_{init}$, by performing a weighted summation of the $F_{L0}$ and $F_{L2}$ feature maps:
$$F_{init} = F_{L0} \oplus F_{L2}$$
where $\oplus$ signifies the weighted summation operation, integrating these two sets of feature maps into a unified representation.
To determine the significance of each pixel within this fused map, pixel-level self-attention scores, S, are computed, highlighting areas of interest:
$$S = \mathrm{softmax}(W_s \cdot F_{init} + b_s)$$
In this formula, $W_s$ and $b_s$ are the weight matrix and bias term of the spatial attention module, respectively. The softmax function is applied to these self-attention scores, normalizing them into a probability distribution that ranges between 0 and 1. Each pixel’s resulting attention weight thereby indicates its relative importance in the context of feature fusion.
The culmination of this process is the application of the spatial attention weight map, S, to the initial fused feature map through element-wise multiplication, yielding the final fused feature map $F_{final}$:
$$F_{final} = S \odot \left(F_{L0} \oplus F_{L2}\right)$$
The operation $\odot$ denotes the Hadamard product, or element-wise multiplication, ensuring that the contribution of each pixel is adjusted according to its derived importance, as indicated by the spatial attention weight map.
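A minimal sketch of this fusion step is given below, assuming the pooled anchor features are arranged as (batch, positions, channels) tensors; the projection layers and their names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    """Fuse pooled L0 and L2 features with a per-position attention map:
    F_init = F_L0 (+) F_L2, S = softmax(W_s * F_init + b_s), F_final = S (.) F_init."""

    def __init__(self, dim):
        super().__init__()
        self.proj_l0 = nn.Linear(dim, dim)   # FC layers that align the two inputs
        self.proj_l2 = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)       # W_s, b_s producing per-position scores

    def forward(self, f_l0, f_l2):
        # Weighted summation of the two (already spatially aligned) feature sets.
        f_init = self.proj_l0(f_l0) + self.proj_l2(f_l2)          # (B, N, dim)
        s = F.softmax(self.score(f_init).squeeze(-1), dim=-1)     # (B, N) attention weights
        f_final = s.unsqueeze(-1) * f_init                        # element-wise re-weighting
        return f_final
```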
Through the implementation of this spatial attention module, PortLaneNet achieves a dynamic and adaptive feature fusion process. This not only elevates the model’s performance by enabling a more discerning perception of lane lines in varied and complex scenarios but also significantly improves the model’s interpretability. Researchers and practitioners can thus gain deeper insights into the model’s decision-making process, particularly how it discerns and prioritizes different features during lane detection tasks.

3.5. Scene Prior Perception Module

The Scene Prior Perception Module stands as a pivotal component within our PortLaneNet framework, offering high-level contextual semantic insights that are crucial for refining lane detection. This module discerns broader scene contexts within the input image, identifying specific scenarios such as “container areas” and “main roads”. Recognizing these distinct scene categories is essential for lane detection, given that different environments possess unique lane layouts and characteristics. This contextual comprehension is invaluable, aiding the model in distinguishing actual lane lines from other ground markings, especially since lane patterns may vary across different scenes.
A significant innovation in our model is the fusion of scene-specific features derived from the Scene Prior Perception Module with attention-driven features from the primary lane detection pathway. These feature sets operate within distinct representational spaces: scene-specific features provide a global scene understanding, acting as a form of prior knowledge, while attention-driven features capture more localized, detailed aspects crucial for lane detection. To harness both types of information effectively, we adopt a fusion strategy akin to multi-modal fusion by concatenating these features. This fusion ensures that the lane detection model benefits from a comprehensive scene context alongside the critical granular details essential for precise lane delineation.
The mathematical foundation for explaining how the Scene Prior Perception Module enhances lane detection accuracy through additional information can be elucidated as follows [29,30]:
Initially, let $x \sim N(m_x, P_{22})$ represent a set of features (additional information) from a convolution block, where $N(m_x, P_{22})$ denotes a normal distribution with mean $m_x$ and covariance $P_{22}$. Let $y \sim N(m_y, P_{11})$ be the output set, with $e \sim N(0, \sigma^2 I)$ representing Gaussian white noise, where $I$ is the identity matrix. Here, $m_y = E[y] = \hat{x}_k$ is the mean of $y$, with $E$ denoting the expectation.
In the absence of additional information, the expectation of the output is $\hat{x}_k$, assuming $P_{22} > 0$ (positive definite) and that $x$ can be expressed through a complex nonlinear function $f$, i.e., $x = f(y) + e$.
Given that $x$, $y$, and $e$ each follow a Gaussian distribution, their joint distribution is given by:
$$\begin{bmatrix} y \\ x \end{bmatrix} \sim N\left( \begin{bmatrix} m_y \\ m_x \end{bmatrix}, \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix} \right)$$
Considering the new random variable $y \mid x$ as a new measurement given the additional information $x$, it too follows a Gaussian distribution, based on Bayesian principles, the marginal distribution, and the properties of normal distributions:
$$P(y \mid x) = \frac{P(y, x)}{P(x)} \sim N(m_{y|x}, P)$$
where $m_{y|x} = E[y \mid x] = \hat{x}_{k+1}$ defines the new expectation of the output given the additional information.
By least-squares estimation, this simplifies to:
$$\hat{x}_{k+1} = \hat{x}_k + P_{12} P_{22}^{-1} \left(x - m_x\right)$$
The error is defined as $y' = (y \mid x) - \hat{x}_{k+1}$. The expectation of this error is:
$$E[y'] = E[y \mid x] - E[\hat{x}_{k+1}] = \hat{x}_{k+1} - \hat{x}_{k+1} = 0$$
indicating that the estimation is unbiased.
Moreover, the discrepancy between the new measurement given the additional information and the original output expectation is:
$$E[y \mid x] - E[\hat{x}_k] = \hat{x}_{k+1} - \hat{x}_k = P_{12} P_{22}^{-1} \left(x - m_x\right) \neq 0$$
demonstrating that, with additional information, the original output expectation is biased. This framework can also be interpreted within a convolutional neural network, where one convolution block acts as the additional information, and the classification block acts as the new measurement given this information. In this manner, the integration of the Scene Prior Perception Module provides a deeper understanding of complex scene contexts for the lane detection model, improving detection accuracy by incorporating additional information.
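A tiny numerical check, with made-up covariance values, confirms that conditioning on the additional information x shifts the output expectation by exactly $P_{12} P_{22}^{-1}(x - m_x)$, as derived above.

```python
import numpy as np

# Toy check of x_hat_{k+1} = x_hat_k + P12 P22^{-1} (x - m_x) for jointly Gaussian (y, x);
# all numbers below are invented purely for illustration.
m_y, m_x = np.array([1.0]), np.array([0.5])
P12, P22 = np.array([[0.8]]), np.array([[1.0]])

x_obs = np.array([1.2])                                   # the "additional information"
x_hat_k = m_y                                             # expectation without extra information
x_hat_k1 = x_hat_k + P12 @ np.linalg.inv(P22) @ (x_obs - m_x)
print(x_hat_k, x_hat_k1)                                  # the conditional mean shifts as expected
```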
Specifically, the Scene Prior Perception Module’s original output, C, is mapped into a richer representational space using a fully connected layer, denoted as FC(C). This enriched scene representation encapsulates a deeper understanding of the scene context. This enhanced representation is then concatenated with the primary output of the lane detection model, producing two augmented outputs: Lcls for the classification task and Lreg for the regression task. Incorporating this module endows PortLaneNet with a comprehensive understanding of the scene, bolstering its capability to differentiate and accurately detect lanes in complex and varied scenarios typical of port environments.
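The fusion described here can be sketched as follows, assuming the lane features are per-anchor vectors and the scene features come from globally pooled L0 features; all layer names and dimensions are placeholders rather than the authors’ exact design.

```python
import torch
import torch.nn as nn

class ScenePriorFusionHead(nn.Module):
    """Sketch: the scene classifier's output C is mapped through a fully connected layer
    FC(C) and concatenated with the lane features before the classification (L_cls)
    and regression (L_reg) heads."""

    def __init__(self, lane_dim, scene_in_dim, n_scenes, scene_dim, n_lane_types, n_pts):
        super().__init__()
        self.scene_cls = nn.Linear(scene_in_dim, n_scenes)          # scene label logits C
        self.scene_fc = nn.Linear(n_scenes, scene_dim)              # FC(C): enriched scene representation
        self.cls_head = nn.Linear(lane_dim + scene_dim, n_lane_types + 1)  # k lane types + background
        self.reg_head = nn.Linear(lane_dim + scene_dim, n_pts + 1)         # N_pts offsets + length

    def forward(self, lane_feat, pooled_l0):
        c = self.scene_cls(pooled_l0)                               # (B, n_scenes)
        scene_repr = self.scene_fc(c)                               # (B, scene_dim)
        # Broadcast the global scene representation to every anchor's feature vector.
        scene_repr = scene_repr.unsqueeze(1).expand(-1, lane_feat.size(1), -1)
        fused = torch.cat([lane_feat, scene_repr], dim=-1)          # (B, A, lane_dim + scene_dim)
        return self.cls_head(fused), self.reg_head(fused), c
```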

3.6. Loss Function

Each anchor’s classification output, Lcls, is a (k + 1)-dimensional vector, where k is the number of lane types, and the additional dimension represents “background” or invalid proposals. Each anchor’s regression output Lreg is an (Npts + 1)-dimensional vector, where Npts is the number of horizontal offsets, and the additional dimension represents the proposed length.
Our model employs a multi-task loss function that optimizes both the classification and regression tasks simultaneously. Specifically, Focal Loss [31] is used to calculate the classification loss Lcls for each anchor, and Smooth L1 Loss is used to calculate the regression loss Lreg for each anchor, defined as follows:
$$L = \lambda \sum_i L_{cls}\left(p_i, p_i^*\right) + \sum_i L_{reg}\left(r_i, r_i^*\right)$$
where $L_{cls}(p_i, p_i^*)$ is the Focal Loss for the i-th anchor, and $L_{reg}(r_i, r_i^*)$ is the Smooth L1 Loss for the i-th anchor; $p_i^*$ and $r_i^*$ are the true classification label and regression target for the i-th anchor, respectively, and $\lambda$ is a weight factor used to balance the classification and regression losses. Focal Loss is primarily used to address the imbalance between positive and negative samples and is applied to compute the classification loss $L_{cls}$. It reduces the contribution of easily classified samples (mainly negative samples) to the total loss, enabling the model to focus on hard-to-classify samples (mainly positive samples) during training. This is crucial for lane detection because lane line samples (positive samples) are far fewer than background samples (negative samples). Smooth L1 Loss, on the other hand, is used to calculate the regression loss $L_{reg}$ for positive samples. It measures the discrepancy between the lane parameters predicted by the model (horizontal offsets and length) and the ground-truth values; it behaves quadratically for small errors and linearly for large ones, allowing the model to predict lane parameters accurately while keeping training stable.
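A hedged sketch of this multi-task loss, using the focal loss available in torchvision and the Smooth L1 loss in PyTorch, is shown below; the masking of regression terms to positive anchors and the value of λ are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def portlane_loss(cls_logits, cls_targets, reg_pred, reg_targets, pos_mask, lam=10.0):
    """Multi-task loss sketch: focal loss for anchor classification plus
    Smooth L1 on the regression targets of positive anchors only.

    cls_logits/cls_targets: (A, k+1) logits and one-hot targets per anchor.
    reg_pred/reg_targets:   (A, N_pts+1) horizontal offsets + length.
    pos_mask:               (A,) boolean mask of positive anchors.
    lam:                    weight balancing the two terms (value is an assumption).
    """
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets.float(), reduction="mean")
    if pos_mask.any():
        l_reg = F.smooth_l1_loss(reg_pred[pos_mask], reg_targets[pos_mask])
    else:
        l_reg = reg_pred.sum() * 0.0     # keep the graph valid when no positives exist
    return lam * l_cls + l_reg
```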

4. Experiments

In this section, we first introduce the production process of our proprietary dataset and its efficient evaluation metrics. Following this, we provide detailed information about the experimental setup and specific details of the experiment. Finally, we present the experimental results of our proposed method and other similar methods on this dataset.

4.1. Dataset and Evaluation Metrics

To collect the PortLane dataset, we installed a camera with a resolution of 1080 × 1920 on a tractor truck within the operation area and conducted continuous image collection for a week. The dataset not only covers various areas of daily operations, including main roads, crane areas, container areas, and parking lots, but also includes typical images under various climatic types such as sunny, rainy, and cloudy days. As shown in Table 1, we selected 1504 images for data annotation, of which 1204 images were used as the training set and 300 images as the test set. We referred to the mainstream dataset Tusimple [32] to establish the annotation format and evaluation metrics of the dataset. In addition to annotating lane lines, we also annotated scene labels for each image to pre-train the Scene Prior Perception Module. Figure 3 shows in detail the compositional proportion of each type of scene in the dataset.
Table 1. Lane detection dataset description.
Figure 3. Examples and distribution of the dataset classified by scene.
We adopted accuracy as the main evaluation metric for the PortLane dataset, which is defined as follows:
$$\text{accuracy} = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}}$$
In this formula, $C_{clip}$ represents the count of lane points accurately predicted within the clip, while $S_{clip}$ signifies the total point count within the clip. A point prediction is deemed accurate if it falls within a 20-pixel range of the ground truth. A lane prediction is classified as a true positive, significant for the FDR and FNR metrics, only when its correctly predicted points exceed 85%. Additionally, we report the false positive (FP) and false negative (FN) rates, where $FP = \frac{F_{pred}}{N_{pred}}$ and $FN = \frac{M_{pred}}{N_{gt}}$.
To further enhance the comprehensiveness of our evaluation, we introduced the F1 score as an additional metric. The F1 score is calculated as follows:
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where Precision is the ratio of correctly predicted positive observations to the total predicted positives, and Recall is the ratio of correctly predicted positive observations to all observations in the actual class. The F1 score serves as a harmonic mean of Precision and Recall, providing a balance between the Precision and the Recall of our model. By incorporating the F1 score, we aim to offer a more balanced and nuanced understanding of the model’s performance, especially in scenarios where the class distribution is imbalanced.
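The two metrics can be computed as in the sketch below, assuming predictions and ground truth are given as per-lane lists of x-coordinates at the fixed y positions and that lane matching has already been performed; the use of negative markers for missing points is an assumption borrowed from the TuSimple format.

```python
import numpy as np

def lane_accuracy(pred_lanes, gt_lanes, px_thresh=20.0):
    """Fraction of ground-truth lane points matched within 20 px, in the spirit of the
    TuSimple-style metric described above (lane matching is assumed to be done already)."""
    correct, total = 0, 0
    for pred, gt in zip(pred_lanes, gt_lanes):       # one image (clip) at a time
        for p_xs, g_xs in zip(pred, gt):             # paired lanes, x at fixed y positions
            p_xs, g_xs = np.asarray(p_xs, float), np.asarray(g_xs, float)
            valid = g_xs >= 0                        # assumed marker for missing points
            correct += np.sum(np.abs(p_xs[valid] - g_xs[valid]) < px_thresh)
            total += np.sum(valid)
    return correct / max(total, 1)

def f1_score(tp, fp, fn):
    """F1 from lane-level true positives, false positives, and false negatives."""
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```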

4.2. Experiments Settings

The experiments in this study were conducted on a high-performance laptop equipped with an AMD Ryzen 7 5800H processor and NVIDIA GeForce RTX 3060 graphics card, running the Windows 11 operating system. This hardware configuration not only ensures efficient model training and testing but also meets the computational demands of deep learning models.
In terms of model selection, we employed the pre-trained ResNet18 as our backbone network. The choice of ResNet18 over deeper networks like ResNet34 was a deliberate balance between accuracy and computational cost. While deeper networks could potentially offer slightly higher accuracy, they significantly increase computational and memory requirements. This is particularly critical in resource-constrained onboard systems, such as the automated driving systems of container trucks. ResNet18 maintains a high level of accuracy while offering lower computational complexity and memory demand, which is key for applications operating in resource-limited environments.
All input images were uniformly resized to 320 × 800 pixels, balancing the capture of image details with computational efficiency. To optimize the learning process, the Adam optimizer was utilized with an initial learning rate set at 2 × 10⁻³, and the model was trained over 150 epochs. Moreover, we applied a cosine annealing strategy for learning rate adjustment, aiding in finer model tuning during the later stages of training.
Regarding data processing, we applied a series of data augmentation techniques to the PortLane dataset, including random affine transformations and horizontal flipping. These techniques enhance the diversity of the data and improve the model’s adaptability to varying environmental conditions. The Intersection over Union (IoU) threshold for Non-Maximum Suppression (NMS) was set at 0.5, effectively removing duplicate detections while retaining critical lane line information.
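For reference, a minimal optimizer and scheduler setup matching the reported settings (Adam, initial learning rate 2 × 10⁻³, cosine annealing over 150 epochs) might look like the following; the helper names and the per-epoch scheduler step are assumptions.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, epochs=150, lr=2e-3):
    """Adam with cosine-annealed learning rate, per the settings reported above."""
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Typical epoch loop (sketch): step the scheduler once per epoch.
# for epoch in range(150):
#     train_one_epoch(model, loader, optimizer)   # hypothetical helper
#     scheduler.step()
```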
Collectively, these settings were designed to ensure the model’s efficiency and accuracy in lane detection tasks, while also considering the repeatability of the experiment and practical applicability in real-world applications.

4.3. Algorithm Process

The specific training process of the PortLaneNet model is illustrated in Figure 4:
Figure 4. Training process diagram of the PortLaneNet model.
(1)
Data Collection and Preprocessing
The core framework of the PortLaneNet model is initialized based on the pre-trained ResNet-18, which is the backbone network part shown in the diagram. Data collection involves acquiring images of various scenes within container port environments. Data preprocessing includes data cleaning, annotation, and other necessary preprocessing steps to ensure the quality and consistency of the input data.
(2)
Feature Extraction and Scene Understanding
Starting from the initialized backbone network, the model forward-propagates layer by layer, enhancing the understanding of complex scene semantics through the Scene Prior Perception Module and attention mechanisms. These modules work together to improve the model’s accuracy in detecting lane line features.
(3)
Optimization and Evaluation
The network weights are updated using the Adam optimizer, dynamically adjusting the learning rate based on the gradient of the loss function and the estimates of the first and second moments. Figure 5 illustrates the change in the total loss function during the training process of the PortLaneNet model, compared with the baseline method LaneATT. As depicted, the loss curve of PortLaneNet drops rapidly in the initial training phase, indicating that the model is quick to learn useful features and adapt to the characteristics of lane lines in container terminal environments. In contrast, while LaneATT also shows a swift decline in loss early on, its rate of decrease slows down in later training, possibly due to its model structure and pre-trained features not being as targeted as those of PortLaneNet. Notably, in the mid-to-late stages of training, the loss curve of PortLaneNet presents a more stable downward trend compared to LaneATT, suggesting that PortLaneNet possesses superior fitting abilities and generalization performance when faced with complex samples from the training set. This advantage is attributable to its integrated design that incorporates the Scene Prior Perception Module. This design ensures that PortLaneNet can more accurately capture and utilize key information affecting lane detection, enabling more precise lane line detection in the complex and variable environment of container terminals.
Figure 5. Loss curve comparison between the PortLaneNet model and the LaneATT baseline during training.
(4)
Model Performance Tuning
Based on the model’s performance on the validation set, adjustments were made to model performance, including learning rate adjustments, changes to data augmentation strategies during training, and fine-tuning of network parameters. Through continuous iteration, the model gradually achieved optimal performance.

4.4. Results

In this study, we selected LaneATT as our baseline method and compared our proposed approach with other methods based on both ResNet-18 and ResNet-34 backbones. As demonstrated in Table 2, our model shows a significant improvement in accuracy primarily due to the integration of the Scene Prior Perception Module, which considerably enhances the recognition and interpretation of high-level semantics in complex ground markings. However, the incorporation of this module results in a decreased detection speed compared to the baseline methods. With ResNet-18 as the backbone, the model’s speed decreased from 121 frames per second (FPS) to 103 FPS. In the operational context of port areas, this detection speed level is more than capable of meeting the operational requirements. Given that vehicles in port environments do not demand extremely high processing speeds, prioritizing accuracy is crucial for ensuring safety and efficiency in such scenarios. Our model was compared with four recent mainstream efficient algorithms, and the results, achieved under the same experimental conditions, show that the model integrating the Scene Prior Perception Module outperforms LaneATT by approximately 2.6% in accuracy. Therefore, this method can better understand the semantic information of complex scenes and improve recognition accuracy.
Table 2. Comparison with mainstream algorithms on our dataset.
The results showcased in Figure 6 emphatically substantiate our model’s adeptness at navigating complex environments, as reflected by the precision and F1 scores in Table 2. Our model remains unfazed in the face of scenarios that typically challenge visual perception systems: intense sunlight glare (Dazzle), reflective wet surfaces (Wet Lane), partially obscured views (Occluded Lane), and even where the lane integrity is compromised due to breakage (Breakage Lane). Despite these adverse conditions, PortLaneNet accurately discerns and traces the correct lane lines. Moreover, the model excels in contexts with unconventional lane indicators, such as mixed ground markings (Mixed Markings), open spaces with no clear lane demarcations (No Lane), and roads with directional arrows (Arrow). It differentiates between actual lane lines and other similar-looking patterns, maintaining a high detection accuracy. Collectively, these examples in Figure 6 highlight PortLaneNet’s robust feature extraction and contextual comprehension, confirming its potential for real-world deployment where diverse and unpredictable elements are commonplace.
Figure 6. Visualization results of PortLaneNet on our dataset.

5. Conclusions and Discussion

In this paper, we have proposed a real-time lane detection model, PortLaneNet, tailored for enclosed port environments. Our main contribution is the introduction of a Scene Prior Perception Module, which incorporates a scene classification task into the lane detection model through feature fusion. This approach significantly enhances the model’s capacity for semantic understanding of complex and dynamic scenes, greatly improving its robustness and reliability. This innovation, coupled with the integration of an attention mechanism, enables PortLaneNet to adeptly navigate and interpret the unique ground markings and environmental conditions characteristic of port settings, setting it apart from existing methodologies. Experiments conducted on our proprietary PortLane dataset demonstrate that our method shows significant advantages compared to other lane detection algorithms, especially in dealing with ground markings unique to port environments. This research provides valuable insights into applying lane detection tasks in enclosed port settings. The Scene Prior Perception Module offers important a priori contextual information for lane detection, while the attention module enables the aggregation of global context, both enhancing the model’s capability in deciphering complex scenes. The model exhibits strong robustness when encountering obscured or worn-out lane lines, or ground markings that resemble lane lines.
This paper also presents areas that warrant further improvements. First, the accuracy of the Scene Prior Perception Module directly affects the final lane detection results, thus further optimizing the performance of the Scene Prior Perception Module is crucial. Secondly, the model’s performance still declines under extreme lighting conditions. Future work may consider incorporating features with greater illumination robustness. In addition, for heavily occluded lane lines, more advanced feature extraction techniques and network structures can be introduced to improve the adaptability of the model.
Overall, this paper provides an effective methodological reference for the application of lane detection tasks in enclosed terminal environments. Future research can build on the foundation of this paper, continuing to explore how to establish a model’s deep semantic understanding ability of complex scenes to adapt to the needs of different application scenarios. It is believed that with the continuous enhancement of the model’s expressive capabilities, end-to-end lane detection systems will be widely applied, better serving fields such as autonomous driving, and promoting the development of the transportation industry.

Author Contributions

Methodology, H.Y., Z.K. and X.Z.; Software, Z.K.; Resources, C.Z.; Writing—original draft, Z.K.; Writing—review & editing, H.Y., Y.Z. and X.Z.; Visualization, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Chenhe Zhang is an employee of Shanghai East Container Terminal Co., Ltd. The paper reflects the views of the scientists, and not the company.

References

  1. Vinh, N.Q.; Kim, H.-S.; Long, L.N.B.; You, S.-S. Robust Lane Detection Algorithm for Autonomous Trucks in Container Terminals. J. Mar. Sci. Eng. 2023, 11, 731. [Google Scholar] [CrossRef]
  2. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as deep: Spatial CNN for traffic scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  3. Xu, H.; Wang, S.; Cai, X.; Zhang, W.; Liang, X.; Li, Z. CurveLane-NAS: Unifying lane-sensitive architecture search and adaptive point blending. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 689–704. [Google Scholar]
  4. Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning lightweight lane detection CNNS by self attention distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1013–1021. [Google Scholar]
  5. Yu, F.; Wu, Y.; Suo, Y.; Su, Y. Shallow Detail and Semantic Segmentation Combined Bilateral Network Model for Lane Detection. In Proceedings of the IEEE Transactions on Intelligent Transportation Systems, Bilbao, Spain, 24–28 September 2023; Volume 24, pp. 8617–8627. [Google Scholar] [CrossRef]
  6. Qiu, Q.; Gao, H.; Hua, W.; Huang, G.; He, X. Priorlane: A prior knowledge enhanced lane detection approach based on transformer. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 5618–5624. [Google Scholar]
  7. Abualsaud, H.; Liu, S.; Lu, D.B.; Situ, K.; Rangesh, A.; Trivedi, M.M. LaneAF: Robust Multi-Lane Detection with Affinity Fields. IEEE Robot. Autom. Lett. 2021, 6, 7477–7484. [Google Scholar] [CrossRef]
  8. Ko, Y.; Lee, Y.; Azam, S.; Munir, F.; Jeon, M.; Pedrycz, W. Key points estimation and point instance segmentation approach for lane detection. IEEE Trans. Intell. Transp. Syst. 2021, 23, 8949–8958. [Google Scholar] [CrossRef]
  9. Philion, J. FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11582–11591. [Google Scholar]
  10. Wang, J.; Ma, Y.; Huang, S.; Hui, T.; Wang, F.; Qian, C.; Zhang, T. A keypoint-based global association network for lane detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1392–1401. [Google Scholar]
  11. Qu, Z.; Jin, H.; Zhou, Y.; Yang, Z.; Zhang, W. Focus on local: Detecting lane marker from bottom up via key point. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14122–14130. [Google Scholar]
  12. Feng, Z.; Guo, S.; Tan, X.; Xu, K.; Wang, M.; Ma, L. Rethinking efficient lane detection via curve modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17062–17070. [Google Scholar]
  13. Pan, H.; Chang, X.; Sun, W. Multitask Knowledge Distillation Guides End-to-End Lane Detection. IEEE Trans. Ind. Inform. 2023, 19, 9703–9712. [Google Scholar] [CrossRef]
  14. Tabelini, L.; Berriel, R.; Paixao, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. Polylanenet: Lane estimation via deep polynomial regression. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6150–6156. [Google Scholar]
  15. Chae, Y.J.; Park, S.J.; Kang, E.S.; Chae, M.J.; Ngo, B.H.; Cho, S.I. Point2Lane: Polyline-Based Reconstruction with Principal Points for Lane Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14813–14829. [Google Scholar] [CrossRef]
  16. Li, X.; Li, J.; Hu, X.; Yang, J. Line-cnn: End-to-end traffic line detection with line proposal unit. IEEE Trans. Intell. Transp. Syst. 2019, 21, 248–258. [Google Scholar] [CrossRef]
  17. Tabelini, L.; Berriel, R.; Paixao, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. Keep your eyes on the lane: Real-time attention-guided lane detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 294–302. [Google Scholar]
  18. Zheng, T.; Huang, Y.; Liu, Y.; Tang, W.; Yang, Z.; Cai, D.; He, X. Clrnet: Cross layer refinement network for lane detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 898–907. [Google Scholar]
  19. Cheng, Z.; Zhang, G.; Wang, C.; Zhou, W. DILane: Dynamic Instance-Aware Network for Lane Detection. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 2075–2091. [Google Scholar]
  20. Qin, Z.; Wang, H.; Li, X. Ultra fast structure-aware deep lane detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 276–291. [Google Scholar]
  21. Qin, Z.; Zhang, P.; Li, X. Ultra fast deep lane detection with hybrid anchor driven ordinal classification. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2555–2568. [Google Scholar] [CrossRef] [PubMed]
  22. Liu, L.; Chen, X.; Zhu, S.; Tan, P. Condlanenet: A top-to-down lane detection framework based on conditional convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3773–3782. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  24. Liu, R.; Yuan, Z.; Liu, T.; Xiong, Z. End-to-end lane shape prediction with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3694–3702. [Google Scholar]
  25. Bai, Y.; Chen, Z.; Fu, Z.; Peng, L.; Liang, P.; Cheng, E. Curveformer: 3d lane detection by curve propagation with curve queries and attention. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  28. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lile, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  29. Ma, L.; Ren, H.; Zhang, X. Effective cascade dual-decoder model for joint entity and relation extraction. arXiv 2021, arXiv:2106.14163. [Google Scholar]
  30. Zhang, X.; Zhao, N.; Lv, Q.; Ma, Z.; Qin, Q.; Gan, W.; Bai, J.; Gan, L. Garbage Classification Based on a Cascade Neural Network. Neural Netw. World 2023, 2, 101–112. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. TuSimple. Tusimple Benchmark. Available online: https://github.com/TuSimple/tusimple-benchmark (accessed on 30 September 2020).
