Article

An Improved YOLOP Lane-Line Detection Utilizing Feature Shift Aggregation for Intelligent Agricultural Machinery

Cundeng Wang, Xiyuan Chen, Zhiyuan Jiao, Shuang Song and Zhen Ma
1 The Key Laboratory of Micro-Inertial Instrument and Advanced Navigation Technology of Ministry of Education, Southeast University, Nanjing 210018, China
2 School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(13), 1361; https://doi.org/10.3390/agriculture15131361
Submission received: 17 April 2025 / Revised: 6 June 2025 / Accepted: 18 June 2025 / Published: 25 June 2025
(This article belongs to the Section Agricultural Technology)

Abstract

Agricultural factories utilize advanced facilities and technologies to cultivate crops in a controlled environment, enhancing operational yields and reducing reliance on natural resources. This is crucial for ensuring a stable supply of agricultural products year-round and plays a significant role in the transformation of agricultural modernization. Automated Guided Vehicles (AGVs) are commonly employed in agricultural factories due to their low ownership costs and high efficiency. However, small embedded devices on AGVs face significant challenges in managing multiple tasks while maintaining the required timeliness. Multi-task learning (MTL) is increasingly employed to enhance the efficiency and performance of detection models in joint detection tasks, such as lane-line detection, pedestrian detection, and obstacle detection. The YOLOP (You Only Look Once for Panoptic Driving Perception) model demonstrates strong performance in addressing these tasks simultaneously; however, detecting lane lines in changeable agricultural factory scenarios remains challenging, limiting the subsequent accurate planning and control of AGVs. This paper proposes a feedback-based network for joint detection tasks (MTNet) that simultaneously detects pedestrians, automated guided vehicles (AGVs), and QR codes, while also performing lane-line segmentation. This approach addresses the challenge faced by embedded devices mounted on AGVs, which are unable to run multiple models for different tasks in parallel due to limited computational resources. For the lane-line detection task, we also propose an improved YOLOP lane-line detection algorithm based on feature shift aggregation. Self-built datasets were used for training and testing. Comparative experiments against other models on the target-detection and lane-line detection tasks demonstrate the advantages of our model, and we also obtained a significant improvement in processing speed. Furthermore, we conducted ablation experiments to assess the effectiveness of our improvements in lane-line detection, all of which outperformed the original detection model.

1. Introduction

As living standards improve, the demand for agricultural products, particularly fruits and vegetables, is steadily increasing. Consequently, agricultural factories that achieve efficient and sustainable [1] crop production through environmental factor control, independent of natural environmental influences, have been rapidly promoted. At the same time, with the rapid advancement of machine vision, artificial intelligence, vehicle networking, and other cutting-edge technologies, mobile robots are increasingly utilized across various fields. A notable example is the application of low-cost and high-efficiency Automated Guided Vehicles (AGVs) in modern logistics automation [2]. The panoramic drive sensing system is a critical technology that enables AGVs to achieve autonomous driving. This process primarily relies on cameras and other image-sensing devices to capture detailed views of the surrounding road environment [3]. Advanced visual perception and image-processing algorithms are then employed to facilitate functions such as driving-area segmentation, obstacle detection, and lane-line detection to support the decision-making system in achieving autonomous driving. In the AGV visual navigation system, obtaining current-position information is crucial for effective AGV operation. This is typically achieved using pre-set landmarks and feature-identification methods, such as QR codes. The AGV employs a camera to capture images of the environment, detects and recognizes QR code images, and extracts the position information contained within them. This process enables the AGV to determine its current location accurately. The timely acquisition of accurate environmental information is essential for panoramic driving sensing systems. However, in practical applications, autonomous vehicles often implement these systems on embedded devices, which may not ensure real-time performance due to latency arising from limited computational power. Therefore, it is critical to meet both requirements simultaneously.
Many research methods address these tasks separately. For instance, YOLOv5 [4], YOLOv8 [5], and DE-YOLO [6] focus on object detection; RoaDSaVe [7] and SCNN [8] are dedicated to lane-line detection; and UASegNe [9], ENet [10], and PSPNet [11] specialize in semantic segmentation. Although these methods have achieved excellent results, processing these tasks sequentially is more time-consuming than addressing them simultaneously. Furthermore, traffic-scene perception tasks share significant commonalities across different domains. Therefore, it is essential to explore multi-task learning (MTL) approaches that integrate drivable-area segmentation, obstacle detection, and lane-line detection. By leveraging shared underlying features and the complementary nature of diverse data, this approach not only significantly reduces computational demands but also enhances the model's generalization capabilities. Moreover, it aligns with the requirements of panoramic driving perception systems for high precision and real-time performance. Among notable advancements, YOLOP [12] demonstrates exceptional performance. YOLOP effectively addresses three critical tasks in autonomous driving: object detection, drivable-area segmentation, and lane-line detection, significantly reducing computational costs and inference times. However, its lane-line detection performance is less satisfactory in complex scenarios characterized by factors such as dim lighting, reflections, occlusions, and blurriness. By integrating the characteristics of factory farming with the YOLOP network, we propose a feedback-based network for joint detection tasks (MTNet) that simultaneously detects pedestrians, Automated Guided Vehicles (AGVs), and QR codes, while also performing lane-line segmentation. This approach addresses the challenge faced by embedded devices mounted on AGVs, which are unable to run multiple models for different tasks in parallel due to limited computational resources.
Lane-line detection is a crucial component of the AGV panoramic sensing system, serving as the foundation for understanding road-layout information [13]. Traditional lane-line detection methods can be categorized into feature-based and model-based approaches. Feature-based methods typically rely on visual characteristics such as edges, textures, colors, and gradient changes to identify lane lines. However, these methods struggle to adapt to varying road environments. In contrast, model-based lane-line detection has gained prominence in recent years, leveraging deep-learning techniques to achieve remarkable results across diverse complex scenarios due to its robust feature-extraction capabilities. However, excessive computational demands and parameter overheads result in high memory usage, which undermines real-time performance and limits the algorithm’s applicability. Consequently, this paper aims to develop an improved multi-task detection model that balances high accuracy with real-time performance. To achieve this objective, we propose an enhanced YOLOP lane-line detection model (FSA-YOLOP), which optimizes the lane-line segmentation head by refining the feature-extraction network and incorporating the feature shift aggregation network (FSAN) along with coarse and fine-grained joint up-sampling (CFGU). These enhancements aim to improve the model’s computational efficiency and performance in complex road environments, while ensuring the high accuracy and real-time requirements of lane-line detection.
The research presented in this paper encompasses the following key contributions:
  • Network Framework Design: We propose a multi-task joint detection algorithm (MTNet) tailored for embedded devices with limited computational resources, enabling simultaneous lane-line segmentation and the detection of pedestrians, Automated Guided Vehicles (AGVs), and QR codes. The network architecture comprises a shared feature encoder and two decoders for detection and segmentation;
  • Optimization Techniques: We introduced the Feature Shift Aggregation Network (FSAN) and Coarse and Fine Grain Size Combined Up-sampling (CFGU) to optimize the lane-line segmentation header of the MTNet model. These enhancements enable the model to infer lane lines even in complex scenarios, such as occluded, missing, or blurred lines, while addressing the challenge of preserving detailed texture and structural information in dim and reflective environments;
  • Model Evaluation: We conducted training and testing of the MTNet model, performing experimental comparisons with various network architectures. Ablation experiments were designed to validate the effectiveness of the MTNet model in lane-line detection.
The paper is organized as follows: Section 2 presents related work on multi-task learning and lane-line detection. Section 3 describes the overall framework and specific improvements made to the feature-extraction network and lane-line segmentation header. The experimental results are presented in Section 4, and Section 5 summarizes the contributions of this research.

2. Related Work

2.1. Multi-Task Learning

Obstacle detection and lane-line detection in panoramic sensing systems exhibit significant interdependencies. By integrating these tasks through multi-task learning, it is possible to better capture their intrinsic connections, thereby enhancing both the accuracy and efficiency of model detection, as well as improving robustness in complex road environments. MultiNet [14] proposes an end-to-end trained multi-task learning model that features a shared encoder and three decoders, enabling simultaneous scene classification, vehicle detection, and road segmentation on the KITTI dataset. Similarly, DLT-Net [15] employs a shared encoder design to achieve the concurrent detection of vehicles, lane lines, and drivable areas. However, unlike MultiNet, DLT-Net facilitates information transfer between different subtask decoders, significantly enhancing detection accuracy and computational efficiency through the complementarity of subtasks. HybridNets [16] further optimizes the multi-task learning model using a weighted bidirectional feature network, designing a more effective loss function and training strategy to balance and optimize the network, resulting in strong performance on the Berkeley DeepDrive dataset. YOLOP [12] comprises a shared encoder, which includes a backbone network and a feature-fusion component. This encoder extracts generic and rich image features, facilitating information sharing among multiple tasks. The architecture features three decoders: a detection head for traffic target detection, a segmentation head for drivable-area segmentation, and a lane-line segmentation head. Each decoder converts the feature maps generated by the encoder into the output format required for its specific task. Notably, there is no information-sharing mechanism among the different decoders, which enables the network to achieve rapid inference while maintaining high performance. The model performs panoramic driving perception on the BDD100K dataset and successfully executes visual-perception tasks on the embedded Jetson TX2 device at a speed of 23 FPS. Therefore, multi-task learning models are crucial for advancing research in autonomous driving. Figure 1 shows the architecture of YOLOP.

2.2. Lane-Line Detection

Feature-based lane-line detection methods identify lane lines by utilizing visual characteristics such as edges, texture, color, and gradient changes. Niu et al. [17] initially extracted lane lines using a modified Hough transform, subsequently employing a density-based spatial clustering algorithm with noise filtering to complete lane-line detection. Jung et al. [18] utilized multi-frame spatiotemporal images to detect and track lane lines, applying the Hough transform to identify parallel lines at the edges of the lane lines, followed by fitting the lane lines using the least squares algorithm and a cubic curve model. Berriel et al. [19] implemented an adaptive thresholding method to select potential lane-line regions based on identified regions of interest in the images, combining this with a particle-filtering technique to extract the lane lines.
With the advancement of convolutional neural networks, deep-learning-based lane-line detection algorithms have emerged as a significant research focus in recent years [20,21,22,23]. Compared to traditional lane-line detection methods, these deep-learning approaches offer enhanced feature-extraction capabilities and greater robustness, achieving superior results in a variety of complex scenarios [24,25,26,27].
SCNN [8] applied semantic segmentation to the lane-line detection task, thoroughly considering the spatial structural information of lane lines. The spatial convolution module in this framework captures pixel dependencies in both vertical and horizontal directions, allowing each row or column of pixels to be fused with local contextual information. This approach enhances the detection of shape changes and the continuity of lane lines in complex traffic scenarios, although it is computationally intensive. ENet-SAD [28] incorporated channel attention mechanisms that dynamically emphasize important regions and feature channels within the image, enabling the model to focus more effectively on critical areas for lane-line detection. LaneAF [29] introduced an affinity field-based clustering technique that does not constrain the number of detected lane lines. RoaDSaVe [7] improves detection performance in challenging local scenarios by integrating spatial and semantic features to distinguish between road and lane pixels, while also extracting lane-line information based on scene labels. Yao et al. [30] proposed an efficient lane-line detection technique utilizing a lightweight attention-deep neural network, optimizing the detection model and enhancing computational efficiency. CLRNet [31] introduced a cascade optimization algorithm that leverages high-level features to improve the accuracy of lane-line detection.

3. Methods

As shown in Figure 2, we propose a feedback-based model for joint multi-task detection (MTNet). The network architecture consists of a shared encoder for feature extraction and two decoders dedicated to segmentation and detection, respectively.
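To make the shared-encoder/two-decoder layout concrete, the following PyTorch sketch shows how such a multi-task model can be wired together; the submodule names and interfaces are placeholders, not the authors' exact implementation.

```python
# Minimal structural sketch of a shared-encoder, two-decoder multi-task model.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 det_head: nn.Module, seg_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # shared feature extractor (e.g., a Darknet-53-like network)
        self.neck = neck           # shared feature fusion (e.g., a BiFPN-like module)
        self.det_head = det_head   # detection decoder (pedestrians, AGVs, QR codes)
        self.seg_head = seg_head   # lane-line segmentation decoder

    def forward(self, x: torch.Tensor):
        feats = self.neck(self.backbone(x))     # features shared by both tasks
        return self.det_head(feats), self.seg_head(feats)
```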

3.1. Encoder Design

Darknet-53 serves as the backbone network, achieving excellent results in various tasks, including image classification, object detection, semantic segmentation, and pose estimation. It is utilized as a feature extractor to leverage its powerful feature-extraction capabilities, thereby providing more details about lane lines. With regard to the feature fusion component, this paper employs a Weighted Bidirectional Feature Pyramid Network (BiFPN) as a feature distributor for the two task branches. This approach enhances the network’s capacity to represent the features associated with the two different task types. The structure of the Weighted Bidirectional Feature Pyramid Network is illustrated in Figure 3.
MTNet utilizes the bidirectional weighted feature pyramid network (BiFPN) as a feature distributor for two sub-task branches in the feature fusion component and offers several advantages over a traditional feature pyramid network (FPN):
  • Enhanced Feature Fusion: BiFPN introduces a bidirectional feature propagation mechanism, allowing features to be transmitted from both higher layers to lower layers and vice versa. This approach overcomes the unidirectional (bottom-up) feature fusion method of FPN, facilitating comprehensive interaction among features at each layer and improving the integrity of feature fusion;
  • Reduced Information Loss: By incorporating a two-way propagation mechanism, BiFPN minimizes information loss during feature fusion, ensuring that both high-level abstract features and low-level detailed features are effectively utilized;
  • Dynamic Weighted Fusion: BiFPN employs learnable weight parameters to prioritize features from different layers. This dynamic weighting allows the network to adjust the importance of features based on task requirements, enabling adaptive learning of the optimal feature combinations and enhancing the network’s adaptability;
  • Balancing Efficiency and Accuracy: BiFPN maintains relatively low computational complexity while ensuring high accuracy by iteratively applying top-down and bottom-up fusion mechanisms.
Given that this paper aims to achieve simultaneous lane-line segmentation and the detection of 2D images and obstacles, the use of BiFPN provides a sophisticated feature fusion solution for multi-task learning networks. This approach enhances efficient bidirectional information transfer and dynamic weighted fusion, improving the network’s ability to capture multi-scale target details and contextual information while ensuring high performance in detection and segmentation tasks across various complex scenarios.
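As an illustration of the dynamic weighted fusion described above, the following PyTorch sketch implements one BiFPN-style fusion node with learnable, normalized weights; the normalization constant and convolution configuration are illustrative assumptions, not the exact BiFPN used in MTNet.

```python
# Sketch of BiFPN-style fast normalized weighted fusion for one fusion node,
# assuming all inputs have already been resized to a common resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # one learnable weight per input
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, inputs):
        w = F.relu(self.w)                      # keep weights non-negative
        w = w / (w.sum() + self.eps)            # fast normalized fusion
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)

# Usage sketch: fuse a top-down and a bottom-up feature map of the same shape.
# f_top, f_bottom = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
# out = WeightedFusion(2, 64)([f_top, f_bottom])
```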

3.2. Detection-Decoder Design

The detection decoder adopts the multi-scale anchor frame detection concept from YOLOv5, implementing necessary adjustments and extensions. Specifically, we employed an m × m grid division strategy for feature maps at three different scales: small, medium, and large. Within each grid cell, three anchor frames were pre-set to cover and match potential target objects. These grid cells are responsible not only for detecting the presence or absence of targets but also for predicting the exact coordinate regression parameters, confidence scores, and probability distributions for the categories associated with the three detection boxes.
Notably, this paper diverges from YOLOv5 in the design of the detection head. While YOLOv5 utilizes a coupled detection head that simultaneously outputs the coordinate regression parameters, category probabilities, and confidence scores, our approach employs a decoupled detection head design. This design separates the classification and regression tasks into two distinct branches, thereby reducing the coupling between tasks. Each branch focuses on a specific prediction task, which enhances the system’s detection accuracy for each subtask. This decoupled design allows for more precise control over the performance and optimization direction of each prediction task.
The structure of the decoupled detection head is illustrated in Figure 4.
For the feature maps generated by BiFPN, the decoupled detection head first applies a 1 × 1 convolutional layer to reduce the channel count. It then employs two parallel branches: one for classification and the other for regression. Each branch contains two 3 × 3 convolutional layers, and an Intersection over Union (IoU) branch is incorporated into the regression task branch to enhance target-detection accuracy.
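The following PyTorch sketch illustrates the decoupled head layout described above (a 1 × 1 channel-reduction stem, parallel classification and regression branches with two 3 × 3 convolutions each, and an IoU branch on the regression path); the channel widths, activations, and anchor count are illustrative assumptions.

```python
# Sketch of a decoupled detection head with separate classification and regression branches.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, num_anchors: int = 3, mid_ch: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, mid_ch, 1)            # 1x1 conv reduces channels
        self.cls_branch = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.SiLU())
        self.reg_branch = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(mid_ch, num_anchors * num_classes, 1)  # class probabilities
        self.reg_pred = nn.Conv2d(mid_ch, num_anchors * 4, 1)            # box regression
        self.iou_pred = nn.Conv2d(mid_ch, num_anchors * 1, 1)            # IoU / confidence branch

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.iou_pred(reg_feat)
```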
Despite the use of decoupled detection heads to improve detection accuracy, some redundant detection frames may still be generated. These redundancies can occur due to multiple detections of the same target or as a result of background noise and false detections. To eliminate these redundant detection frames, this paper implements the Non-Maximum Suppression (NMS) algorithm as a post-processing step. The core principle of the NMS algorithm is to retain the detection frames with the highest confidence within a local region while suppressing those with lower confidence and higher overlap.
Following the application of the NMS algorithm, a refined set of prediction results was obtained, significantly reducing the number of redundant detections. This streamlined output not only enhances localization accuracy and confidence but also improves the algorithm’s ability to accurately identify and localize objects in complex scenes, thereby enhancing overall detection performance.
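A minimal sketch of the NMS post-processing step is shown below, using torchvision's nms operator; the confidence and IoU thresholds are illustrative values, not the ones used in the paper.

```python
# Sketch of NMS post-processing: keep the highest-confidence boxes and suppress
# overlapping, lower-confidence ones.
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                score_thr: float = 0.25, iou_thr: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) confidence values."""
    keep = scores > score_thr                 # drop low-confidence detections first
    boxes, scores = boxes[keep], scores[keep]
    keep_idx = nms(boxes, scores, iou_thr)    # suppress highly overlapping boxes
    return boxes[keep_idx], scores[keep_idx]
```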

3.3. Split-Decoder Design

The lane-line segmentation header of YOLOP relies on three successive interpolation-based up-sampling operations to obtain the lane-line segmentation results. However, this approach may hinder the network's ability to accurately restore detailed information, such as edge information, about the lane lines. In this paper, we propose using the feature-shift aggregation network (FSAN) following the bidirectional feature pyramid network (BiFPN) to enhance the spatial dimensionality of deeper feature fusion. The resulting feature map is then processed through the coarse and fine granularity joint up-sampler (CFGU) to restore it to the original input size. Figure 5 shows the lane-line detection network design.
Furthermore, recognizing that obstacle detection is significantly more critical than lane-line detection for the safe operation of Automated Guided Vehicles (AGVs), we incorporated a lane-line information feedback network into the segmentation decoder.
The structure of the lane-line information feedback network is illustrated in Figure 6.
In the multi-task network designed for lane-line segmentation and obstacle detection, integrating lane-line segmentation information into the detection task branch is motivated by several key considerations. First, lane-line segmentation and obstacle detection are visually correlated, as the location and orientation of lane lines provide valuable contextual information that enhances obstacle localization and understanding, thereby improving detection accuracy. Second, the edge and line features extracted during lane-line segmentation are crucial for localizing obstacle edges. By sharing these features, the detection branch can leverage them without redundant computations, thus enhancing computational efficiency and inference speed.
Moreover, the lane-line segmentation task typically benefits from abundant labeled data, which can be indirectly utilized to train the obstacle-detection task. This increase in training data enables the model to better adapt to various scenarios and environmental changes, thereby improving its robustness and generalization performance. Most importantly, the multi-task network facilitates joint weight updates for both the lane-line segmentation and obstacle-detection tasks, allowing them to collaboratively learn the underlying features of the road scene. This synergy enables the detection branch to benefit from the lane-line segmentation task’s comprehensive understanding of road structure, ultimately enhancing its ability to comprehend complex scenes.

3.3.1. Feature Shift Aggregation Network (FSAN)

The FSAN module leverages the inherent shape characteristics of lane lines to extract comprehensive information by performing feature shift slicing operations in both the row and column directions of the feature map. This dual-direction operation allows for the fusion of row and column information, resulting in a complete representation of lane-line features and enabling each pixel to capture global information from the image. Subsequently, up-sampling is applied, integrating low-level edge texture information from the first and second layers of the feature extraction network. This approach enables the model to infer potential lane-line locations, even in complex scenarios where lane lines may be occluded, missing, or blurred, thereby facilitating the accurate detection of complete lane lines.
Figure 7 illustrates that feature maps of size C × H × W, fused by the feature pyramid network, were initially input. Subsequently, a slicing operation was conducted row by row to obtain H slices, each of size C × W, labeled 0, 1, 2, …, H − 1. Next, the slices were shifted by i rows, generating the new feature map [i, i + 1, i + 2, …, H − 1, 0, 1, 2, …, i − 1]. Figure 8 illustrates the FSAN module shift aggregation in the column direction. Convolution and Rectified Linear Unit (ReLU) operations were then applied to the newly obtained feature maps, which were subsequently summed with the original feature maps [0, 1, 2, …, H − 1]. This process allows the feature maps to incorporate lane-line information from other rows, representing a deep spatial feature fusion method. As shown in Figure 9, in the Spatial Convolutional Neural Network (SCNN) algorithm, the feature map is sliced horizontally and vertically, and the information is transferred sequentially between slices in four directions (bottom-up, top-down, left-to-right, and right-to-left). In contrast to SCNN, FSAN slices the entire feature map and processes the shifted slices in parallel. This not only enhances computational efficiency and reduces processing time but also enables each row element in the feature map to receive information from distant lane lines, thereby mitigating the information loss typically associated with long-distance transfer in SCNN.
The calculation process of SCNN can be expressed by Equation (1).
$$
X'_{c,p,q} =
\begin{cases}
X_{c,p,q}, & p = 1 \\
X_{c,p,q} + f\!\left( \sum_{m} \sum_{n} X'_{m,\,p-1,\,q+n-1} \times K_{m,c,n} \right), & p = 2, 3, 4, 5, \ldots, H
\end{cases}
\tag{1}
$$
$X_{c,p,q}$ is an element of the given input 3D tensor $X$; $c$, $p$, and $q$ denote the channel, row, and column indices; $X'_{c,p,q}$ is the pixel after updating; and $K_{m,c,n}$ is an element of the 3D kernel tensor $K$ in SCNN_D.
The feature map obtained through deep spatial feature fusion using the FSAN was computed by performing feature aggregation along the row direction. The process is as follows:
$$
X'_{c,p,q} = X_{c,p,q} + \sum_{n=1}^{H/i} f\!\left( X_{c,\,p,\,q+ni} \right)
\tag{2}
$$
Feature aggregation in the column direction is computed as follows:
$$
X'_{c,p,q} = X_{c,p,q} + \sum_{n=1}^{H/i} f\!\left( X_{c,\,p+ni,\,q} \right)
\tag{3}
$$
$X'_{c,p,q}$ is the pixel after updating; $X_{c,p,q}$ is the pixel before updating; $i$ denotes the step size of the shift; and $f$ denotes the convolution and ReLU operation.
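The shift-and-aggregate operation of Equations (2) and (3) can be sketched with torch.roll, which performs the cyclic slice shift described above; the shift step, the shared 3 × 3 convolution, and the order of the two passes are illustrative assumptions rather than the exact FSAN configuration.

```python
# Minimal sketch of feature shift aggregation: shifted copies of the feature map are
# passed through a shared conv + ReLU (the f(.) above) and summed onto the original.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureShiftAggregation(nn.Module):
    def __init__(self, channels: int, step: int = 4):
        super().__init__()
        self.step = step                                   # shift step i
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def aggregate(self, x: torch.Tensor, dim: int) -> torch.Tensor:
        out = x
        size = x.shape[dim]
        for n in range(1, size // self.step + 1):
            shifted = torch.roll(x, shifts=-n * self.step, dims=dim)  # cyclic shift by n*i slices
            out = out + F.relu(self.conv(shifted))                    # apply f(.) then accumulate
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.aggregate(x, dim=2)   # shift slices along the H axis
        x = self.aggregate(x, dim=3)   # shift slices along the W axis
        return x
```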

3.3.2. Coarse-Fine Granularity Combined Up-Sample (CFGU)

We employed two branches for up-sampling to restore the feature map to the size of the input, as shown in Figure 10. The YOLOP algorithm utilizes three consecutive nearest-neighbor interpolation up-sampling operations to resize the image from (H/8, W/8) to (H, W). The fundamental principle of this approach is to assign the pixel value of the nearest pixel to the point being interpolated. However, this method has a significant drawback: nearest-neighbor interpolation fails to leverage the continuity information between pixels, as it solely selects the value of the nearest pixel to the target location for interpolation, disregarding the influence of the surrounding neighboring pixels. Consequently, the resulting interpolated lane-line image often suffers from a loss of detailed texture and structural information, leading to issues such as unsmooth edges and discontinuities (jagged edges). Furthermore, lane lines may exhibit weak texture or low contrast under certain lighting conditions (e.g., low light) or road conditions (e.g., reflections). To address these issues, we propose a coarse and fine-grained joint up-sampling method that utilizes two distinct branches for up-sampling. The coarse-grained up-sampling branch employs nearest-neighbor interpolation to quickly obtain coarse up-sampled features, while the fine-grained up-sampling branch utilizes inverse convolution to preserve more detailed information. Ultimately, this approach integrates the low-level feature information (such as edges and texture) from the P1 and P2 layers of the feature extraction network (Figure 3) to achieve fine-grained up-sampling.
Figure 10 illustrates that a feature map of size (H/8, W/8) was input into CFGU. The coarse-grained up-sampling branch applied nearest-neighbor interpolation to achieve a feature map of size (H/4, W/4) through two-fold up-sampling. Similarly, the fine-grained up-sampling branch employed inverse convolution to obtain a feature map of the same size (H/4, W/4). We fully utilized the low-level edge and texture information from the shallow neural network to extract finer feature information of the lane lines. CFGU fused the two feature maps obtained from the respective branches with the P2 layer of the feature extraction network, resulting in a fused feature map of size (H/4, W/4). Following a second round of CFGU and fusion with the P1 layer of the feature extraction network, we obtained a feature map of size (H/2, W/2). Finally, the feature map was restored to its original size (H, W) using CFGU.
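The following sketch shows one CFGU stage under these assumptions: a nearest-neighbour (coarse) branch and a transposed-convolution (fine) branch each up-sample by a factor of two, and their outputs are fused with a shallow skip feature such as P1 or P2; the channel handling and fusion convolution are illustrative choices.

```python
# Sketch of one coarse/fine joint up-sampling (CFGU) stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFGUStage(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fine = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2)  # fine-grained 2x up-sampling
        self.fuse = nn.Conv2d(in_ch * 2 + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        coarse = F.interpolate(x, scale_factor=2, mode="nearest")  # coarse-grained 2x up-sampling
        fine = self.fine(x)
        fused = torch.cat([coarse, fine, skip], dim=1)             # inject shallow edge/texture detail
        return F.relu(self.fuse(fused))

# Usage sketch: (H/8, W/8) -> fuse with P2 -> (H/4, W/4) -> fuse with P1 -> (H/2, W/2) -> (H, W)
```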

4. Experiments

4.1. Training Details

4.1.1. Experimental Platforms

The experimental platform is based on the visual navigation of AGVs: the microprocessor is an NVIDIA Jetson Xavier (Santa Clara, CA, USA), and the vision sensor is a Stereolabs ZED2i camera (San Francisco, CA, USA). Figure 11 illustrates the actual experimental platform. The experimental environment consists of the following specifications: Windows 11 operating system, AMD 5800X CPU, NVIDIA GeForce RTX 4090 GPU, and the PyTorch 1.11.0 framework. During model training and testing, the input images were resized to 640 × 640 pixels. The hyperparameter settings are shown in Table 1.
The Adam optimizer was employed in the experiments, with a momentum value set to 0.937 and a weight decay coefficient of 0.0005. The first 10 epochs were designated as the model’s warm-up period, during which a smaller learning rate is utilized to stabilize training in its early stages. The learning rate adjustment strategy follows the One-Cycle Policy Learning Rate Scheduler (OneCycleLR), which initially employs a higher learning rate to accelerate convergence and subsequently reduces the learning rate throughout the training process, concluding with a smaller final learning rate of 0.0002.
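A minimal PyTorch sketch of this training configuration is given below (Adam with beta1 = 0.937 and weight decay 0.0005, a OneCycleLR schedule ending near 0.0002, and roughly ten warm-up epochs); the exact scheduler arguments are assumptions, since the paper does not list them.

```python
# Sketch of the optimizer / learning-rate schedule described above.
import torch

def build_optimizer(model: torch.nn.Module, steps_per_epoch: int, epochs: int = 300):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                 betas=(0.937, 0.999),   # beta1 = reported momentum; beta2 = PyTorch default
                                 weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.001,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        pct_start=10 / epochs,      # roughly a 10-epoch warm-up phase
        div_factor=5.0,             # start near 0.0002
        final_div_factor=1.0,       # end near the reported final LR of 0.0002
        cycle_momentum=False,       # keep Adam's beta1 fixed at 0.937
    )
    return optimizer, scheduler     # scheduler.step() is called once per training batch
```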

4.1.2. Loss Function

The loss function of MTNet consists of two parts: the detection loss $L_{det}$ and the segmentation loss $L_{lane\_seg}$.
(1) Detection loss
In the detection task, the loss function is divided into two primary branches: the classification branch and the bounding-box branch. The classification branch employs a binary cross-entropy loss function, denoted as $L_{cls}$. The bounding-box branch consists of the focal loss $L_{fl}$ and the Complete Intersection over Union (CIoU) loss $L_{CIoU}$. Consequently, the detection loss $L_{det}$ can be expressed by Equation (4).
$$
L_{det} = \alpha_1 L_{cls} + \alpha_2 L_{fl} + \alpha_3 L_{CIoU}
\tag{4}
$$
$\alpha_1$, $\alpha_2$, and $\alpha_3$ are the corresponding weighting coefficients.
(2) Lane-line segmentation loss
The lane-line segmentation loss $L_{lane}$ is defined as follows:
$$
L_{lane} = L_{lane\_cls} + L_{lane\_IoU}
\tag{5}
$$
$$
L_{lane\_cls} = L_{FocalLoss} = -\left(1 - P_t\right)^{\gamma} \log\left(P_t\right)
\tag{6}
$$
$$
L_{lane\_IoU} = 1 - \frac{TP}{TP + FP + FN}
\tag{7}
$$
$L_{lane\_cls}$ computes the pixel classification loss and $L_{lane\_IoU}$ computes the intersection-over-union loss. $TP$ refers to a pixel that the model correctly predicts as a lane line and that corresponds to an actual lane line; $FP$ indicates a pixel that the model predicts as a lane line but that is actually background (non-lane line); and $FN$ signifies a pixel that the model predicts as background but that corresponds to an actual lane line.
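A compact sketch of the segmentation loss in Equations (5)-(7) is given below, combining a per-pixel focal term with a soft IoU term computed from predicted probabilities; the value of gamma and the smoothing constant are illustrative choices.

```python
# Sketch of the lane-line segmentation loss: focal loss plus a soft lane-IoU term.
import torch

def lane_seg_loss(logits: torch.Tensor, target: torch.Tensor,
                  gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """logits: (N, 1, H, W) raw scores; target: (N, 1, H, W) binary lane mask."""
    prob = torch.sigmoid(logits)
    # p_t = prob where target == 1, (1 - prob) where target == 0
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    focal = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp(min=eps))
    l_cls = focal.mean()

    # Soft counts: TP ~ prob*target, FP ~ prob*(1-target), FN ~ (1-prob)*target
    tp = (prob * target).sum()
    fp = (prob * (1.0 - target)).sum()
    fn = ((1.0 - prob) * target).sum()
    l_iou = 1.0 - tp / (tp + fp + fn + eps)
    return l_cls + l_iou
```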

4.1.3. Dataset

This paper focuses on lane-line visual navigation technology for autonomous guided vehicles (AGVs) in a factory-farming environment. This includes the detection of lane lines and the detection of targets (e.g., pedestrians, AGVs, and QR codes). The dataset utilized consists of 3600 images collected and manually labeled within a smart factory in Jiangsu, encompassing a total of 6868 lane-line annotations. Each image has a resolution of 1280 × 720 pixels, with the training set comprising 70%, the validation set 20%, and the test set 10%.
The annotation tool utilized in this paper is LabelMe, which employs polygonal boxes for lane lines and rectangular boxes for pedestrians, QR codes, and AGVs. The annotation files are formatted in JSON. Figure 12a,b illustrate examples of target-detection labeling and lane-line labeling, respectively.
For lane-line detection, we employed semantic segmentation. A crucial preprocessing step involved converting lane-line labels from the JSON format into corresponding binary images. Initially, we differentiated the lane lines from the background in the original image, as illustrated in Figure 13a. Subsequently, the lane-line regions were displayed in white in the binary image, as shown in Figure 13b.
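A sketch of this preprocessing step is shown below: it reads a LabelMe JSON file and rasterizes the lane-line polygons into a binary mask; the label name and image size are assumptions about the annotation files.

```python
# Sketch: convert a LabelMe JSON polygon annotation into a binary lane-line mask (Figure 13).
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_to_mask(json_path: str, width: int = 1280, height: int = 720,
                    lane_label: str = "lane") -> np.ndarray:
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (width, height), 0)           # black background
    draw = ImageDraw.Draw(mask)
    for shape in ann.get("shapes", []):
        if shape.get("label") == lane_label and shape.get("shape_type") == "polygon":
            pts = [tuple(p) for p in shape["points"]]
            draw.polygon(pts, fill=255)                 # lane region drawn in white
    return np.array(mask)
```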

4.1.4. Data Enhancements

Insufficient training samples can lead to overfitting, resulting in poor model performance on unseen data. Data augmentation serves as an effective strategy to optimize model performance by expanding the dataset without the need for additional training samples. This approach increases data diversity while simultaneously reducing the risk of overfitting.
To enhance the model’s generalization ability and address the issue of limited training data, we employed various data-augmentation techniques, including translation, rotation, horizontal flipping, scaling, hue adjustment, brightness adjustment, and saturation adjustment. These methods transformed and expanded the original dataset, generating additional training samples that simulated variations in different scenes and environments, such as changes in lighting conditions, weather, and viewing angles. By improving the model’s adaptability to these image variations, we enhanced its detection capabilities in complex scenarios. Figure 14 shows the data-enhancement effects.
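The listed augmentations can be sketched with torchvision transforms as follows; the parameter ranges are illustrative, and for segmentation the geometric transforms would have to be applied identically to the lane-line masks.

```python
# Sketch of the photometric and geometric augmentations described above (image side only).
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                       # horizontal flip
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),                    # rotation / translation / scaling
    transforms.ColorJitter(brightness=0.3, saturation=0.3,
                           hue=0.05),                             # brightness / saturation / hue
    transforms.ToTensor(),
])
```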

4.1.5. Performance-Evaluation Indicators

For the task of detecting pedestrians, automated guided vehicles (AGVs), and QR codes, this paper employs a deep-learning-based target-detection method and utilizes recall rate (Recall, R) and mAP50 as the evaluation metrics.
The recall rate represents the proportion of actual targets that are correctly detected by the model, as defined by Equation (8). A higher recall rate signifies an improved detection capability and a reduced rate of missed detections.
$$
R = \frac{TP}{TP + FN}
\tag{8}
$$
TP means that the model correctly detects pedestrians, AGVs, or QR codes, and FN means that pedestrians, AGVs, and QR codes are not correctly detected by the model.
The mAP50 is the mean value of the average precision of the model across each category when the threshold is set to 0.5, as expressed by Equation (9). It serves as a comprehensive indicator of the model’s performance; a higher mAP50 indicates better detection performance.
$$
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
\tag{9}
$$
N is the total number of detection categories; AP denotes average precision, calculated as shown in Equation (10); and P denotes the precision rate, defined as shown in Equation (11).
$$
AP = \int_{0}^{1} P \, dR
\tag{10}
$$
$$
P = \frac{TP}{TP + FP}
\tag{11}
$$
$FP$ indicates that the model incorrectly identifies a background region as an object, whereas $TP$ indicates an object that is correctly detected.
For the lane-line detection task, this study employed the deep-learning-based semantic segmentation method, utilizing accuracy and Intersection over Union (IoU) as evaluation metrics.
Accuracy is a standard metric for classification problems, measuring the ratio of correctly predicted samples to the total number of samples. In the context of lane-line detection, accuracy indicates the proportion of lane-line pixels accurately predicted by the model relative to all lane-line pixels. This metric reflects the model’s overall effectiveness in predicting lane lines, as defined by Equation (12).
$$
Accuracy = \frac{TP + TN}{TP + FP + FN + TN}
\tag{12}
$$
TP refers to pixel points that are correctly predicted by the model as lane lines, meaning these pixels are accurately recognized as lane lines. FP denotes false detection. FN indicates missed detection. TN refers to pixels that are correctly predicted as background and are indeed background, thus accurately identified by the model as non-lane-line regions.
When detecting lane lines using semantic segmentation, the definition of IoU differs slightly from that used in target-detection tasks. In this context, each pixel is treated as a region. Specifically, assuming the semantic segmentation model outputs a pixel-level map of lane-line predictions, the pixels identified as lane lines in the predicted results and those labeled as lane lines in the ground truth are considered regions for the IoU calculation. The IoU is computed according to Equation (13).
$$
Lane_{IoU} = \frac{P \cap P_{gt}}{P \cup P_{gt}}
\tag{13}
$$
$P$ is the pixel area predicted by the model as lane lines and $P_{gt}$ is the ground-truth lane-line pixel area.
IoU measures the overlap between the predicted lane lines and the actual lane lines. It is defined as the ratio of the intersection area to the union area of the predicted and actual lane-line regions. The IoU value ranges from 0 to 1, with values closer to 1 indicating a higher degree of overlap between the predicted and actual lane lines, signifying better lane-line segmentation performance.
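The pixel-level metrics of Equations (8)-(13) can be computed directly from binary masks, as in the following sketch.

```python
# Sketch: pixel accuracy, lane IoU, recall, and precision from binary lane masks.
import numpy as np

def lane_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> dict:
    """pred, gt: binary arrays of the same shape (1 = lane pixel, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),   # Equation (12)
        "iou": tp / (tp + fp + fn + eps),                    # Equation (13)
        "recall": tp / (tp + fn + eps),                      # Equation (8)
        "precision": tp / (tp + fp + eps),                   # Equation (11)
    }
```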

4.2. Comparisons and Analyses

To assess the efficacy of the feedback-based multi-task (MTNet) joint detection algorithm introduced in this paper, we first validated the model's performance in target detection. Subsequently, we experimentally compared this algorithm with prominent target-detection algorithms, including Faster R-CNN, YOLOv5s, and the multi-task network YOLOP. The results of these experimental comparisons are presented in Table 2.
As shown in Table 2, the multi-task joint detection network (MTNet) outperforms the other algorithms in both recall and mAP50. Specifically, compared to Faster R-CNN, YOLOv5s, and YOLOP, MTNet improves recall by 9.3, 4.7, and 3.4 percentage points, respectively; additionally, mAP50 is enhanced by 17.8, 4.1, and 6.6 percentage points.
The performance of MTNet in detecting pedestrians, automated guided vehicles (AGVs), and QR code images under normal-light, reflective, and dark environments is illustrated in Figure 15.
We then verified the model's lane-line segmentation performance and experimentally compared it with four lane-line detection algorithms: SCNN [8], ENet [32], ENet-SAD [28], and YOLOP [12]. The experiments evaluated the accuracy, intersection-over-union (IoU) ratio, and processing speed of the models.
Table 3 indicates that MTNet, based on feature shift aggregation, achieved superior results in both accuracy and IoU; its processing speed was more than double that of the original YOLOP algorithm, while accuracy improved by 3.4% and IoU by 4.2%.
To demonstrate the detection performance of the MTNet in real industrial scenarios, we selected lane-line images from various conditions in Figure 16, including normal daylight, low light at night, reflective surfaces, missing lane lines, and blocked lane lines.
Comparative experiments in different scenarios indicate that MTNet provides smoother and more complete lane-line detection around corners. MTNet maintains strong feature-extraction capabilities in dark environments and accurately detects the location and shape of lane lines even when they are far away. In addition, MTNet accurately predicts potential lane-line locations even in the presence of strong reflections. Relying on its understanding of global features and its mastery of detailed information, MTNet achieves accurate detection and reasonable prediction of potential lane-line locations even when lane lines are missing. Owing to the feature shift aggregation and the coarse and fine granularity joint up-sampling, it also detects lane lines completely and continuously even when they are blocked. Compared to YOLOP, MTNet is more suitable for lane-line detection in factory environments.
Finally, the performance of the MTNet algorithm for joint multi-task detection was evaluated on the test set, with the results presented in Figure 17.

4.3. Ablation Experiment

We performed ablation experiments to further verify the effectiveness of the feedback-based multi-task joint detection algorithm. The results are shown in Table 4.
Table 4 indicates that the multi-task joint detection algorithm proposed in this paper enhanced recall by 0.5 percentage points and mAP50 by 0.3 percentage points in the target-detection task. Additionally, it improved accuracy by 0.1 percentage points and Intersection over Union (IoU) by 0.5 percentage points in the lane-line detection task.
To further validate the effectiveness of the proposed improvements to the YOLOP algorithm for lane-line detection, ablation experiments were designed as shown in Table 5.
Table 5 shows that after replacing the backbone network of YOLOP with Darknet-53, the network exhibited enhanced feature-extraction capabilities, resulting in a 0.6% increase in lane-line detection accuracy and a 1.1% improvement in IoU. However, the FSAN and CFGU methods are more effective in optimizing YOLOP. When utilizing the FSAN alone, lane-line detection accuracy increased by 2.8% and IoU rose by 3.6%. In contrast, employing the CFGU method alone yielded a 1.7% increase in accuracy and a 2.7% increase in IoU.
The integration of both the FSAN and the CFGU resulted in improvements of 3.0% in lane-line detection accuracy and 4.0% in IoU. Additionally, replacing the backbone network while incorporating the FSAN enhanced accuracy by 3.1% and IoU by 3.8%. Similarly, substituting the backbone network and adding the CFGU increased accuracy by 2.1% and IoU by 2.8%. Overall, replacing the backbone network and adding both the FSAN and CFGU modules to YOLOP yielded the best lane-line detection performance, with an accuracy of 99.1% and an IoU of 94.2%.

5. Conclusions

In this article, we propose a multi-task joint detection algorithm (MTNet) tailored for embedded devices with limited computational resources, enabling simultaneous lane-line segmentation and the detection of pedestrians, Automated Guided Vehicles (AGVs), and QR codes. The network architecture comprises a shared feature encoder and two decoders for detection and segmentation. The encoder employs a bidirectional weighted feature pyramid structure to enhance feature representation for different tasks. In the detection decoder, a decoupled detection head is utilized to separate classification and localization tasks, thereby improving detection accuracy. Additionally, a lane-line information feedback network is integrated into the segmentation decoder to support the target-detection task branch. The experimental results demonstrate that the algorithm achieves strong performance in lane-line detection, as well as in detecting pedestrians, AGVs, and QR codes, while effectively conserving computational resources. We also propose an improved YOLOP lane-line detection algorithm based on feature-shift aggregation. The algorithm utilizes Darknet-53 as the backbone network, followed by the fusion of lane-line features using a fast spatial pyramid pooling method. Next, the feature map undergoes feature shift aggregation (FSAN) to integrate spatial global information, enhancing the algorithm’s performance in scenarios involving occluded or missing lane lines. Finally, the algorithm employs combined coarse and fine-grained up-sampling (CFGU), merging low-level edge and texture features to achieve refined up-sampling. The experimental results demonstrate that the algorithm achieves higher detection accuracy, a more complete structure, and smoother edges.
Multi-task learning (MTL) is increasingly being adopted in intelligent driving to enhance the efficiency and performance of the model. In this paper, we propose a feedback-based network for joint lane-line detection and target-detection tasks. In future work, we plan to conduct a further study of a joint detection model for traffic-vehicle detection, drivable-area detection, lane-line detection, and other tasks in factory environments.

Author Contributions

Writing-original draft preparation, C.W.; supervision and formal analysis, X.C.; writing-review and editing, Z.J., S.S. and Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Agricultural Innovation Workshop Project of Jiangsu Province [CX(21)2025] and the Graduate Practice and Innovation Project of Jiangsu Province [SJCX24-0066].

Institutional Review Board Statement

This study did not require ethical approval.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, D.; Dong, F.; Li, Z.; Xu, S. How Can Farmers’ Green Production Behavior Be Promoted? A Literature Review of Drivers and Incentives for Behavioral Change. Agriculture 2025, 15, 744.
2. Wu, X. Application status and development trend of AGV autonomous guided robots. Robot. Technol. Appl. 2012, 16–17.
3. Yao, W.; Xu, J.; Wang, H. Design of visual localization system based on visual navigation AGV. Manufact. Auto. 2020, 42, 18–22.
4. Guo, H.; Chen, H.; Wu, T. MSDP-Net: A YOLOv5-Based Safflower Corolla Object Detection and Spatial Positioning Network. Agriculture 2025, 15, 855.
5. Zhou, X.; Chen, W.; Wei, X. Improved Field Obstacle Detection Algorithm Based on YOLOv8. Agriculture 2024, 14, 2263.
6. Liang, Z.; Xu, X.; Yang, D.; Liu, Y. The Development of a Lightweight DE-YOLO Model for Detecting Impurities and Broken Rice Grains. Agriculture 2025, 15, 848.
7. Maghsoumi, H.; Masoumi, N.; Araabi, B.N. RoaDSaVe: A Robust Lane Detection Method Based on Validity Borrowing From Reliable Lines. IEEE Sens. J. 2023, 23, 14571–14582.
8. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as Deep: Spatial CNN for Traffic Scene Understanding. Proc. AAAI 2018, 32, 7276–7283.
9. Liu, P.; Yu, G.; Wang, Z.; Zhou, B.; Ming, R.; Jin, C. Uncertainty-Aware Point-Cloud Semantic Segmentation for Unstructured Roads. IEEE Sens. J. 2023, 23, 15071–15080.
10. Huang, C.; Zhang, Y.; Lei, L. Research on ELAN-Based Multimodal Teaching in International Chinese Character Micro-Lessons. In Proceedings of the 2024 13th International Conference on Educational and Information Technology (ICEIT), Chengdu, China, 22–24 March 2024; pp. 397–403.
11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
12. Wu, D.; Liao, M.-W.; Zhang, W.-T.; Wang, X.-G.; Bai, X.; Cheng, W.-Q.; Liu, W.-Y. YOLOP: You Only Look Once for Panoptic Driving Perception. Mach. Intell. Res. 2022, 19, 550–562.
13. Wu, Y.; Liu, L. Research progress of vision-based lane line detection method. J. Instrum. 2019, 40, 92–109.
14. Teichmann, M.; Weber, M.; Zollner, M.; Cipolla, R.; Urtasun, R. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1013–1020.
15. Qian, Y.; Dolan, J.M.; Yang, M. DLT-Net: Joint Detection of Drivable Areas, Lane Lines, and Traffic Objects. IEEE Trans. Intell. Transp. Syst. 2020, 21, 4670–4679.
16. Vu, D.; Ngo, B.; Phan, H. HybridNets: End-to-End Perception Network. arXiv 2022, arXiv:2203.09035.
17. Niu, J.; Lu, J.; Xu, M.; Lv, P.; Zhao, X. Robust lane detection using two-stage feature extraction with curve fitting. Pattern Recognit. 2016, 59, 225–233.
18. Jung, S.; Youn, J.; Sull, S. Efficient Lane Detection Based on Spatiotemporal Images. IEEE Trans. Intell. Transp. Syst. 2016, 17, 289–295.
19. Berriel, R.F.; de Aguiar, E.; Filho, V.V.d.S.; Oliveira-Santos, T. A Particle Filter-Based Lane Marker Tracking Approach Using a Cubic Spline Model. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Salvador, Brazil, 26–29 August 2015; pp. 149–156.
20. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 561–580.
21. Qu, Z.; Jin, H.; Zhou, Y.; Yang, Z.; Zhang, W. Focus on Local: Detecting Lane Marker from Bottom Up via Key Point. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14117–14125.
22. Wang, J.; Ma, Y.; Huang, S.; Hui, T.; Wang, F.; Qian, C.; Zhang, T. A Keypoint-based Global Association Network for Lane Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1382–1391.
23. Liu, L.; Chen, X.; Zhu, S.; Tan, P. CondLaneNet: A Top-to-down Lane Detection Framework Based on Conditional Convolution. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3753–3762.
24. Kontente, S.; Orfaig, R.; Bobrovsky, B. CLRmatchNet: Enhancing Curved Lane Detection with Deep Matching Process. arXiv 2023, arXiv:2309.15204.
25. Yang, Z.; Shen, C.; Shao, W.; Xing, T.; Hu, R.; Xu, P.; Chai, H.; Xue, R. CANet: Curved Guide Line Network with Adaptive Decoder for Lane Detection. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
26. Zoljodi, A.; Abadijou, S.; Alibeigi, M.; Daneshtalab, M. Contrastive Learning for Lane Detection via cross-similarity. Pattern Recognit. Lett. 2024, 185, 175–183.
27. Honda, H.; Uchida, Y. CLRerNet: Improving Confidence of Lane Detection with LaneIoU. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1165–1174.
28. Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning Lightweight Lane Detection CNNs by Self Attention Distillation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–3 November 2019; pp. 1013–1021.
29. Abualsaud, H.; Liu, S.; Lu, D.B.; Situ, K.; Rangesh, A.; Trivedi, M.M. LaneAF: Robust Multi-Lane Detection With Affinity Fields. IEEE Robot. Autom. Lett. 2021, 6, 7477–7484.
30. Yao, Z.; Chen, X.; Guindel, C. Efficient Lane Detection Technique Based on Lightweight Attention Deep Neural Network. J. Adv. Transp. 2022, 2022, 5134437.
31. Zheng, T.; Huang, Y.; Liu, Y.; Tang, W.; Yang, Z.; Cai, D.; He, X. CLRNet: Cross Layer Refinement Network for Lane Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 888–897.
32. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
33. Pahk, J.; Shim, J.; Baek, M.; Lim, Y.; Choi, G. Effects of Sim2Real Image Translation via DCLGAN on Lane Keeping Assist System in CARLA Simulator. IEEE Access 2023, 11, 33915–33927.
Figure 1. The architecture of YOLOP. YOLOP shares one encoder and combines three decoders to solve different tasks. The encoder consists of a backbone and a neck.
Figure 2. The architecture of MTNet. MTNet shares one encoder and combines two decoders to solve different tasks. The encoder consists of a backbone and a neck.
Figure 3. BiFPN feature-connectivity diagram.
Figure 4. Decoupled detector-head structure.
Figure 5. Lane-line detection network design.
Figure 6. Lane-line information feedback structure.
Figure 7. FSAN shift aggregation in a row direction.
Figure 8. FSAN shift aggregation in a column direction.
Figure 9. The implementation of Spatial CNN.
Figure 10. CFGU network architecture.
Figure 11. Actual experimental platform; the pictures from left to right are NVIDIA Jetson Xavier, Stereolabs ZED2i camera, and AGV.
Figure 12. Labeling of datasets. (a) pedestrian, AGV, QR code labeling; (b) example of lane-line marking.
Figure 13. Lane-line split labels. (a) Lane lines, background distinction; (b) binary lane-line labeling.
Figure 14. Data-enhancement effects. (a) raw image; (b) pan; (c) rotate; (d) flip horizontal; (e) zoom; (f) hue adjustment; (g) brightness adjustment; (h) saturation adjustment.
Figure 15. MTNet target-detection results.
Figure 16. Comparison experiment of lane-line detection under different scenarios; the pictures from left to right are real lane-line labeling, YOLOP detection results, and FSA-YOLOP detection results. (a) Comparison of experiments in dim scenes; (b) comparison of experiments in reflective scenes; (c) comparison of experiments in lane-line loss scenes; (d) comparison of experiments in lane-line occlusion scenes; (e) comparison of experiments in normal light scenes.
Figure 17. MTNet detection results in different scenarios.
Table 1. Experimental hyperparameter settings.

Parameters                     Values
Batch size                     32
Epoch                          300
Initial Learning Rate          0.001
Weight Decay                   0.0005
Box Loss Gain                  0.05
Classification Loss Gain       0.5
Lane Segmentation Loss Gain    0.2
Lane IoU Loss Gain             0.2
Table 2. Comparison results of target-detection algorithms.

Method          Recall (%)    mAP50 (%)
Faster R-CNN    80.6          80.2
YOLOv5s         94.3          93.6
YOLOP           96.2          91.5
MTNet           99.0          97.9
Table 3. Comparisons of experimental results.

Method      Accuracy (%)    IoU (%)    Speed (ms/Frame)
ENet        80.6            71.9       4.5
SCNN        85.2            74.5       17.6
ENet-SAD    87.4            76.3       7.0
YOLOP       95.7            90.0       8.7
MTNet       99.1            94.2       4.2
Table 4. Results of MTNet ablation experiments.

Detecting Branch    Split Branch    Recall (%)    mAP50 (%)    Accuracy (%)    IoU (%)
✓                   ×               98.5          97.6         -               -
×                   ✓               -             -            99.1            94.2
✓                   ✓               99.0          97.9         99.3            94.8
Table 5. Results of lane-line detection ablation experiments.

Darknet53    FSAN    CFGU    Accuracy (%)    IoU (%)
×            ×       ×       95.7            90.0
✓            ×       ×       96.3            91.1
×            ✓       ×       98.5            93.6
×            ×       ✓       97.4            92.7
✓            ✓       ×       98.8            93.8
×            ✓       ✓       98.7            94.0
✓            ×       ✓       97.8            92.8
✓            ✓       ✓       99.1            94.2
