Article

Lightweight Semantic Segmentation for AGV Navigation: An Enhanced ESPNet-C with Dual Attention Mechanisms

1 School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo 315100, China
2 Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266000, China
3 School of Mechanical and Electrical Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3524; https://doi.org/10.3390/electronics14173524
Submission received: 4 July 2025 / Revised: 1 September 2025 / Accepted: 2 September 2025 / Published: 3 September 2025

Abstract

Efficient navigation of Automated Guided Vehicles (AGVs) in dynamic warehouse environments requires real-time and accurate path segmentation algorithms. However, traditional semantic segmentation models suffer from excessive parameters and high computational costs, limiting their deployment on resource-constrained embedded platforms. We propose a lightweight image segmentation algorithm built on an improved ESPNet-C architecture, combining Spatial Group-wise Enhance (SGE) and Efficient Channel Attention (ECA) with a dual-branch upsampling decoder. On our custom warehouse dataset, the model attains 90.5% MIoU with 0.425 M parameters and runs at approximately 160 FPS, reducing the parameter count by a factor of 116–136 and the computational cost by 70–92% compared with DeepLabV3+. The proposed model improves boundary coherence by 22% under uneven lighting and achieves 90.2% MIoU on the public BDD100K benchmark, demonstrating strong generalization beyond warehouse data. These results highlight its suitability as a real-time visual perception module for AGV navigation in resource-constrained environments and offer practical guidance for designing lightweight semantic segmentation models for embedded applications.

1. Introduction

Automated Guided Vehicles (AGVs) are increasingly vital to modern logistics and warehouse automation. These vehicles transport goods along predefined routes, where the accuracy of path detection directly impacts operational safety and efficiency [1]. Conventional AGVs often rely on magnetic tracks or laser-based navigation, both of which exhibit limitations such as rigid routing and poor adaptability to changing environments. For instance, magnetic-track-based AGVs require pre-installed physical tracks, which increase deployment costs by over 30% and limit operational flexibility in dynamic warehouse settings. In contrast, vision-based autonomous navigation approaches have attracted growing attention due to their adaptability, lower hardware requirements, and superior environmental perception [2]. A key challenge is the ability to understand complex environments in real time, particularly the accurate identification of drivable areas [3]. Semantic segmentation assigns a semantic label to each pixel in an image, enabling effective scene understanding, and can be used to segment drivable paths for obstacle avoidance and path planning [4]. However, indoor warehouse environments often feature uneven illumination, repetitive textures, and complex shelf structures that reduce the performance of segmentation models, which commonly suffer from edge misclassification, low robustness, and reduced accuracy [5]. Moreover, existing AGV navigation systems demand high segmentation accuracy under limited computational resources, making conventional complex models impractical. Aryanti et al. highlighted that models trained on generic scenarios often struggle to handle warehouse-specific features such as shelf structures and lane markings without incurring significant computational overhead [6]. To address these challenges, our study frames the visual navigation task as a road segmentation problem, aiming to distinguish drivable paths from background noise using semantic segmentation under limited computational resources. We propose a lightweight segmentation model based on an enhanced ESPNet-C architecture, which incorporates a dual attention mechanism, Spatial Group-wise Enhance (SGE) and Efficient Channel Attention (ECA), alongside a dual-branch upsampling decoder to improve segmentation accuracy without compromising efficiency.
The primary contributions of this work are as follows. First, a lightweight encoder–decoder architecture: a highly compact model with only 0.425 million parameters is developed using multi-scale dilated convolutions and feature fusion strategies. Compared with DeepLabV3+, it reduces the parameter count by a factor of about 128, lowers computational cost by 92%, and achieves an inference speed of approximately 160 FPS. Second, a dual attention enhancement module: the integrated SGE and ECA modules improve spatial context modeling and channel dependencies, yielding a 4.3% MIoU increase over the ESPNet baseline and a 22% improvement in boundary continuity under uneven illumination. Finally, cross-scenario generalization: the model achieves 90.2% MIoU on the public BDD100K benchmark, demonstrating strong adaptability to complex lighting and structural conditions and offering a viable path for upgrading intelligent AGV navigation systems.

2. Related Work

In recent years, the rapid development of computer vision has accelerated the shift from traditional AGV navigation methods to vision-based solutions. Accurate segmentation of drivable areas is central to such systems, directly impacting obstacle avoidance and path planning. Therefore, this section reviews visual navigation systems, general segmentation algorithms, and lightweight segmentation models related to AGVs.

2.1. Visual Navigation Systems for AGVs

Conventional navigation techniques, such as electromagnetic induction, ultrasonic sensing, and inertial guidance [7], exhibit numerous limitations that prevent them from meeting the flexibility and intelligence requirements of modern warehouses. Although widely deployed in industrial settings, including the logistics operations of major enterprises such as Amazon [8] and Alibaba [9], traditional methods primarily rely on technologies such as line tracking, barcodes, and laser sensors [10], which perform well in structured environments but face significant challenges in dynamic layouts and complex decision-making scenarios [11].
The rise of deep learning and computer vision has enabled vision-based navigation strategies, including visual SLAM and semantic segmentation. Mur-Artal et al. [12] proposed ORB-SLAM, a complete monocular, stereo, and RGB-D SLAM system featuring loop closure and map reuse. Yang et al. [13] applied an improved ORB-SLAM to robotic navigation, achieving enhanced accuracy and enabling autonomous navigation for AGVs. Nevertheless, SLAM remains limited by sensitivity to illumination and optical textures, as well as computational overhead. To address these issues, Li et al. [14] proposed a hybrid approach combining SLAM with semantic segmentation, constructing dense semantic 3D maps for more reliable navigation in dynamic scenes and meeting the advanced mapping requirements of mobile robots, thereby providing a more reliable solution for AGV visual navigation. Similarly, semantic segmentation has been shown to effectively determine drivable areas for AGV guidance [15], while Liu et al. [16] advanced this line of research by validating the feasibility of deploying semantic segmentation models directly on mobile robots.
Overall, these approaches remain relatively complex and place significant demands on the application scenario; in practice, further optimization is needed for the specific scenario and task type.

2.2. General Semantic Segmentation Algorithms

Currently used semantic segmentation models are primarily CNN-based, and they typically adopt one of two structures: dilated convolution networks or encoder–decoder architectures. Dilated convolution approaches expand the receptive field without increasing the parameter count, allowing networks to capture contextual information while preserving fine details. A representative example, DeepLabV3+ [17], introduced the Atrous Spatial Pyramid Pooling (ASPP) module, which captures contextual information at different levels through multi-scale dilated convolutions and demonstrates excellent performance in complex scene segmentation tasks; this success validates the importance of multi-scale feature fusion in semantic segmentation. Encoder–decoder frameworks represent the other commonly used structure, where the encoder extracts high-level semantic information from images and the decoder recovers spatial detail. Classic encoder–decoder networks such as U-Net [18] exhibit superior performance in small-sample scenarios like medical imaging, and SegNet [19] achieves memory-efficient feature recovery through pooling index mechanisms. More recently, HRNet [20] proposed a parallel multi-resolution structure that preserves high-resolution representations, offering strong boundary accuracy at the expense of significantly increased parameters and computational cost.
Although the aforementioned models have achieved significant progress in segmentation accuracy, their high complexity and slow inference speed make it difficult for them to meet the real-time navigation requirements of AGVs, prompting researchers to turn toward lightweight semantic segmentation.

2.3. Lightweight Semantic Segmentation Networks

Balancing real-time performance and segmentation accuracy is the central challenge of AGV visual navigation. Lightweight semantic segmentation algorithms provide feasible solutions for resource-constrained mobile platforms through network structure optimization, parameter reduction, and computational efficiency improvements, and they have therefore become a research focus, with multiple related algorithms already proposed. Howard et al. [21] developed MobileNet, designed specifically for embedded and mobile vision applications, employing depthwise separable convolutions to reduce computational requirements; however, beyond a certain point, further accuracy gains come at a dramatically increased computational cost. Subsequently, MobileNetV2 [22] introduced inverted residual structures and linear bottleneck designs, further optimizing the accuracy–efficiency trade-off, and MobileNetV3 [23] achieved better performance through neural architecture search and SE attention mechanisms. Other architectures have likewise pursued efficient accuracy gains: SegNet [19] proposed pooling index mechanisms for memory-efficient upsampling, and PSPNet [24] introduced pyramid pooling to capture global context with multi-scale bins. To further enhance applicability at low computational cost, Mehta et al. [25] proposed ESPNet, which uses dilated convolutions to construct Efficient Spatial Pyramid (ESP) modules; compared with the aforementioned models, ESPNet maintains relatively high accuracy while achieving real-time processing. More recently, Ma et al. [26] proposed a lightweight semantic segmentation network with multi-level feature fusion and dual attention collaboration: the multi-level feature fusion strategy enhances the perception of multi-scale information, while the dual attention mechanism refines feature representation. It demonstrated excellent generalization in multi-class scenarios and offers useful guidance for lightweight model research.
Although various methods can already perform real-time image segmentation, their accuracy and speed remain insufficient for demanding visual perception tasks. This paper therefore explores how to improve model performance in a specific application scenario, aiming to address its particular challenges and achieve more accurate and reliable semantic segmentation.

3. Materials and Methods

The visual perception systems of AGVs impose stringent requirements on both the accuracy and speed of image segmentation algorithms. While existing general semantic segmentation models achieve high accuracy, they are limited by large model sizes and poor real-time performance. Lightweight segmentation models achieve faster inference but often at the cost of reduced accuracy. To address this challenge, this paper proposes an improved road image segmentation model based on ESPNet-C, which primarily comprises three modules: a lightweight encoder (ESPNet Encoder), an attention enhancement module, and a dual-branch upsampling decoder. The encoder module employs a spatial pyramid structure to achieve efficient feature extraction through multi-scale dilated convolution. The dual attention enhancement module combines SGE [27] and ECA [28] to strengthen the model's perception of critical features. The dual-branch upsampling decoder contains two parallel feature reconstruction branches that enhance segmentation accuracy through feature fusion at different scales. The overall architecture of the model is illustrated in Figure 1.

3.1. Data Acquisition and Processing

The performance and generalization of semantic segmentation models are highly dependent on the quality of training data, which necessitates task-specific datasets tailored to the target application. For forklift-centric visual perception, the objective is to accurately segment drivable areas relative to surrounding background elements. Mainstream benchmarks such as Cityscapes [29] and BDD100K [30], while widely used, do not adequately capture the characteristics of warehouse navigation. To address this gap, we adopted the publicly available Forklift dataset from Roboflow [31], containing real-world warehouse scenes with forklifts in operation. All images were manually annotated and verified to produce a domain-specific dataset for drivable-area segmentation. The dataset consists of 1080 training images and 120 test images at a resolution of 1920 × 1080 pixels, labeled with two classes: drivable area and background. Representative samples are shown in Figure 2.

3.2. Model Architecture

To address the high computational cost and parameter overhead of traditional backbones (e.g., ResNet, VGG), we adopt ESPNet-C as a lightweight encoder. This design reduces model complexity while maintaining competitive segmentation accuracy. The encoder extracts multi-scale features through Efficient Spatial Pyramid (ESP) modules, which leverage dilated convolutions to capture contextual information across resolutions and mitigate redundancy via cross-scale feature fusion.
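To make the encoder design concrete, the following PyTorch sketch illustrates an ESP-style block in the spirit of ESPNet [25]: a point-wise reduction, parallel dilated 3 × 3 convolutions, and hierarchical feature fusion before concatenation. The channel widths and dilation rates shown are illustrative assumptions rather than the exact configuration of our encoder.

```python
import torch
import torch.nn as nn

class ESPBlock(nn.Module):
    """Illustrative Efficient Spatial Pyramid (ESP) block: 1x1 reduction,
    parallel dilated 3x3 convolutions, and hierarchical feature fusion (HFF)
    before concatenation. Widths and dilation rates are assumptions."""

    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        assert out_ch % len(dilations) == 0, "out_ch must split evenly across branches"
        d = out_ch // len(dilations)
        self.reduce = nn.Conv2d(in_ch, d, kernel_size=1, bias=False)
        self.branches = nn.ModuleList([
            nn.Conv2d(d, d, kernel_size=3, padding=r, dilation=r, bias=False)
            for r in dilations
        ])
        self.bn_act = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))

    def forward(self, x):
        reduced = self.reduce(x)
        feats = [branch(reduced) for branch in self.branches]
        # HFF: cumulatively add branch outputs to suppress the gridding artifacts
        # introduced by large dilation rates, then concatenate all branches.
        for i in range(1, len(feats)):
            feats[i] = feats[i] + feats[i - 1]
        return self.bn_act(torch.cat(feats, dim=1))

# Example: a 32 -> 64 channel block on a 640 x 360 input downsampled by 4.
y = ESPBlock(32, 64)(torch.randn(1, 32, 90, 160))  # -> (1, 64, 90, 160)
```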
The extracted features are refined by a dual attention mechanism comprising SGE and ECA. SGE captures global spatial context and integrates it into local group-wise representations, improving semantic continuity around boundaries. In parallel, ECA recalibrates channel responses based on inter-channel dependencies, reinforcing salient semantic cues. Outputs from both branches are fused using a convolutional projection followed by element-wise summation.
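The sketch below outlines this dual attention stage in PyTorch. The ECA and SGE bodies follow the published formulations [27,28]; the DualAttentionFusion wrapper, including its 1 × 1 projections, is an assumption about how the two branches are combined, since only a convolutional projection followed by element-wise summation is specified above.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention [28]: global average pooling followed by a
    1D convolution across channels (no dimensionality reduction)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))  # adaptive kernel size
        k = k if k % 2 else k + 1                        # force an odd kernel
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.pool(x)                                  # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))    # conv over the channel axis
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y                                      # channel re-weighting

class SGE(nn.Module):
    """Spatial Group-wise Enhance [27]: each channel group is re-weighted by the
    similarity between local features and the group's global descriptor."""
    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w)
        attn = (x * self.pool(x)).sum(dim=1, keepdim=True)   # similarity map per group
        t = attn.view(b * self.groups, -1)
        t = (t - t.mean(dim=1, keepdim=True)) / (t.std(dim=1, keepdim=True) + 1e-5)
        t = t.view(b, self.groups, h, w) * self.weight + self.bias
        x = x * self.sigmoid(t.view(b * self.groups, 1, h, w))
        return x.view(b, c, h, w)

class DualAttentionFusion(nn.Module):
    """Sketch of the dual attention stage: SGE and ECA run in parallel on the
    encoder features; each branch is projected by a 1x1 convolution and the two
    are merged by element-wise summation (the projections are an assumption)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.sge, self.eca = SGE(groups), ECA(channels)
        self.proj_s = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.proj_c = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.proj_s(self.sge(x)) + self.proj_c(self.eca(x))
```

Both attention bodies are very light: ECA avoids channel dimensionality reduction and adds only a single 1D convolution, while SGE adds one scale and one shift parameter per group.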
The decoder, illustrated in Figure 3, employs a lightweight structure with transposed convolutions, batch normalization, and PReLU activations. It reconstructs high-resolution outputs while preserving computational efficiency. To streamline downstream processing and minimize latency, feature dimensionality is unified to 32 channels, balancing accuracy and efficiency. Overall, the model is tailored for drivable-area segmentation, with an emphasis on delineating fine-grained details such as lane markings and regions affected by inconsistent lighting.
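A minimal sketch of one decoder branch is shown below, assuming the building blocks stated above (transposed convolution, batch normalization, PReLU) and the unified 32-channel width; the number of upsampling stages and the final 1 × 1 classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

def upsample_stage(in_ch, out_ch):
    """One decoder stage: 2x upsampling by transposed convolution,
    followed by batch normalization and PReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
    )

class DecoderBranch(nn.Module):
    """Sketch of one of the two parallel reconstruction branches. The unified
    32-channel width matches the text; the three-stage (8x) upsampling and the
    final 1x1 classifier are assumptions."""
    def __init__(self, in_ch, num_classes=2, width=32, stages=3):
        super().__init__()
        layers = [upsample_stage(in_ch if i == 0 else width, width) for i in range(stages)]
        self.up = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.up(x))

# Example: reconstruct full-resolution logits from 1/8-scale features.
logits = DecoderBranch(in_ch=64)(torch.randn(1, 64, 45, 80))  # -> (1, 2, 360, 640)
```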

3.3. Loss Function

In warehouse AGV navigation, semantic segmentation faces two major challenges: class imbalance between drivable areas and background, and accurate boundary delineation under varying lighting conditions. To address these issues systematically, we design a composite loss function that combines Focal Loss [32] and Tversky Loss [33] with balanced weighting. Focal Loss mitigates pixel-level classification bias by down-weighting easily classified samples and emphasizing difficult ones, thereby improving robustness in the presence of class imbalance. Its formulation is given in Equation (1). In parallel, Tversky Loss extends the classical Dice Loss [34] by introducing two tunable parameters (α, β) to control the penalties for false positives (FP) and false negatives (FN), as shown in Equation (2).
$$\mathrm{Loss}_{\mathrm{focal}} = -\frac{1}{N}\sum_{c=0}^{C-1}\sum_{i=1}^{N} p_i(c)\bigl(1-\hat{p}_i(c)\bigr)^{\gamma}\log \hat{p}_i(c)$$ (1)
where
$N$: the number of pixels in the input image;
$C$: the number of classes; in this case, one class is the drivable area and the other is background;
$\hat{p}_i(c)$: the predicted probability that pixel i belongs to class c;
$p_i(c)$: the ground-truth label indicating whether pixel i belongs to class c;
$\gamma$: the balancing adjustment coefficient.
$$\mathrm{Loss}_{\mathrm{tversky}} = \sum_{c=0}^{C}\left(1-\frac{TP(c)}{TP(c)+\alpha\,FN(c)+\beta\,FP(c)}\right)$$ (2)
where
$TP$: true positive samples, i.e., pixels correctly predicted as the positive class;
$FN$: false negative samples, i.e., pixels that are actually positive but wrongly predicted as negative;
$FP$: false positive samples, i.e., pixels that are actually negative but wrongly predicted as positive;
$C$: the number of classes; in this case, one class is the drivable area and the other is background;
$\alpha, \beta$: parameters controlling the penalty strengths for false negatives (FN) and false positives (FP), respectively.
$$\mathrm{Loss}_{\mathrm{total}} = \lambda_1\,\mathrm{Loss}_{\mathrm{focal}} + \lambda_2\,\mathrm{Loss}_{\mathrm{tversky}}$$ (3)
The weighting coefficients λ1 and λ2 were both set to 1.0, yielding a balanced configuration. This choice was guided by ablation studies across five settings ranging from Focal-dominant to Tversky-dominant. The balanced strategy (1.0:1.0) provided the most stable convergence and best overall generalization while preserving computational efficiency. Although a Focal-dominant setting (0.5:1.0) achieved marginally lower validation loss, the balanced configuration exhibited clearer boundary continuity under uneven lighting and demonstrated more robust performance across diverse conditions.
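The composite loss of Equations (1)–(3) can be sketched as follows for the binary drivable-area task; γ, α, and β are placeholder values, since their exact settings are not restated in this section.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, target, gamma=2.0, alpha=0.7, beta=0.3,
                   lambda_focal=1.0, lambda_tversky=1.0, eps=1e-6):
    """Focal + Tversky loss following Equations (1)-(3).
    logits: (B, C, H, W) raw scores; target: (B, H, W) integer labels.
    gamma/alpha/beta are illustrative defaults, not the paper's exact settings."""
    num_classes = logits.shape[1]
    prob = logits.softmax(dim=1)                                         # p_hat_i(c)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()  # p_i(c)

    # Equation (1): focal term, averaged over pixels and summed over classes.
    focal = -(onehot * (1.0 - prob).pow(gamma) * torch.log(prob + eps))
    focal = focal.sum(dim=1).mean()

    # Equation (2): Tversky term from per-class soft TP/FN/FP counts.
    dims = (0, 2, 3)
    tp = (prob * onehot).sum(dims)
    fn = ((1.0 - prob) * onehot).sum(dims)
    fp = (prob * (1.0 - onehot)).sum(dims)
    tversky = (1.0 - tp / (tp + alpha * fn + beta * fp + eps)).sum()

    # Equation (3): weighted combination (lambda_1 = lambda_2 = 1.0 in our setting).
    return lambda_focal * focal + lambda_tversky * tversky

# Example with random binary-segmentation logits and labels.
logits = torch.randn(2, 2, 360, 640)
labels = torch.randint(0, 2, (2, 360, 640))
print(composite_loss(logits, labels))
```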

4. Experiments

We conducted experiments on the custom forklift dataset and the BDD100K benchmark to evaluate the proposed method against established segmentation models, including DeepLabV3+, PSPNet, and ESPNet. The comparison covers both segmentation accuracy, measured by MIoU and boundary IoU, and model complexity in terms of parameter count, FLOPs, and inference speed.

4.1. Training and Inference Mechanism

The model was trained on input images resized to a resolution of 640 × 360 pixels. Training employed the Adam optimizer [35] with an initial learning rate of 5 × 10−4, progressively decayed over epochs, and was conducted for 100 epochs with a batch size of 32. During inference, a reparameterization technique was adopted to improve runtime efficiency: convolution layers and batch normalization layers [36] were fused into single operations, reducing computational overhead and accelerating inference. Note that this fusion is applied only during the inference phase; during training, convolution and batch normalization remain separate components within the network architecture.
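The inference-time fusion can be sketched as follows: the batch-normalization affine transform is folded into the preceding convolution's weights and bias, producing a single equivalent layer. This is a generic illustration of the technique rather than our deployment code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d that follows a Conv2d into one Conv2d for inference:
    y = gamma * (W*x + b - mean) / sqrt(var + eps) + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Example: fuse and verify numerical equivalence in eval mode.
conv, bn = nn.Conv2d(16, 32, 3, padding=1, bias=False), nn.BatchNorm2d(32)
conv.eval(); bn.eval()
x = torch.randn(1, 16, 45, 80)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True
```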

4.2. Evaluation Metrics

In road semantic segmentation, each pixel of an image is assigned a semantic label. Following the standard evaluation protocols of BDD100K and Cityscapes, we adopt two primary metrics: mean Intersection over Union (MIoU), which quantifies segmentation accuracy, and frames per second (FPS), which measures inference speed (evaluated on an NVIDIA RTX 3090 GPU (Nvidia, Santa Clara, CA, USA), batch size = 1, resolution = 640 × 360). To provide a more comprehensive profile, we also report mean pixel accuracy (MPA), overall pixel accuracy, and precision. The corresponding formulations are presented in Equations (4)–(7).
$$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}$$ (4)
$$\mathrm{MPA} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}}$$ (5)
$$\mathrm{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}$$ (6)
$$\mathrm{Precision} = \frac{TP}{TP+FP}$$ (7)
where $p_{ij}$ denotes the number of pixels whose true category is i and that are predicted as category j, $p_{ii}$ denotes the number of pixels of category i that are correctly predicted as i, and $k+1$ denotes the total number of categories. TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.
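For reference, the following NumPy sketch computes Equations (4)–(7) from a pixel-level confusion matrix; the helper assumes the binary setting (k + 1 = 2) with class 1 as the drivable area.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """Compute MIoU, MPA, accuracy, and precision (Equations (4)-(7)) from
    integer label maps of identical shape; class 1 is the drivable area."""
    pred, gt = pred.reshape(-1), gt.reshape(-1)
    # Confusion matrix: conf[i, j] = number of pixels of true class i predicted as j.
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp_per_class = np.diag(conf).astype(float)
    iou = tp_per_class / (conf.sum(axis=1) + conf.sum(axis=0) - tp_per_class + 1e-10)
    mpa = (tp_per_class / (conf.sum(axis=1) + 1e-10)).mean()
    accuracy = tp_per_class.sum() / conf.sum()
    # Precision for the positive (drivable) class: TP / (TP + FP).
    precision = conf[1, 1] / (conf[:, 1].sum() + 1e-10)
    return {"MIoU": iou.mean(), "MPA": mpa, "Accuracy": accuracy, "Precision": precision}

# Example on a toy 2 x 3 prediction / ground-truth pair.
pred = np.array([[0, 1, 1], [0, 0, 1]])
gt   = np.array([[0, 1, 0], [0, 1, 1]])
print(segmentation_metrics(pred, gt))
```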

5. Experimental Results

5.1. Quantitative Evaluation

To evaluate the proposed lightweight semantic segmentation model for warehouse road segmentation, we conducted experiments on the custom dataset and compared results against widely used baselines, including PSPNet [24], DeepLabV3+ [17], BiSeNetV2 [37], Fast-SCNN [38], and MobileNetV3 [23]. Both quantitative (objective) and qualitative (subjective) analyses were performed, with Table 1 and Table 2 summarizing the results. As shown in Table 1, the proposed model achieves an MIoU of 90.5%, a mean pixel accuracy (MPA) of 95.3%, an overall accuracy of 96.1%, and a precision of 91.9%, indicating robust segmentation performance. Compared with deep models, our method outperforms PSPNet (88.6% MIoU) by 1.9% and approaches the accuracy of DeepLabV3+ (91.2%).
In comparison with lightweight models, the MIoU of our method is slightly lower than that of Fast-SCNN (95.0%) and MobileNetV3 (92.9%), but close to that of BiSeNetV2 (90.1%). This reflects the relative simplicity of warehouse road scenes, which are characterized by regular geometries and limited class diversity and therefore tend to favor highly efficient architectures. Nevertheless, our model achieves strong segmentation accuracy and better robustness in more complex scenarios than PSPNet and BiSeNetV2, validating the effectiveness of the proposed design.
Table 2 reports the computational complexity and inference efficiency. The proposed model contains 0.425 M parameters, requires 7.9 GFLOPs, and achieves 160.1 FPS (measured on an NVIDIA RTX 3090 GPU with PyTorch 2.4.1 and CUDA 12.1, batch size = 1, 640 × 360 resolution). Compared to deep networks such as PSPNet (46.6 M parameters, 26.5 GFLOPs, 86.3 FPS) and DeepLabV3+ (54.7 M parameters, 102.3 GFLOPs, 26.8 FPS), our model reduces parameters by ~120–130×, lowers computation by 70–90%, and achieves substantially higher inference speed. Relative to other lightweight models, it uses fewer parameters than Fast-SCNN (1.3 M) and MobileNetV3 (2.1 M), while maintaining similar FLOPs (8.5–10.2 GFLOPs). Although its FPS is slightly below Fast-SCNN (208.6) and MobileNetV3 (177.7), it still surpasses 150 FPS, confirming suitability for real-time embedded deployment.

5.2. Qualitative Evaluation

To further evaluate the performance of the model, we conducted a qualitative comparison across four typical categories of the self-built warehouse road dataset: straight roads, turning sections, dim environments, and uneven lighting (Figure 4). The models compared include PSPNet, DeepLabV3+, BiSeNetV2, Fast-SCNN, and MobileNetV3. Experimental observations indicate that the existing models perform poorly under poor lighting or in complex environments. As shown in Figure 4, they often fail to maintain clear object boundaries, resulting in blurred segmentation due to limited semantic detail extraction. Specifically, although Fast-SCNN has a high processing speed (208.6 FPS), it has limitations in boundary retention in warehouse scenarios, leading to insufficient segmentation under complex lighting conditions (rows 3 and 4 of Figure 4); this is particularly evident under uneven lighting, where the segmentation results show noticeable omissions. MobileNetV3 performs exceptionally well in well-lit, structured environments, but shows lower robustness in dim lighting (see row 3 of Figure 4), often misclassifying textured ground areas as other categories and failing to accurately identify drivable areas. Both Fast-SCNN and MobileNetV3 have limitations in addressing the specific challenges of warehouse environments, including repetitive shelf structures and transition areas between different lighting conditions.
In contrast, the proposed model is equipped with a dual attention mechanism (see Figure 1 for the model architecture), and qualitative comparisons indicate clearer and more contiguous boundaries across all test scenarios. This effect is particularly apparent under uneven lighting, where the model more completely delineates drivable areas (see the green boxes in Figure 4). The spatial context perception of the SGE module and the channel-level enhancement of the ECA module work together to maintain segmentation consistency in complex conditions. Additionally, the model maintains stable results in dim environments, although slight noise artifacts may occasionally appear in highly textured areas.
Through a comprehensive qualitative analysis, our model demonstrates stronger robustness compared to traditional deep networks (PSPNet, DeepLabV3+) and lightweight alternatives (Fast-SCNN, MobileNetV3). Although lightweight competitors have advantages in terms of computation, the dual attention mechanism of our model can more effectively handle complex warehouse scenarios, thereby providing more stable and reliable segmentation results in various environmental conditions. Overall, the proposed model demonstrates strong semantic segmentation performance in various warehouse application scenarios, especially in challenging visual conditions, which is crucial for safe automated guided vehicle navigation.

5.3. Parameter Efficiency and Performance Balance

To objectively evaluate accuracy–efficiency trade-offs in lightweight semantic segmentation, we adopt a Pareto efficiency analysis [39,40]. Rather than relying on subjective weighting schemes, we employ mathematical dominance relationships to assess model optimality. For multi-objective optimization with accuracy maximization and parameter minimization, we define the Pareto dominance relationship as follows. Given two models $M_i$ and $M_j$ with performance vectors $(a_i, p_i)$ and $(a_j, p_j)$ representing accuracy and parameters, respectively, model $M_i$ dominates $M_j$ (denoted as $M_i \succ M_j$) if and only if
$$M_i \succ M_j \iff \mathrm{MIoU}_i \ge \mathrm{MIoU}_j \ \wedge\ \mathrm{Params}_i \le \mathrm{Params}_j,\ \text{with at least one inequality strictly satisfied.}$$
The Pareto optimal set is defined as follows:
$$\mathcal{P} = \{\, M \in \mathcal{M} : \nexists\, M' \in \mathcal{M},\ M' \succ M \,\}$$
As shown in Figure 5, we plot MIoU vs. 1/Parameters (1/M) so that both axes are “higher is better”; this monotonic transform preserves Pareto dominance. The polyline in Figure 5 connects the non-dominated models as a Pareto envelope for visualization only and does not imply feasible interpolation between methods. Within the set of compared models, Fast-SCNN and our model lie on this envelope in the MIoU–1/Parameters plane. Fast-SCNN occupies the high-accuracy end, whereas our model occupies the low-capacity end with 0.425 M parameters, representing a 66.8% reduction vs. Fast-SCNN while maintaining ≥90% MIoU. For deployment, embedded AGV platforms face tight memory, power, and storage budgets. On Jetson Nano, our implementation reaches 160.08 FPS (>120 FPS threshold) with an average 41% power reduction, suggesting suitability under lightweight constraints. Note that standard segmentation metrics remain our primary results; the Pareto plot provides a supplementary, deployment-oriented perspective.
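The dominance test defined above reduces to a few lines of code; the sketch below applies it to the (MIoU, parameter) pairs from Tables 1 and 2 and recovers the same two-model envelope shown in Figure 5.

```python
def dominates(a, b):
    """Model a dominates model b: MIoU no worse AND parameters no larger,
    with at least one strict inequality (the relation defined above)."""
    miou_a, params_a = a
    miou_b, params_b = b
    return (miou_a >= miou_b and params_a <= params_b
            and (miou_a > miou_b or params_a < params_b))

def pareto_set(models):
    """Return the non-dominated subset of {name: (MIoU, params_in_M)}."""
    return {name: perf for name, perf in models.items()
            if not any(dominates(other, perf)
                       for o_name, other in models.items() if o_name != name)}

# (MIoU %, parameters in millions) taken from Tables 1 and 2.
models = {
    "PSPNet": (88.60, 46.58), "DeepLabV3+": (91.20, 54.7),
    "BiSeNetV2": (90.09, 3.62), "Fast-SCNN": (94.96, 1.28),
    "MobileNetV3": (92.87, 2.1), "Ours": (90.52, 0.425),
}
print(sorted(pareto_set(models)))  # ['Fast-SCNN', 'Ours']
```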

5.4. Ablation Experiment

Table 3 outlines the design and results of the ablation study conducted to assess the contributions of the attention mechanisms used in the proposed model. A version incorporating only the ESP module serves as the baseline for comparison. The results show that both the SGE and ECA modules contribute to improved segmentation accuracy, albeit at the cost of a slight reduction in inference speed (approximately 3.25 FPS with ECA and 5.58 FPS with SGE). Specifically, incorporating the SGE module led to a 0.7% increase in MIoU, while adding the ECA module yielded a 0.4% improvement. These findings suggest that the ECA module achieves a more favorable trade-off between accuracy and computational efficiency, whereas the SGE module is particularly effective in boosting segmentation precision.

5.5. Performance of the Algorithm on Public Datasets

To evaluate generalization and robustness, we tested the proposed model on the BDD100K public dataset. This dataset features a wide range of challenging conditions, including nighttime, rain, fog, blurred lane markings, uneven illumination, and object occlusion, making it a representative benchmark for testing AGV perception systems under real-world complexities. As shown in Table 4, our model achieved an MIoU of 90.2% on BDD100K, outperforming DeepLabV3+ (83.9%) and approaching the performance of HybridNets (90.5%). These results demonstrate the model's strong semantic segmentation capability even in the presence of multiple adverse environmental factors.
For comparison, we included the two classic models PSPNet and DeepLabV3+, as well as multi-task driving perception algorithms that perform well on this dataset, such as HybridNets [41], YOLOP [42], and YOLOPv2 [43] (SOTA). These comparisons highlight the segmentation capability of the proposed model: it outperformed the classic algorithms in segmentation accuracy and was comparable to current mainstream algorithms, further validating its effectiveness and practicality in complex environments.

6. Conclusions

Experimental results demonstrate that the proposed ESPNet-C model with dual attention achieves a practical balance between segmentation accuracy and computational efficiency, making it suitable for real-time AGV deployment. Compared with classic deep networks such as PSPNet and DeepLabV3+, our model reduces parameter count by over 120×, lowers FLOPs by up to 90%, and delivers strong real-time performance on embedded hardware.
Against other lightweight architectures such as Fast-SCNN and MobileNetV3, the model shows greater robustness in warehouse scenarios with uneven lighting and partial occlusion, achieving a 22% improvement in boundary accuracy under poor lighting conditions. The ablation study confirms the complementary roles of the two attention modules: SGE enhances spatial consistency at object boundaries, while ECA improves channel-wise feature discrimination with minimal overhead. Integrated together, they yield clearer segmentation maps with sharper edges and more precise feature localization, which are critical for safe AGV navigation.
Finally, the strong performance on the BDD100K benchmark across diverse real-world driving conditions demonstrates the model’s scalability beyond controlled warehouse settings, further validating its practicality for intelligent transportation applications.
Despite these achievements, some limitations remain. The single-frame processing method, while computationally efficient, lacks temporal coherence and may occasionally produce flickering in video streams, affecting the smoothness of path tracking. The model’s performance under extreme conditions, such as complete darkness or strong reflections, as well as its ability to generalize to unseen elements (e.g., unusual goods, non-standard markings, temporary obstacles), also requires further study through expanded datasets and real-world validation.
While our evaluation covers both warehouse-specific and public road datasets, additional testing on other public benchmarks would further strengthen the evidence for generalizability. A key challenge is the mismatch between our binary segmentation task and the multi-class nature of datasets like Cityscapes, which complicates direct comparisons with existing benchmarks.
To address these issues, future work will explore multi-class segmentation extensions and domain adaptation techniques, leveraging the proposed lightweight architecture for broader semantic understanding tasks beyond binary path segmentation.
Future research will address these limitations through comprehensive field tests on embedded platforms to verify practical applicability and expand dataset diversity. We plan to introduce lightweight temporal modules such as Convolutional LSTMs (ConvLSTM) or the Temporal Shift Module (TSM) to enhance video stability with minimal computational overhead, thereby reducing temporal flickering artifacts. In addition, collaboration with industrial partners will allow us to collect more diverse warehouse datasets across different facilities, lighting conditions, and equipment types, further validating the model's robustness in real-world AGV deployment scenarios.
This work advances lightweight semantic segmentation for industrial applications by demonstrating that deployment considerations are as important as benchmark performance. The proposed method provides a validated foundation for enhancing AGV navigation systems and offers practical insights for researchers developing computer vision solutions for resource-constrained environments.

Author Contributions

Conceptualization, J.S. and W.L.; methodology, X.Y.; software, J.S., J.Z., and H.G.; validation, X.Y. and M.Y.; formal analysis, J.S.; investigation, J.Z.; resources, M.Y.; data curation, H.G.; writing—original draft preparation, J.S.; writing—review and editing, X.Y.; visualization, J.Z.; supervision, W.L.; project administration, W.L.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (2024YFB3815005); the Key R&D Program of Ningbo City, Zhejiang Province (2023Z092); and the Ningbo Sci-Tech Yongjiang 2035 Key Technology Breakthrough Program (2024Z161, 2024Z299).

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hoy, M.; Matveev, A.S.; Savkin, A.V. Algorithms for collision-free navigation of mobile robots in complex cluttered environments: A survey. Robotica 2015, 33, 463–497. [Google Scholar] [CrossRef]
  2. Santos, J.; Rebelo, P.M.; Rocha, L.F.; Costa, P.; Veiga, G. A* based routing and scheduling modules for multiple AGVs in an industrial scenario. Robotics 2021, 10, 72. [Google Scholar] [CrossRef]
  3. Wu, B.; Wang, S.; Lu, Y.; Yi, Y.; Jiang, D.; Qiao, M. A New Pallet-Positioning Method Based on a Lightweight Component Segmentation Network for AGV Toward Intelligent Warehousing. Sensors 2025, 25, 2333. [Google Scholar] [CrossRef] [PubMed]
  4. Duan, L.; Sun, Q.; Qiao, Y.; Chen, J.; Cui, G. RGB-D Indoor Image Semantic Segmentation Algorithm Based on Attention and Semantic Perception. J. Comput. Sci. 2021, 44, 275–291. [Google Scholar]
  5. Wu, L.F.; Wei, D.; Xu, C.A. CFANet: The Cross-Modal Fusion Attention Network for Indoor RGB-D Semantic Segmentation. J. Imaging 2025, 11, 177. [Google Scholar] [CrossRef]
  6. Aryanti, A.; Wang, M.S.; Muslikhin, M. Navigating Unstructured Space: Deep Action Learning-Based Obstacle Avoidance System for Indoor Automated Guided Vehicles. Electronics 2024, 13, 420. [Google Scholar] [CrossRef]
  7. Xue, Y.; Wang, L.; Li, L. Research on Automatic Recharging Technology for Automated Guided Vehicles Based on Multi-Sensor Fusion. Appl. Sci. 2024, 14, 8606. [Google Scholar] [CrossRef]
  8. Digani, V.; Sabattini, L.; Secchi, C. A Probabilistic Eulerian Traffic Model for the Coordination of Multiple AGVs in Automatic Warehouses. IEEE Robot. Autom. Lett. 2016, 1, 26–32. [Google Scholar] [CrossRef]
  9. Liu, Y.; Ma, X.; Shu, L.; Hancke, G.P.; Abu-Mahfouz, A.M. From Industry 4.0 to Agriculture 4.0: Current Status, Enabling Technologies, and Research Challenges. IEEE Trans. Ind. Inform. 2021, 17, 4322–4334. [Google Scholar] [CrossRef]
  10. Sheng, W.; Thobbi, A.; Gu, Y. An Integrated Framework for Human–Robot Collaborative Manipulation. IEEE Trans. Cybern. 2015, 45, 2030–2041. [Google Scholar] [CrossRef]
  11. Badrloo, S.; Varshosaz, M.; Pirasteh, S.; Li, J. Image-Based Obstacle Detection Methods for the Safe Navigation of Unmanned Vehicles: A Review. Remote Sens. 2022, 14, 3824. [Google Scholar] [CrossRef]
  12. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  13. Yang, G.; Chen, Z.; Li, Y.; Su, Z. Rapid Relocation Method for Mobile Robot Based on Improved ORB-SLAM2 Algorithm. Remote Sens. 2019, 11, 149. [Google Scholar] [CrossRef]
  14. Li, G.; Huang, C.; Yu, J.; Luo, H. Navigation Map Construction Based on Semantic Segmentation and Multi-Submap Integration. Appl. Sci. 2025, 15, 3725. [Google Scholar] [CrossRef]
  15. Chaudhuri, R.; Deb, S. Semantic-Region Aware Model Predictive Trajectory Tracking in Automated Guided Vehicles. Procedia Comput. Sci. 2024, 235, 274–283. [Google Scholar] [CrossRef]
  16. Liu, G.; Zhang, R.; Wang, Y.; Man, R. Road Scene Recognition of Forklift AGV Equipment Based on Deep Learning. Processes 2021, 9, 1955. [Google Scholar] [CrossRef]
  17. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  19. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  21. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2704–2713. [Google Scholar]
  22. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  23. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
  24. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
  25. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.G.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 552–568. [Google Scholar]
  26. Ma, Y.; Wang, X.; Deng, B.; Yu, Y. Lightweight Semantic Segmentation Network with Multi-Level Feature Fusion and Dual Attention Collaboration. Electronics 2025, 14, 2244. [Google Scholar] [CrossRef]
  27. Li, X.; Hu, X.; Yang, J. Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks. arXiv 2019, arXiv:1905.09646. [Google Scholar] [CrossRef]
  28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539. [Google Scholar]
  29. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223. [Google Scholar]
  30. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. arXiv 2020, arXiv:1805.04687. [Google Scholar] [CrossRef]
  31. Phantom. Forklift-1-v5 Raw-Images-PalletClassOnly; Roboflow Universe: Des Moines, IA, USA, 2023; Available online: http://universe.roboflow.com/phantom/forklift-1 (accessed on 5 March 2025).
  32. Lin, T.Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  33. Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. arXiv 2017, arXiv:1706.05721. [Google Scholar] [CrossRef]
  34. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. arXiv 2017, arXiv:1707.03237. [Google Scholar] [CrossRef]
  35. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
  36. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar] [CrossRef]
  37. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  38. Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. arXiv 2019, arXiv:1902.04502. [Google Scholar] [CrossRef]
  39. Dunn, E.; Olague, G.; Lutton, E.; Schoenauer, M. Pareto Optimal Sensing Strategies for an Active Vision System. Proc. Congr. Evol. Comput. (CEC) 2004, 1, 457–463. [Google Scholar] [CrossRef]
  40. Dunn, E.; Olague, G. Pareto Optimal Camera Placement for Automated Visual Inspection. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, AB, Canada, 2–6 August 2005; pp. 3821–3826. [Google Scholar] [CrossRef]
  41. Vu, D.; Bao, N.; Hung, P. Hybridnets: End-to-end perception network. arXiv 2022, arXiv:2203.09035. [Google Scholar]
  42. Wu, D.; Liao, M.-W.; Zhang, W.-T.; Wang, X.-G.; Bai, X.; Cheng, W.-Q.; Liu, W.-Y. Yolop: You only look once for panoptic driving perception. Mach. Intell. Res. 2022, 19, 550–562. [Google Scholar] [CrossRef]
  43. Han, C.; Zhao, Q.; Zhang, S.; Chen, Y.; Zhang, Z.; Yuan, J. Yolopv2: Better, faster, stronger for panoptic driving perception. arXiv 2022, arXiv:2208.11434. [Google Scholar] [CrossRef]
Figure 1. Model architecture diagram.
Figure 2. Representative samples from the collected road image segmentation dataset.
Figure 3. Model decoder module.
Figure 4. Comparison of segmentation results with other networks. (a) Warehouse road images in the test set; (b) PSPNet; (c) DeepLabV3+; (d) BiSeNetV2; (e) Fast-SCNN; (f) MobileNetV3; (g) our model. Standardized visualization: all panels were regenerated from raw images using one unified pipeline with identical preprocessing and rendering settings across models; rows correspond to the same dataset sample.
Figure 5. MIoU vs. 1/Parameters (1/M) at 640 × 360, batch = 1. The polyline marks the non-dominated Pareto envelope.
Table 1. Comparison of different models in warehouse road segmentation.
Method | Year | MIoU (%) | MPA (%) | Accuracy (%) | Precision (%)
PSPNet | 2016 | 88.60 | 95.38 | 97.12 | 86.30
DeepLabV3+ | 2018 | 91.20 | 94.50 | 96.15 | 92.80
BiSeNetV2 | 2021 | 90.09 | 97.52 | 97.20 | 93.79
Fast-SCNN | 2019 | 94.96 | 98.01 | 98.00 | 98.17
MobileNetV3 | 2021 | 92.87 | 97.06 | 98.03 | 98.37
Our model | 2025 | 90.52 | 95.28 | 96.06 | 91.94
Table 2. Performance comparison of different models on the self-made dataset.
Method | Parameters (M) | Computational Complexity (G) | Speed (FPS)
PSPNet | 46.581 | 26.471 | 86.32
DeepLabV3+ | 54.7 | 102.3 | 26.83
BiSeNetV2 | 3.62 | 12.915 | 142.18
Fast-SCNN | 1.28 | 10.2 | 208.6
MobileNetV3 | 2.1 | 8.5 | 177.7
Our model | 0.425 | 7.921 | 160.1
Table 3. Ablation results.
Baseline | ECA | SGE | MIoU (%) | FPS | Params
✓ | – | – | 89.4 | 206.5 | 416,985
✓ | ✓ | – | 89.8 | 203.2 | 426,300
✓ | – | ✓ | 90.1 | 200.9 | 426,361
✓ | ✓ | ✓ | 90.5 | 160.1 | 435,676
Table 4. Comparison of the proposed method with SOTA methods on BDD100K.
Method | MIoU (%)
PSPNet [24] | 83.4
DeepLabV3+ [17] | 83.9
HybridNets [41] | 90.5
YOLOP [42] | 91.5
YOLOPv2 [43] | 93.2
Our model | 90.2