4.3.1. Ablation Study on GLCA and SWConv Modules
In the ASPP module, standard convolutions and convolutions with different dilation rates are used to capture multi-scale contextual information. The first layer of the ASPP module employs a standard 1 × 1 convolution, which has the smallest receptive field and mainly serves to retain fine-grained local texture information, but is limited in terms of semantic modeling. The second and third layers use dilated convolutions with rates of 6 and 12, respectively, providing medium-to-large receptive fields; these layers strike a balance between capturing local details and fusing contextual semantics. The fourth layer employs a dilated convolution with a rate of 18, which provides the largest receptive field and is thus better suited to capturing wide-range, global semantic information. However, as the receptive field grows, the sampling points of the convolution become increasingly sparse. This sparsity, most pronounced in the rate = 18 path, impairs the model’s ability to perceive local details and object boundaries, making it difficult to model targets with complex shapes or irregular structures.
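For concreteness, the branch layout described above can be sketched in PyTorch as follows; the channel widths and the omission of batch normalization and activations are simplifications for illustration, not the exact configuration of our network. Note that a 3 × 3 convolution with dilation d spans a (2d + 1) × (2d + 1) window while sampling only nine points inside it, which is the source of the sparsity in the rate = 18 path.

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Illustrative ASPP branch layout: a 1x1 convolution plus dilated
    3x3 convolutions at rates 6, 12, and 18 (channel counts are
    placeholders, not the values used in the actual model)."""
    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)               # local texture
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)    # medium receptive field
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)  # larger receptive field
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)  # largest field, sparsest sampling
        self.project = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)          # fuse the four branches

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        return self.project(torch.cat(feats, dim=1))
```

With padding equal to the dilation rate, each branch preserves the spatial resolution, so the four outputs can be concatenated directly.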
To address this issue, we incorporate the GLCA and SWConv modules into the rate = 18 path of the ASPP module, aiming to enhance this path’s capacity for semantic fusion and structural modeling. The GLCA module leverages bidirectional feature interaction and a dynamic weighting mechanism to effectively integrate global semantics with local detail, enhancing the representation of key regions and mitigating issues such as semantic ambiguity and blurred boundaries. This is particularly beneficial for the rate = 18 path, where the receptive field is large but information is sparse. The SWConv module introduces an asymmetric sampling topology and six-directional perception paths, breaking conventional convolution’s reliance on regular structures. This improves the model’s ability to capture object shapes, edge contours, and spatial structures, thereby compensating for the rate = 18 path’s deficiencies in shape adaptability. In contrast, the 1 × 1 convolution path and low-to-medium rate dilated convolution paths already possess strong capabilities in capturing local features due to their denser structures. As a result, the benefits of adding GLCA and SWConv to these paths are relatively limited. The rate = 18 path, however, combines high-level semantic abstraction with a significant information gap, making it the optimal location for the combined application of both modules.
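The dynamic weighting idea behind GLCA can be pictured with a minimal sketch. The specific branch choices below (a 3 × 3 local convolution, a pooled global descriptor, and a sigmoid-gated per-pixel fusion) are illustrative assumptions, not the exact GLCA architecture.

```python
import torch
import torch.nn as nn

class GLCAFusionSketch(nn.Module):
    """Hypothetical sketch of GLCA-style fusion: a local branch and a
    global branch are combined with a learned, per-pixel dynamic weight.
    Layer choices are assumptions for illustration only."""
    def __init__(self, ch=256):
        super().__init__()
        self.local_branch = nn.Conv2d(ch, ch, 3, padding=1)   # fine local detail
        self.global_branch = nn.Sequential(                   # global semantic descriptor
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1))
        self.weight_branch = nn.Sequential(                   # dynamic weighting
            nn.Conv2d(2 * ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        local = self.local_branch(x)
        glob = self.global_branch(x).expand_as(local)         # broadcast global context
        w = self.weight_branch(torch.cat([local, glob], dim=1))
        return w * local + (1 - w) * glob                     # weighted fusion
```

The gate w lets the module emphasize local detail where boundaries matter and global context where semantics are ambiguous, which is the behavior the rate = 18 path lacks.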
To evaluate the individual effectiveness and synergistic adaptability of the proposed GLCA and SWConv modules in semantic segmentation, we conducted a systematic ablation study.
Table 10 presents the results on the Cityscapes dataset. The two modules were inserted into or used to replace different components of the ASPP module under two input resolutions (256 × 256 and 768 × 768). Specifically, the positions labeled 1, 6, 12, and 18 correspond to the first feature extraction layer and the dilated convolution layers with dilation rates of 6, 12, and 18 in the ASPP module, respectively.
Experimental results show that regardless of the insertion position within the ASPP module, both the GLCA and SWConv modules consistently improve performance. Notably, placing the GLCA module and SWConv module in the fourth layer of the ASPP module (rate = 18) yields the best results across all evaluation metrics. This indicates that introducing these modules at the final stage of feature extraction helps integrate the multi-scale semantic information captured by the preceding layers more effectively, thereby enhancing the model’s ability to capture cross-layer feature dependencies.
To further validate the synergistic benefits of the two modules, we integrated both the GLCA module and the SWConv module into the final layer of the ASPP module. Comparative results demonstrate that the combined use of the modules yields significantly better performance than either module alone. At an input resolution of 256 × 256, the model achieved improvements of 2.73% in mIoU, 2.64% in mRecall, 2.15% in mPrecision, and 0.27% in mAccuracy. At the higher resolution of 768 × 768, mIoU increased by 1.47%, mRecall by 1.80%, mPrecision by 0.32%, and mAccuracy by 0.13%. These quantitative results show that the proposed GLCA and SWConv modules perform well across input resolutions, validating their scale adaptability and cross-scene generalization capability, and that they consistently and stably enhance model performance in multi-resolution semantic segmentation tasks.
Figure 11 presents the semantic segmentation visualization results on the Cityscapes dataset, showing the step-by-step integration of the GLCA and SWConv modules into the baseline model. As highlighted in the annotated regions of the figure, the inclusion of these two modules leads to more accurate recognition of fine-grained objects such as pedestrians, poles, and traffic lights. The object boundaries become noticeably clearer, misclassifications are significantly reduced, and the segmentation results show substantial improvements in both completeness and precision. These visual results further validate the synergistic effect of the GLCA and SWConv modules, demonstrating that they not only enhance the model’s capability in object perception and boundary localization but also effectively mitigate semantic confusion across categories.
To validate the stability and generalization of the proposed modules across different scenarios and resolutions, additional experiments were conducted on the CamVid dataset. Since no complex cropping or restoration operations are performed during training and testing at 480 × 360 resolution, the FPS metric is introduced to better assess the real-time processing capability of the model and its modules.
Table 11 shows the ablation results of the GLCA and SWConv modules on the CamVid dataset at two resolutions, 480 × 360 and 960 × 720. The experiments show that the two proposed modules significantly improve model performance when integrated into the ASPP module at various positions, with the best effect achieved at the dilated convolution with a dilation rate of 18. At a resolution of 480 × 360, introducing the two modules improves mIoU by 1.29%, mRecall by 0.64%, mPrecision by 1.81%, and mAccuracy by 0.34% over the baseline model. At the high resolution of 960 × 720, mIoU improves by 1.48%, mRecall by 1.95%, and mAccuracy by 0.26%, while the best mPrecision is obtained when the rate = 18 convolution is replaced by the SWConv module alone. The overall results further verify the effectiveness of the GLCA and SWConv modules in enhancing model performance, and particularly their stability and adaptability under high-resolution input. Both modules bring performance gains at different insertion positions, with the combined use yielding the most significant effect, reflecting good module compatibility and complementarity. Although inference speed is slightly sacrificed, accuracy improves markedly, demonstrating a favorable trade-off between accuracy and efficiency.
Figure 12 shows the semantic segmentation results on the CamVid dataset after gradually adding modules. As seen in the marked area, the baseline model exhibits noticeable semantic confusion and boundary discontinuities, especially around structures like walls, indicating its limited ability to capture fine-grained details. With the introduction of the GLCA and SWConv modules, the model’s ability to understand semantics and accurately locate boundaries improves significantly, with the segmentation results progressively refining from coarse to fine. Finally, when both modules are integrated, they not only reduce semantic interference between categories but also enhance the clarity and integrity of structural edges. The segmented images become visually closer to the true labels, both in terms of appearance and structural consistency.
Table 12 presents the quantitative ablation results on the BDD100K dataset. It can be observed that, based on the baseline segmentation model, progressively introducing the GLCA and SWConv modules leads to improvements across all four evaluation metrics, fully demonstrating the effectiveness and applicability of these two modules in complex autonomous driving scenarios. When the GLCA and SWConv modules are integrated simultaneously, the segmentation performance is further enhanced. Specifically, compared to the baseline model, mIoU increases by 1.73%, mRecall by 1.82%, mPrecision by 1.62%, and mAccuracy by 0.32%. These results indicate that the proposed modules work synergistically to effectively enhance the model’s ability to perceive and recognize diverse semantic information in autonomous driving environments, thereby improving overall segmentation performance and robustness.
To more intuitively verify the performance improvements brought by the proposed modules in complex scenarios, qualitative ablation experiments were conducted on the BDD100K dataset.
Figure 13 shows a comparison of segmentation results from the baseline model, the model with the GLCA module, the model with the SWConv module, and the model combining both modules. The baseline model exhibits semantic ambiguity and unclear boundaries in some detailed regions. After introducing the GLCA module, the model achieves more accurate feature representation in key local areas, enhancing its ability to capture small objects and fine details. The addition of the SWConv module strengthens the model’s perception of multi-directional spatial information, effectively improving the representation of object contours and shapes. When both modules are combined, the model maintains overall semantic consistency while producing finer details and more precise boundaries, significantly enhancing the visual quality of the segmentation results. This qualitative analysis further validates the complementary advantages and practical effectiveness of the GLCA and SWConv modules in complex autonomous driving scenarios.
To better understand the function of each module, we next examine how the feature maps change after introducing the different modules and analyze the impact on model performance. The feature map results are shown in Figure 14. After introducing SWConv, feature extraction improves markedly compared to the original feature map, in which the effective features in the bright regions are scattered and multi-directional information is not fully integrated, leading to blurred target boundaries and spatial relationships. SWConv integrates multi-directional contextual information, making the bright regions in the feature map more coherent and providing more comprehensive coverage. Features such as road markings, vehicle contours, and pedestrian trajectories are presented more precisely, enhancing the perception of target boundaries and spatial relationships and providing more effective feature support for tasks in autonomous driving scenarios. After introducing the GLCA module, the key regions in the feature map become more prominent, indicating a significant increase in feature concentration in those areas. By integrating global and local contextual information, the GLCA module strengthens the model’s focus on critical regions of the image and dynamically adjusts their importance, thereby increasing attention on key features. This improvement enables the model to recognize fine details and comprehend spatial relationships more accurately, enhancing its ability to capture the essential aspects of the scene.
Finally, after using both modules together, the performance is further improved. The attention module complements SWConv in global context understanding and feature correlation capture, allowing SWConv to focus more on the key features and strengthening the connection between various parts of the feature map. The combination of the attention module and SWConv enhances feature extraction, making it more comprehensive and precise, as they work synergistically to optimize the model’s recognition and decision-making capabilities in autonomous driving.
4.3.2. Ablation Study on Hyperparameters of GLCA and SWConv
To comprehensively evaluate the impact of key module designs on the overall performance of the segmentation model, we further analyze the role of the core hyperparameters. This work conducts an extensive experimental analysis of the local feature extraction branch, global feature extraction branch, and dynamic weighting branch in the GLCA module, as well as the number and combination of padding directions in the SWConv module. By systematically adjusting these settings and observing the resulting performance changes, we can clarify the performance boundaries and stability of each structural design. Here, GLCA-NL, GLCA-NG, and GLCA-NW denote variants that remove the local feature extraction branch, the global feature extraction branch, and the dynamic weighting branch, respectively; the last variant fuses features by equal-weight averaging instead. SWConv-UD, SWConv-LF, SWConv-TD, SWConv-UDTD, SWConv-LFTD, and SWConv-UDLF denote configurations using only the up-down padding branch, the left-right padding branch, the diagonal padding branch, the up-down plus diagonal branches, the left-right plus diagonal branches, and the up-down plus left-right branches, respectively. To systematically assess and analyze these components, experiments are conducted on the Cityscapes and CamVid datasets.
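The directional ablation variants can be pictured with a toy function in which each flag enables one group of padding directions. The circular shift used here merely stands in for SWConv’s direction-specific padding and is not the actual implementation.

```python
import torch

def swconv_variant_sketch(x, use_ud=True, use_lr=True, use_diag=True):
    """Illustrative stand-in (not the paper's implementation) for the
    SWConv direction ablations: each enabled branch shifts the feature
    map along its direction before the branches are averaged. A circular
    shift (torch.roll) substitutes for direction-specific padding."""
    shifts = [(0, 0)]                        # identity path, always kept
    if use_ud:
        shifts += [(-1, 0), (1, 0)]          # up-down branch (SWConv-UD)
    if use_lr:
        shifts += [(0, -1), (0, 1)]          # left-right branch (SWConv-LF)
    if use_diag:
        shifts += [(-1, -1), (1, 1)]         # diagonal branch (SWConv-TD)
    outs = [torch.roll(x, shifts=s, dims=(2, 3)) for s in shifts]
    return torch.stack(outs, dim=0).mean(dim=0)
```

Each variant in the tables corresponds to enabling a different subset of these flags; the full SWConv enables all three direction groups.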
The quantitative results of each hyperparameter configuration on the Cityscapes dataset are shown in Table 13. The table shows that removing the local feature extraction branch, the global feature extraction branch, or the dynamic weighting branch of the GLCA module degrades performance to some degree across mIoU, mRecall, mPrecision, and mAccuracy, indicating that all branches play a crucial role in feature extraction and fusion. Removing the local feature extraction branch causes the most severe decline, with decreases of 1.75%, 1.65%, 1.79%, and 0.08% in mIoU, mRecall, mPrecision, and mAccuracy, respectively. These results demonstrate that the configurations of these branch structures have a significant impact on model performance, underscoring the importance of jointly optimizing structural design and hyperparameter settings in practical applications. Similarly, in the SWConv module, removing padding branches leads to significant performance drops, illustrating that the number of padding branches has a critical impact on feature representation capability. Different combinations of padding branch numbers and directions capture diverse spatial information and directional features, enhancing the model’s sensitivity to complex scene details; however, unreasonable combinations may cause information redundancy or reduce feature representation efficiency, resulting in performance degradation. Compared to SWConv-UD, the full SWConv module improves mIoU, mRecall, mPrecision, and mAccuracy by 0.97%, 0.65%, 1.10%, and 0.17%, respectively.
The qualitative results of each hyperparameter configuration on the Cityscapes dataset are illustrated in Figure 15. The visualizations further show that the different hyperparameter configurations have a significant impact on model performance. With reasonable hyperparameter settings, the model learns edge details and small-object regions more comprehensively, yielding superior segmentation results. Compared to the alternative configurations, the final GLCA and SWConv modules proposed in this work capture fine-grained features and restore complex scene details exceptionally well, effectively enhancing overall segmentation accuracy and robustness.
To evaluate the robustness and adaptability of the different hyperparameter configurations across scenarios, we also conducted experiments on the CamVid dataset, with the quantitative results presented in Table 14. The results indicate that the configurations maintain consistent performance trends on this dataset, further validating the effectiveness and generalizability of the proposed modular design. Specifically, the full GLCA configuration outperforms GLCA-NL by 1.43%, 1.08%, 2.20%, and 0.36% in terms of mIoU, mRecall, mPrecision, and mAccuracy, respectively, and the full SWConv configuration improves on SWConv-LFTD by 0.96%, 0.55%, 1.22%, and 0.33% across the same metrics. The qualitative results are shown in Figure 16. The visual comparisons show that both the GLCA and SWConv modules produce finer segmentation, particularly along edge contours, in small-object regions, and in structurally complex scenes. They preserve semantic boundaries better and exhibit superior detail recovery and visual consistency compared to the other hyperparameter settings, further confirming their effectiveness and adaptability across diverse scenarios.