1. Introduction
Semantic segmentation provides fine-grained pixel-level understanding of visual scenes, which is critical for safe and robust perception in autonomous vehicles. Traditionally, convolutional neural networks (CNNs) have dominated this field, but transformer-based architectures have recently gained traction due to their ability to model long-range dependencies and global context effectively.
SegFormer, a transformer-based architecture, introduces a hierarchical encoder and lightweight decoder design that eliminates the need for positional embeddings and heavy upsampling. Although extensively evaluated on datasets like Cityscapes, its behavior on smaller and geographically diverse datasets such as CamVid, KITTI, and IDD is less understood.
This paper addresses two major gaps in current research:
Understanding how different SegFormer variants scale on a smaller dataset like CamVid in terms of performance and computational cost.
Investigating the effectiveness of cross-dataset transfer learning from CamVid to KITTI (Germany) and IDD (India), representing structured and unstructured driving environments, respectively.
Additionally, we introduce explainability into the evaluation pipeline using confidence heatmaps, allowing us to visually interpret model uncertainty and decision quality at the pixel level.
Unlike prior studies that typically focus on architectural innovation, large-scale benchmarks, or synthetic-to-real adaptation, this work emphasizes a unified evaluation of architecture scaling, cross-dataset transfer, and explainability under limited-data conditions. In particular, the use of CamVid as a source domain reflects a practical scenario where only a small, well-annotated dataset is available for pretraining. This setup allows us to investigate how transformer-based segmentation models behave when transferred across geographically diverse environments without relying on large-scale source datasets.
This paper addresses the identified gaps as follows:
A systematic evaluation of SegFormer variants (B3, B4, B5) on CamVid to study architecture scaling effects.
Implementation of cross-dataset transfer learning from CamVid to KITTI and IDD, including custom class mapping strategies.
Introduction of confidence heatmaps to visualize model prediction certainty and aid explainability in safety-critical contexts.
Rather than proposing a new architecture, this paper contributes a structured empirical analysis that combines scaling behavior, real-to-real cross-dataset transfer, and interpretability within a single experimental framework. This integrated perspective distinguishes our study from prior works that typically examine these aspects independently.
The remainder of this paper is organized as follows:
Section 2 reviews prior work;
Section 3 details our methodology;
Section 4 describes the datasets and class mappings;
Section 5 outlines evaluation metrics and experimental results; and
Section 6 concludes the paper with insights and future directions.
2. Related Work
Semantic segmentation has progressed substantially with the advent of deep learning, evolving from convolution-based architectures to transformer-driven designs. Early approaches such as Fully Convolutional Networks (FCNs) and SegNet [1] introduced encoder–decoder frameworks that extracted spatial features using convolutional layers. Later models like the DenseNet-based Tiramisu [2] improved training stability via dense skip connections and deep supervision. These CNN-based models were effective for structured scene parsing but often struggled with modeling long-range dependencies and global context.
To address real-time constraints in autonomous systems, efficient architectures such as BiSeNet [3] adopted dual-path designs to balance spatial precision and receptive field. PIDNet [4] leveraged principles from control theory to better handle multi-scale information, while RTFormer [5] demonstrated that transformer-based designs could match CNN efficiency in real-time settings. These approaches prioritized inference speed, but often at the expense of segmentation accuracy in complex scenes.
Transformer-based segmentation models marked a turning point, with SegFormer [6] emerging as a notable breakthrough. Its hierarchical transformer encoder, overlapped patch embeddings, and Mix-FFN layers enabled strong global context modeling while preserving local structure, outperforming convolutional models on Cityscapes with fewer parameters. Further innovations like Skip-SegFormer [7] and CFF-SegFormer [8] extended this architecture with improved multi-scale feature fusion and decoder efficiency.
Parallel to architectural advancements, transfer learning has become essential for scenarios where labeled data is scarce. DAFormer [9] showed that domain-adaptive segmentation using transformers could generalize well across synthetic and real domains. However, most prior works focus on large-scale benchmarks or synthetic-to-real adaptation. Cross-dataset generalization between small, real-world datasets—such as CamVid, KITTI, and IDD—remains underexplored. This is particularly relevant for developing perception systems intended for deployment across diverse geographic regions. Our study addresses this gap by systematically evaluating SegFormer’s transferability from CamVid to both KITTI and IDD.
Interpretability is increasingly recognized as a key aspect of trustworthy segmentation. Bayesian SegNet [10] pioneered uncertainty modeling using Monte Carlo dropout. More recent approaches explore class activation maps (CAMs), attention visualizations, and confidence heatmaps to make dense predictions more interpretable. While explainable AI (XAI) has advanced for classification tasks, its integration with transformer-based segmentation remains limited. Recent transformer-specific interpretability techniques such as attention rollout and token-based saliency mapping provide complementary perspectives, but our focus remains on confidence-driven and gradient-based visualization within the SegFormer framework. Our work contributes to this area by introducing confidence heatmaps as a diagnostic tool to expose model uncertainty and failure points in urban scene segmentation.
Despite these advances, many studies evaluate models either on large-scale benchmarks or within a single methodological dimension, leaving open questions regarding scalability on smaller datasets, real-to-real transfer across geographic domains, and the interpretability of transformer-based segmentation models.
In summary, prior research has laid the foundation for efficient, accurate, and adaptive segmentation. Yet, the combined challenges of scaling transformer models, transferring them across real-world datasets, and interpreting their predictions remain open. We address this intersection by evaluating SegFormer’s performance under scaling, cross-dataset transfer, and explainability constraints, bringing a unified perspective to transformer-based semantic segmentation.
3. Datasets and Cross-Dataset Mapping
To evaluate both architectural scalability and cross-domain transferability, we used three publicly available urban driving datasets: CamVid, KITTI, and IDD. These datasets represent different geographical locations, label taxonomies, and annotation densities, making them suitable benchmarks for our study.
The Cambridge-driving Labeled Video Database (CamVid) [11] contains 701 densely annotated frames extracted from video sequences captured in the UK. It includes 32 semantic classes covering the road, buildings, pedestrians, vehicles, sky, and street furniture. We use the standard split: 367 training, 101 validation, and 233 testing images. Owing to its clean annotations and balanced scene composition, CamVid serves as the source domain for model scaling and transfer learning.
The KITTI Semantic Segmentation Benchmark [12] offers 200 high-resolution images from urban driving scenes in Germany. It follows a 19-class taxonomy derived from Cityscapes, focusing on structured environments with well-defined object boundaries. Due to its relatively small sample size and partial label overlap with CamVid, KITTI is selected as a target domain for cross-dataset transfer.
The Indian Driving Dataset (IDD) captures diverse and unstructured driving scenes in India. It contains 10,004 images annotated with 27 semantic classes, including region-specific categories such as autorickshaw, guard rail, and billboard. For consistency and computational feasibility, we use a subset comprising 1761 training and 350 validation samples. IDD introduces significant visual domain shifts, including varied lighting, occlusions, and unconventional vehicle types, making it a challenging yet valuable target domain.
To facilitate effective transfer learning from CamVid to target datasets, we establish a structured class mapping protocol. Since each dataset follows its own label taxonomy, alignment is necessary to preserve semantic consistency and ensure correct feature adaptation. Our mapping strategy categorizes class relationships into three types, as summarized in
Table 1.
To handle Novel Classes, we reinitialize the decoder weights while preserving the CamVid-pretrained encoder. This allows the network to reuse generalized features and adapt them to new class boundaries and appearance patterns during fine-tuning. This strategy ensures semantic consistency while enabling flexible cross-dataset transfer in geographically and structurally diverse urban scenes.
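As a concrete illustration of this protocol, the sketch below applies a lookup-table remapping in the spirit of the three mapping categories. The numeric label IDs and specific class pairs are hypothetical placeholders, not the actual CamVid or IDD taxonomies:

```python
import numpy as np

# Hypothetical label IDs for illustration only -- the real taxonomies differ.
# Each entry maps a CamVid-style source ID to a target (IDD-style) ID.
CAMVID_TO_IDD = {
    3: 0,    # direct mapping:   road -> road
    12: 7,   # semantic mapping: bicyclist -> rider
}
NOVEL_TARGET_CLASSES = {14}  # e.g. autorickshaw: no CamVid source class

def remap_labels(mask: np.ndarray, mapping: dict, ignore_index: int = 255) -> np.ndarray:
    """Remap a 2-D label mask; unmapped source IDs become ignore_index."""
    lut = np.full(256, ignore_index, dtype=np.uint8)  # lookup table over label IDs
    for src, tgt in mapping.items():
        lut[src] = tgt
    return lut[mask]

camvid_mask = np.array([[3, 3, 12], [3, 9, 12]], dtype=np.uint8)
idd_mask = remap_labels(camvid_mask, CAMVID_TO_IDD)
print(idd_mask)  # unmapped class 9 becomes 255
```

The lookup-table form keeps the remapping vectorized, which matters when masks are full-resolution images rather than toy arrays.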
4. Methodology
Our proposed methodology investigates the scalability, transferability, and interpretability of transformer-based segmentation using SegFormer, as illustrated in Figure 1. The experimental design is divided into three major stages: (1) model scaling experiments using CamVid; (2) cross-dataset transfer learning to KITTI and IDD; and (3) application of confidence-based interpretability techniques to evaluate model reliability.
4.1. SegFormer Architecture Overview
SegFormer [6] consists of three core components: (i) overlapping patch embedding to preserve spatial details, (ii) a hierarchical transformer encoder that extracts multi-scale contextual features, and (iii) a lightweight decoder that fuses these features via an MLP and performs semantic prediction.
We consider three SegFormer variants, B3, B4, and B5, differentiated by encoder depth, capacity, and computational cost:
B3: With 47.1 M parameters, this variant employs a hierarchical encoder whose transformer blocks are distributed across 4 stages in a (3, 4, 18, 3) configuration, where each stage operates at progressively reduced spatial resolutions of 1/4, 1/8, 1/16, and 1/32 of the input. It uses stage embedding dimensions of (64, 128, 320, 512) and attention reduction ratios of (8, 4, 2, 1) in its efficient self-attention mechanism.
B4: The intermediate variant contains 64.1 M parameters, deepening the encoder to a (3, 8, 27, 3) configuration across the four stages while retaining the same embedding dimensions. The additional depth increases its capacity to model complex relationships while maintaining reasonable computational demands.
B5: The largest variant, with 84.7 M parameters, extends the layer distribution to (3, 6, 40, 3). This provides substantially increased representational capacity and attention depth, allowing for more nuanced feature extraction and relationship modeling.
All three variants share the same lightweight MLP decoder structure, which aggregates multi-level features from the hierarchical encoder through a simple yet effective design. For all variants, we employ the same efficient self-attention mechanism with a reduced complexity of $O(N^2/R)$ instead of the standard quadratic $O(N^2)$, achieved through a sequence reduction process. This attention mechanism is mathematically expressed as
\[ \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left( \frac{Q\, R(K)^{\top}}{\sqrt{d_{head}}} \right) R(V), \]
where $R$ represents the reduction operation that decreases the sequence length by a factor of the reduction ratio.
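The sequence-reduction idea can be sketched numerically. The toy single-head implementation below uses randomly initialized projection matrices standing in for the learned reduction; it is an illustrative sketch of the mechanism, not SegFormer's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_self_attention(Q, K, V, r, W_k, W_v):
    """Single-head attention with sequence reduction on keys/values.

    Q, K, V: (N, d) token matrices; r: reduction ratio (N divisible by r).
    W_k, W_v: (d * r, d) projections mapping r concatenated tokens back
    to dimension d, shrinking the key/value length from N to N // r.
    """
    N, d = K.shape
    K_r = K.reshape(N // r, d * r) @ W_k      # (N/r, d)
    V_r = V.reshape(N // r, d * r) @ W_v      # (N/r, d)
    attn = softmax(Q @ K_r.T / np.sqrt(d))    # (N, N/r): cost O(N^2 / r)
    return attn @ V_r                          # (N, d)

rng = np.random.default_rng(0)
N, d, r = 16, 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
W_k, W_v = (rng.standard_normal((d * r, d)) / np.sqrt(d * r) for _ in range(2))
out = efficient_self_attention(Q, K, V, r, W_k, W_v)
print(out.shape)  # (16, 8)
```

The attention matrix shrinks from N×N to N×(N/r), which is where the O(N²/R) cost saving comes from.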
4.2. Cross-Dataset Transfer Learning
We use CamVid as the source domain for transfer learning due to its structured annotations and clean urban scenes. SegFormer-B3 serves as the base model for knowledge transfer. For each target dataset, KITTI and IDD, we perform the following steps:
1. Initialize the encoder using CamVid-pretrained weights.
2. Reinitialize the decoder to match the target dataset’s class taxonomy.
3. Adapt input–output pipelines using custom class mappings.
4. Fine-tune the model with a reduced learning rate to avoid catastrophic forgetting.
Our mapping strategy includes: Direct Mappings (e.g., road → road), Semantic Mappings (e.g., bicyclist → rider), and Novel Classes (e.g., autorickshaw in IDD).
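The encoder/decoder handling in the steps above can be sketched with a toy parameter dictionary standing in for SegFormer's weights; the layer names and learning-rate values are illustrative assumptions, not the actual training configuration:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_model(num_classes, d=256):
    """Toy stand-in for SegFormer: an 'encoder' matrix and a per-class head."""
    return {
        "encoder.proj": rng.standard_normal((d, d)) * 0.02,
        "decoder.head": rng.standard_normal((num_classes, d)) * 0.02,
    }

# Steps 1-2: keep the CamVid-pretrained encoder, reinitialize the decoder
# for the target taxonomy (e.g. 19 KITTI classes instead of 32 CamVid classes).
source = init_model(num_classes=32)
target = init_model(num_classes=19)
target["encoder.proj"] = source["encoder.proj"].copy()  # transferred weights
# target["decoder.head"] stays freshly initialized       # reinitialized head

# Step 4: fine-tune with a reduced learning rate (values are illustrative).
lr_pretrain, lr_finetune = 6e-5, 6e-6
assert lr_finetune < lr_pretrain
print(target["decoder.head"].shape)  # (19, 256)
```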
4.3. Transfer Learning Algorithm
To formalize our cross-dataset knowledge transfer approach, we present Algorithm 1, which details the process of transferring learned representations from a source urban scene dataset (CamVid) to target datasets (KITTI and IDD) with different class taxonomies and visual characteristics.
| Algorithm 1 Cross-Dataset Knowledge Transfer for Semantic Segmentation. |
| Require: Source dataset D_S with C_S classes, target dataset D_T with C_T classes |
| Require: Class mapping function M: C_S → C_T |
| Ensure: Target-adapted model θ_T |
| 1: | Phase 1: Source Domain Pretraining |
| 2: | Initialize SegFormer model θ_S with random weights |
| 3: | Train θ_S on D_S using loss L_total |
| 4: | Store optimized source parameters θ_S* |
| 5: | Phase 2: Cross-Domain Parameter Transfer |
| 6: | Initialize target model θ_T with encoder weights from θ_S* |
| 7: | Randomly initialize decoder parameters of θ_T for C_T classes |
| 8: | Apply class mapping M to align source and target semantics |
| 9: | Phase 3: Target Domain Fine-tuning |
| 10: | Set learning rate η_ft ≪ η_pretrain |
| 11: | for epoch = 1 to max_epochs do |
| 12: | Update θ_T by minimizing L_total on D_T |
| 13: | Evaluate θ_T on validation set |
| 14: | if early stopping criterion met then |
| 15: | break |
| 16: | end if |
| 17: | end for |
| 18: | return optimized target model θ_T* |
4.4. Loss Function
To ensure robust learning across imbalanced classes and complex boundary regions, we employ a multi-component loss function that addresses complementary aspects of semantic segmentation quality:
\[ \mathcal{L}_{total} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{IoU} + \lambda_3 \mathcal{L}_{boundary} \]
The class-weighted cross-entropy component $\mathcal{L}_{CE}$ addresses class imbalance by applying inverse frequency weighting:
\[ \mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} w_c\, y_{i,c} \log p_{i,c}, \]
where $y_{i,c}$ is the ground truth, $p_{i,c}$ is the predicted probability, and $w_c$ is the class weight calculated as $w_c = 1/f_c$, with $f_c$ being the frequency of class $c$ in the training set.
The IoU loss component $\mathcal{L}_{IoU}$ focuses on optimizing the Intersection-over-Union metric directly:
\[ \mathcal{L}_{IoU} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i} p_{i,c}\, y_{i,c}}{\sum_{i} \left( p_{i,c} + y_{i,c} - p_{i,c}\, y_{i,c} \right)} \]
The boundary-aware component $\mathcal{L}_{boundary}$ enhances precision at class transitions:
\[ \mathcal{L}_{boundary} = -\frac{1}{|B|} \sum_{i \in B} w_i \sum_{c=1}^{C} y_{i,c} \log p_{i,c}, \]
where $B$ is the set of boundary pixels identified using a Sobel edge detector on the ground truth, and $w_i$ is a distance-based weight that emphasizes pixels closer to boundaries.
The coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ were determined through a systematic grid search on the validation set. This configuration achieved an optimal balance between overall segmentation accuracy and boundary precision.
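A minimal NumPy sketch of the first two loss components follows; the boundary term is omitted for brevity, and the λ values shown are placeholders rather than the grid-searched coefficients, which are not reproduced here:

```python
import numpy as np

def weighted_ce(probs, onehot, class_freq, eps=1e-7):
    """Class-weighted cross-entropy with inverse-frequency weights w_c = 1/f_c."""
    w = 1.0 / (class_freq + eps)                  # (C,) per-class weights
    return -np.mean(np.sum(w * onehot * np.log(probs + eps), axis=-1))

def soft_iou_loss(probs, onehot, eps=1e-7):
    """1 minus the mean soft IoU over classes."""
    inter = np.sum(probs * onehot, axis=(0, 1))
    union = np.sum(probs + onehot - probs * onehot, axis=(0, 1))
    return 1.0 - np.mean((inter + eps) / (union + eps))

# Toy 2x2 image with C = 2 classes; probs sum to 1 per pixel.
probs = np.array([[[0.9, 0.1], [0.8, 0.2]],
                  [[0.3, 0.7], [0.6, 0.4]]])
onehot = np.array([[[1, 0], [1, 0]],
                   [[0, 1], [1, 0]]], dtype=float)
class_freq = onehot.mean(axis=(0, 1))             # empirical f_c

lam1, lam2 = 1.0, 0.5                             # illustrative weights only
total = lam1 * weighted_ce(probs, onehot, class_freq) \
      + lam2 * soft_iou_loss(probs, onehot)
print(round(total, 4))
```

A sanity check on the soft IoU term: feeding the ground truth back in as the prediction drives the loss to zero, as expected for an overlap-based objective.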
4.5. Explainability via Confidence Heatmaps
To interpret model behavior, we generate confidence heatmaps that visualize per-pixel prediction confidence derived from the softmax output. These maps provide spatial cues on:
High-confidence regions (well-learned objects);
Uncertain predictions (occlusions, rare classes);
Model confusion at boundaries.
These visualizations help reveal failure modes, guide further training improvements, and support safer model deployment in real-world scenarios.
5. Evaluation Metrics and Results
This section presents a rigorous quantitative and qualitative assessment of our SegFormer experiments across the CamVid, KITTI, and IDD datasets. We first establish the mathematical foundation of our evaluation framework, and then analyze the performance of different SegFormer variants and the effectiveness of our cross-dataset transfer learning approach.
5.1. Mathematical Formulation of Evaluation Metrics
Mean Intersection over Union (mIoU) measures the overlap between predicted and ground truth segmentation masks for each class, and then averages across all classes:
\[ \mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c} \]
In our CamVid experiments, $C = 32$ semantic classes, while KITTI uses $C = 19$ classes and IDD uses $C = 27$ classes.
Pixel Accuracy (PA) quantifies the overall proportion of correctly classified pixels:
\[ \mathrm{PA} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} \left( TP_c + FN_c \right)} \]
Convergence Acceleration Factor (CAF) measures the relative reduction in training time achieved by transfer learning:
\[ \mathrm{CAF} = \frac{T_{baseline} - T_{transfer}}{T_{baseline}} \times 100\% \]
Class-Specific Transfer Gain (CSTG) quantifies the improvement for each semantic class after transfer learning:
\[ \mathrm{CSTG}_c = \mathrm{IoU}_c^{transfer} - \mathrm{IoU}_c^{baseline} \]
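These metrics can all be derived from a per-image confusion matrix. The sketch below computes mIoU and PA on a toy prediction, and checks that an 18-to-7-epoch speedup, as reported for KITTI, corresponds to a CAF of 61.1%:

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """num_classes x num_classes matrix; rows = ground truth, cols = prediction."""
    idx = gt * num_classes + pred
    return np.bincount(idx.ravel(), minlength=num_classes ** 2) \
             .reshape(num_classes, num_classes)

def miou_and_pa(pred, gt, num_classes):
    cm = confusion_matrix(pred, gt, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)    # guard against empty classes
    return iou.mean(), tp.sum() / cm.sum()

def caf(t_baseline, t_transfer):
    """Percentage reduction in training time (epochs or wall-clock)."""
    return 100.0 * (t_baseline - t_transfer) / t_baseline

gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
miou, pa = miou_and_pa(pred, gt, num_classes=3)
print(round(pa, 3))          # 0.833 -- 5 of 6 pixels correct
print(round(caf(18, 7), 1))  # 61.1
```

CSTG is then simply the per-class IoU difference between the transfer-learned and baseline runs.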
5.2. Architecture Scaling Analysis on CamVid
We systematically evaluated three SegFormer variants (B3, B4, and B5) on the CamVid dataset to understand the relationship between model capacity and segmentation performance.
Table 2 summarizes our findings.
The relationship between model size and performance follows a sub-linear pattern, indicating diminishing returns as model capacity increases, as shown in Figure 2. While SegFormer-B5 achieves the highest accuracy, the performance gain over B3 (+4.5% mIoU) comes at the cost of significantly increased parameters (+80%) and inference time (+29.6%).
For our cross-dataset transfer learning experiments, we selected SegFormer-B3 as the base model due to its favorable balance between accuracy and efficiency.
5.3. KITTI Transfer Performance
We transferred knowledge from CamVid (source domain) to KITTI (target domain) by initializing a SegFormer-B3 model with weights pretrained on CamVid, and then fine-tuning on KITTI training data.
Table 3 quantifies the class-specific benefits of this transfer learning approach, which are also visualized in Figure 3.
The magnitude of improvement varies significantly across classes, with structural elements showing the largest gains. The overall mIoU improvement from 52.08% to 53.42% may appear modest, but this aggregate metric obscures the substantial class-specific gains. Moreover, the training efficiency improvement is dramatic, with a 61.1% reduction in training time, from 18 epochs to 7. Figure 4 contrasts the baseline model with the transfer-learned model.
5.4. IDD Transfer Performance
The Indian Driving Dataset presents a more challenging transfer scenario due to greater visual differences from European urban scenes.
Table 4 shows that transfer learning yields even larger improvements on IDD than on KITTI, as also illustrated in Figure 5.
The remarkably high improvement for “Motorcycle” (+72.74%) can be attributed to low baseline performance, driven by class imbalance and visual complexity, combined with transferable features from similar classes in CamVid. Even classes unique to IDD, like “Autorickshaw”, benefit from transfer learning (+15.87%), demonstrating that lower-level features learned from CamVid transfer effectively despite semantic differences. Figure 6 contrasts the baseline model with the transfer-learned model.
6. Explainability and Interpretability
While performance metrics such as mean Intersection over Union (mIoU) and pixel-wise accuracy provide essential quantitative benchmarks for evaluating semantic segmentation models, they often fall short in explaining the underlying reasoning behind model predictions. To bridge this interpretability gap, we incorporate two complementary techniques: confidence heatmaps and Gradient-weighted Class Activation Mapping (Grad-CAM).
Confidence heatmaps serve as a visual indicator of the model’s certainty in its predictions. For each pixel $i$, the confidence score is defined as the maximum softmax probability across all classes:
\[ \mathrm{conf}(i) = \max_{c}\; p_{i,c} \]
This scalar value reflects the model’s belief in its most likely prediction at that location. As illustrated in
Figure 7, high-confidence regions typically correspond to well-represented classes like roads and vehicles, while low-confidence regions often cluster around ambiguous areas, class boundaries, and occluded objects.
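Computing this per-pixel confidence map is straightforward; a small NumPy sketch on two toy pixels, one confidently classified and one near-ambiguous:

```python
import numpy as np

def confidence_map(logits):
    """Per-pixel confidence: maximum softmax probability over classes.

    logits: (H, W, C) raw scores -> (H, W) confidence values in [1/C, 1].
    """
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

logits = np.array([[[4.0, 0.5, 0.1],    # confident pixel
                    [1.0, 0.9, 1.1]]])  # ambiguous pixel (near-uniform scores)
conf = confidence_map(logits)
print(conf.shape)               # (1, 2)
print(conf[0, 0] > conf[0, 1])  # True: the confident pixel scores higher
```

Rendering `conf` with a colormap yields exactly the heatmaps discussed above: low values cluster at boundaries and occlusions, high values over well-learned regions.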
Grad-CAM is employed to generate class-specific activation maps, enabling us to probe the model’s internal reasoning. This technique computes the gradient of the class-specific logit $y^c$ with respect to the activation maps $A^k$ of the final convolutional layer. The final Grad-CAM heatmap $L^c_{\mathrm{Grad\text{-}CAM}}$ for class $c$ is obtained by
\[ L^c_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right), \qquad \alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}, \]
where $\alpha_k^c$ are importance weights for each channel and $Z$ is the number of spatial locations. As shown in Figure 8, each class triggers distinct regions of activation, with road-class activations across pavement areas and pole-class activations in vertical regions.
In our implementation, Grad-CAM is applied to the final encoder feature representation prior to decoder fusion, enabling visualization of spatial attention within the transformer backbone while preserving the segmentation-specific prediction pipeline.
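Given the activations and their gradients (supplied directly here rather than obtained via autograd, which a real implementation would use), the Grad-CAM combination step can be sketched as:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from precomputed activations A^k and gradients dy_c/dA^k.

    activations, gradients: (K, H, W) arrays. In practice the gradients come
    from backpropagating the class-c logit; here they are passed in directly.
    """
    alpha = gradients.mean(axis=(1, 2))               # (K,) channel weights
    cam = np.tensordot(alpha, activations, axes=1)    # weighted sum over channels K
    cam = np.maximum(cam, 0)                          # ReLU keeps positive evidence
    return cam / cam.max() if cam.max() > 0 else cam  # normalize to [0, 1]

rng = np.random.default_rng(1)
A = rng.random((8, 4, 4))           # toy feature maps (K=8 channels)
G = rng.standard_normal((8, 4, 4))  # toy gradients of the class logit
heatmap = grad_cam(A, G)
print(heatmap.shape)  # (4, 4)
```

In our setting the heatmap would then be bilinearly upsampled to the input resolution and overlaid on the image for figures like those in Figure 8.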
To examine generalizability across diverse driving environments, we evaluate Grad-CAM outputs on the CamVid and IDD datasets. The CamVid dataset features dense urban traffic with frequent pedestrian and bicycle interactions. As illustrated in
Figure 9, the model shows broader, more diffused attention around dynamic objects, while stationary classes maintain sharp activations.
The IDD dataset poses greater challenges due to unstructured Indian street scenes. As shown in
Figure 10, attention remains strong for high-frequency classes like road and vegetation, but becomes scattered for occluded classes, reflecting increased model uncertainty in complex contexts.
Visual Results with Prediction Overlay
To supplement our quantitative metrics and interpretability analyses, we present qualitative visualizations demonstrating SegFormer-B3’s segmentation capabilities across diverse real-world driving environments.
Figure 11 shows results from the CamVid dataset, where the model demonstrates strong performance in delineating core urban classes such as roads, sidewalks, and buildings, as well as dynamic objects like cyclists and pedestrians.
Figure 12 shows the results of the KITTI dataset, featuring structured suburban driving scenes. SegFormer-B3 exhibits precise segmentation of road surfaces, curbs, and sidewalks, with clear improvements from transfer learning.
Figure 13 presents segmentation examples from the IDD dataset with unstructured Indian street scenes. Despite complexity, SegFormer-B3 accurately segments roads and vehicles, and adapts well to underrepresented classes after transfer learning.
These visualizations collectively illustrate that SegFormer-B3 generalizes well across varying spatial layouts and lighting conditions, demonstrating its suitability for real-time, resource-constrained environments like autonomous vehicles.
7. Conclusions and Future Work
This study presented a comprehensive evaluation of SegFormer for urban scene segmentation, contributing significant insights in three key dimensions: architecture scaling, cross-dataset transfer learning, and model interpretability. Rather than introducing a new architecture, our contribution lies in providing an integrated empirical analysis that combines scaling behavior, cross-dataset transfer, and explainability within a unified evaluation framework.
Our architectural scaling experiments on the CamVid dataset demonstrated that while SegFormer-B5 achieves the highest mIoU, the efficiency–performance trade-off favors the more balanced SegFormer-B3 variant for practical deployments. The relationship between model capacity and performance follows a sub-linear pattern, with diminishing returns as parameter count increases, an important consideration for resource-constrained autonomous systems.
Our cross-dataset transfer learning investigation revealed that knowledge acquired from CamVid can significantly enhance segmentation performance in both structurally similar environments (KITTI) and substantially different driving contexts (IDD). Transfer benefits were particularly pronounced for classes with limited training examples and complex geometric structures, with improvements as high as +72.74% for “Motorcycle” in IDD and notable gains for “Wall” in KITTI. Moreover, transfer learning dramatically reduced training time on KITTI by 61.1%, demonstrating considerable practical value for model adaptation across geographic regions.
Our interpretability analysis incorporated both confidence heatmaps and Gradient-weighted Class Activation Mapping (Grad-CAM) to enhance transparency in the model’s internal decision-making. Confidence heatmaps allowed us to visualize model certainty at the pixel level, while Grad-CAM provided class-specific saliency visualizations. This layer of explainability is critical for safety-critical applications such as autonomous driving, where understanding the rationale behind predictions is essential for trust and deployment readiness.
These multifaceted contributions offer significant value for real-world autonomous systems facing the challenges of varied deployment conditions, limited labeled data availability, and model explainability requirements. By demonstrating SegFormer’s adaptability across diverse urban environments, our work provides a foundation for developing more geographically robust perception systems.
Although this study focuses on image-domain transformer models, extending similar transferability and explainability principles to 3D point cloud representations remains an important future direction. Recent work exploring physics-aware point cloud learning suggests that integrating geometric reasoning with transformer-based perception could further enhance robustness in complex urban environments. Investigating how image-domain transfer strategies translate to multimodal or 3D spatial representations is therefore a promising avenue for future research.
Several promising directions extend from our current findings. First, exploring multi-source pretraining and domain adaptation strategies could further enhance model generalization. Rather than transferring from a single source dataset, simultaneously leveraging knowledge from multiple geographically diverse datasets could create more universal feature representations.
Second, evaluating SegFormer deployment on embedded hardware platforms would address real-world implementation challenges. Quantifying the latency–accuracy trade-offs across different computational constraints would provide practical guidelines for autonomous vehicle manufacturers and smart city developers.
By pursuing these research directions, we anticipate significant advances in the development and deployment of efficient, transferable, and interpretable segmentation models for urban scene understanding in diverse global environments.