Article

A GAN-Based Framework with Dynamic Adaptive Attention for Multi-Class Image Segmentation in Autonomous Driving

by
Bashir Sheikh Abdullahi Jama
* and
Mehmet Hacibeyoglu
Department of Computer Engineering, Necmettin Erbakan University, Konya 42090, Türkiye
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8162; https://doi.org/10.3390/app15158162
Submission received: 19 June 2025 / Revised: 19 July 2025 / Accepted: 21 July 2025 / Published: 22 July 2025

Abstract

Image segmentation is a foundation of autonomous driving frameworks, enabling vehicles to perceive and navigate their surrounding environment. It provides essential context for decision-making by partitioning the image into meaningful regions such as roads, vehicles, pedestrians, and traffic signs. Precise segmentation supports safe navigation and collision avoidance, and compliance with traffic rules is critical for the seamless operation of self-driving cars. Recent deep learning-based image segmentation models have demonstrated impressive performance in structured environments, yet they often fall short under the complex and unpredictable conditions encountered in autonomous driving. This study proposes an Adaptive Ensemble Attention (AEA) mechanism within a Generative Adversarial Network (GAN) architecture to handle dynamic and complex driving conditions. The AEA adaptively integrates self, spatial, and channel attention and dynamically adjusts the contribution of each according to the input and its contextual relevance. The discriminator network of the GAN evaluates the segmentation mask created by the generator, distinguishing real from fake masks by considering a concatenated pair of the original image and its mask. Adversarial training prompts the generator, via the discriminator, to produce segmentation masks that align with the expected ground truth while remaining highly realistic. This exchange of information between the generator and discriminator improves segmentation quality. To assess the accuracy of the proposed method, three widely used datasets (BDD100K, Cityscapes, and KITTI) were selected, yielding average IoU values of 89.46%, 89.02%, and 88.13%, respectively. These outcomes emphasize the model's effectiveness and consistency. Overall, it achieved a remarkable accuracy of 98.94% and an AUC of 98.4%, indicating strong improvements over state-of-the-art (SOTA) models.

1. Introduction

Among the latest and most important milestones in transportation, autonomous driving systems aim to make road transport safer, optimize traffic, and make mobility generally more efficient. The enabling factors include recent developments in artificial intelligence, sensor technology, and real-time computing. An autonomous vehicle architecture at its very core has a multi-dimensional framework integrating perception, localization, planning, control, and connectivity. Perception refers to the capability of a vehicle to identify and interpret its environment through several sensors, including cameras, LiDAR, radar, and ultrasonic systems [1]. Localization techniques, using GPS and mapping, determine the precise position of the vehicle in its surroundings. Planning algorithms implement decisions on route optimization and obstacle avoidance, whereas control systems actuate these decisions through various mechanical operations performed by the vehicle [2].
Within the broad scope of autonomous driving, image segmentation (IM) emerges as a crucial technology in the vehicle's control system [3]. The IM is a computer vision procedure that partitions visual information into distinct regions or segments, enabling a more granular understanding of the environment. This process is vital for semantic interpretation, as it allows the autonomous system to identify and separate features such as vehicles, pedestrians, traffic lights, and road signs. Semantic segmentation assigns every pixel in an image to a particular class, enabling a detailed environmental understanding [4]. This capability is especially critical for lane recognition and drivable-area identification, which are essential to safe and precise navigation. Image segmentation also supports real-time object detection and classification, improving the vehicle's ability to avoid obstacles and interact reliably with its surroundings. By partitioning and analyzing environmental conditions, the system can also adapt to changing weather and lighting, for example recognizing wet roads or handling low-visibility scenarios such as snow and fog. This flexibility highlights the importance of robust and efficient image segmentation algorithms, frequently powered by deep learning models such as convolutional neural networks (CNNs). These models have substantially advanced the accuracy and speed of the segmentation process, making real-time processing practical for autonomous driving applications [5].
Despite its great potential, image segmentation in autonomous driving faces several difficulties. Guaranteeing performance under diverse environmental circumstances, such as extreme weather or occlusions, remains a key area of focus. Furthermore, real-time operation requires substantial computational resources, imposing constraints on on-board systems. Addressing these difficulties involves exploring enhanced neural architectures, such as Transformer models, and incorporating multi-modal sensor data to improve precision and reliability. Emerging trends also highlight the use of distributed cloud computing and edge technologies to handle hardware constraints and accelerate processing [6].
Earlier classical methods based on manual feature extraction struggled with urban scenes due to occlusions, changing light conditions, and an abundance of diverse objects. While recent deep learning models have improved performance, problems remain [3]. This paper therefore proposes a novel deep learning approach based on dynamically adaptive attention in GANs to address these limitations. The work introduces a new attention mechanism that continuously adapts to varying image complexity and a GAN-based architecture tailored to multi-class segmentation tasks. The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the study's methodology, and Section 4 and Section 5 present experimental results and discussion.

2. Related Work

The rapid advancement of digital technology has significantly accelerated the development of autonomous vehicles (AVs) and advanced driver-assistance systems (ADAS), opening the door to a future in which AVs will operate without requiring human interaction. Current ADAS techniques, however, have trouble identifying precise stopping sites and other ambiguous conditions [7]. Applications of semantic image segmentation are numerous and include medical image analysis and autonomous driving [8]. The dynamic adaptability required for complicated driving circumstances is lacking in traditional segmentation algorithms like U-Net and Fully Convolutional Networks (FCNs) [9]. Although recent developments in attention mechanisms have shown promise, their integration into GANs remains little explored [10]. GANs have also been used to produce images that match provided captions, with artificial image-caption pairs supplying further training data [11]. For vision perception systems, lightweight and high-performance networks are essential, especially in settings with limited resources [12]. Wang et al. [13] proposed a novel 4D semantic segmentation framework that integrates motion and semantic information through a motion-semantic fusion module, achieving real-time performance with reduced computational complexity. The importance of attention mechanisms in improving network performance has been highlighted by recent developments in CNNs. Instead of using end-to-end techniques, the authors of [14] suggested a tiered strategy using a semantic segmentation model to improve the interpretability of autonomous driving systems. To address panoptic driving tasks efficiently, Chang et al. [15] introduced Q-YOLOP, a quantization-aware multi-task perception model utilizing the Efficient Layer Aggregation Network (ELAN) and advanced training techniques to ensure high accuracy and generalization with low computational demands. Luo et al. [16] proposed IDS-MODEL, an optimized CNN-based multi-task network capable of real-time instance segmentation and drivable-area segmentation, enhanced by attention mechanisms, residual connections, and feature fusion modules, achieving impressive performance and efficiency in real-world applications.
A comparative study on real-time semantic segmentation for autonomous driving was carried out by Mennatullah Siam et al. [17], who emphasized the necessity of both accurate and computationally economical methods. Di Feng et al. [18] examined progress in perception for autonomous driving, emphasizing deep learning-based multi-modal object detection and semantic segmentation. Similarly, Chen et al. [19] designed HRDLNet, a high-resolution semantic segmentation network for urban streetscapes that combines multi-scale features, dual attention mechanisms, and dynamic contextual information to enhance performance on datasets like Cityscapes and PASCAL VOC. The authors [18] emphasized the utilization of diverse sensors (e.g., cameras, LiDARs, Radars) to exploit their complementary attributes via fusion. Wei Zhou et al. [20] suggested an innovative approach for assessing the stability of semantic segmentation models with LiDAR data to automate validation and eradicate labor-intensive human labeling. In a unified multi-task setting, Qian et al. [21] developed DLT-Net, which promotes shared learning across drivable area detection, lane line segmentation, and object detection by constructing inter-task context tensors, outperforming conventional multi-task models. Eren Erdal Aksoy et al. [22] introduced SalsaNet, a deep encoder-decoder network designed for the semantic segmentation of 3D LiDAR point clouds, with an emphasis on delineating drivable road areas and cars. The network employs Bird-Eye-View (BEV) projections of point clouds for effective segmentation.
Wu et al. [23] presented YOLOP, an efficient perception model that performs all three major driving tasks in real time on embedded systems, demonstrating superior accuracy on the BDD100K dataset. Hironobu Fujiyoshi et al. [24] observed the enhanced efficacy of deep learning methodologies in general object identification tests relative to preceding techniques. Xie et al. [25] introduced SegFormer, a Transformer-based semantic segmentation framework that unifies hierarchical encoding and lightweight MLP decoding, showing state-of-the-art performance and scalability across multiple benchmarks. Jingwei Yang et al. [26] conducted a survey of cutting-edge semantic segmentation techniques, highlighting their significance in applications such as autonomous driving, image enhancement, and 3D map reconstruction. Yin et al. [27] introduced a technique for multi-domain semantic segmentation that utilizes sentence embeddings to represent the class labels, which improves generalization across datasets; this technique is referenced as SESS in this paper. Zhu et al. [28] introduced a joint image-label propagation and boundary label relaxation framework to enhance segmentation accuracy by synthesizing new training pairs from video sequences; this method is referred to as VPLR (Video Propagation with Label Relaxation) in this review.

3. Materials and Methods

3.1. Overview of GAN Architecture

Generative Adversarial Networks (GANs) comprise two main architectural components: the Generator (G) and the Discriminator (D), which are trained concurrently within an adversarial framework. The Generator creates synthetic data samples that mimic the distribution of real data. A random noise vector, typically sampled from a latent space like a Gaussian or uniform distribution, serves as input to generate data, including images or text. The generator aims to create data that cannot be distinguished from real data by the discriminator [29]. The Discriminator is tasked with classifying inputs as either real, derived from the actual dataset, or fake, generated by the generator. The output is a probability score that reflects the likelihood of the input being genuine. The objective of the discriminator is to effectively differentiate between authentic and generated data. In this minimax game, the two networks compete, with the generator seeking to maximize the discriminator's misclassification of fake data, while the discriminator strives to minimize its classification error. The adversarial interaction compels both networks to enhance their performance iteratively, resulting in a generator that can produce high-quality synthetic data closely aligned with the real distribution.

3.1.1. Discriminator Loss

The discriminator's goal is to correctly classify fake samples (produced by the generator) as fake and real samples as real. Its loss is generally calculated as follows:
L_D = -\frac{1}{2}\,\mathbb{E}_{a \sim p_{\mathrm{real}}(a)}\!\left[\log D(a)\right] - \frac{1}{2}\,\mathbb{E}_{b \sim p_{\mathrm{noise}}(b)}\!\left[\log\!\left(1 - D(G(b))\right)\right] \quad (1)
where the expectations can be approximated as:
\mathbb{E}_{a \sim p_{\mathrm{real}}(a)}\!\left[h(a)\right] \approx \frac{1}{k}\sum_{j=1}^{k} h(a_j) \quad (2)
\mathbb{E}_{b \sim p_{\mathrm{noise}}(b)}\!\left[h(b)\right] \approx \frac{1}{L}\sum_{j=1}^{L} h(b_j) \quad (3)
Here, a_j denotes samples drawn from the real dataset, k is the number of real samples, b_j denotes samples drawn from the noise distribution, and L is the number of noise samples. \mathbb{E}_a and \mathbb{E}_b are the expectations over real samples a and generated samples b in the loss formulation, respectively, D is the discriminator function, and G is the generator function.

3.1.2. Generator Loss

The central objective of the GAN's generator is to produce data samples that the discriminator incorrectly classifies as genuine. Its loss is computed using the formula below.
L_G = -\frac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\!\left[\log D(G(z))\right] \quad (4)
As shown in Equation (4), the generator receives a random noise vector z sampled from a noise distribution p_z(z); this noise is transformed by the generator G into a synthetic data sample, and the discriminator D then evaluates this sample to estimate the probability that it is real. The term D(G(z)) represents the discriminator's confidence that the generated data is authentic. By taking the logarithm of this value and computing its expectation over all sampled noise inputs, the generator aims to maximize the likelihood that the discriminator is "fooled" into classifying fake data as real. The negative sign and the scaling factor 1/2 indicate that the generator is trained to minimize this loss, thereby improving its ability to generate realistic samples. This formulation ensures that the generator continually learns to produce data that better resembles the true data distribution as training progresses.
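As an illustration only, the losses in Equations (1)-(4) can be sketched in PyTorch as follows; the helper names, the small epsilon for numerical stability, and the assumption that the discriminator outputs probabilities in (0, 1) are ours rather than details of the authors' implementation.

```python
import torch

def discriminator_loss(D, G, a_real, b_noise, eps=1e-8):
    # L_D = -1/2 E_a[log D(a)] - 1/2 E_b[log(1 - D(G(b)))]   (Eqs. (1)-(3))
    fake = G(b_noise).detach()                       # block gradients into the generator
    loss_real = -0.5 * torch.log(D(a_real) + eps).mean()
    loss_fake = -0.5 * torch.log(1.0 - D(fake) + eps).mean()
    return loss_real + loss_fake

def generator_loss(D, G, b_noise, eps=1e-8):
    # L_G = -1/2 E_z[log D(G(z))]                    (Eq. (4))
    return -0.5 * torch.log(D(G(b_noise)) + eps).mean()
```

In the conditional setting described in Section 3.3, the generator input would be the driving-scene image rather than raw noise, and the discriminator would receive the concatenated image-mask pair.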

3.2. Adaptive Ensemble Attention for Autonomous Driving

Attention mechanisms are crucial in autonomous driving, enhancing the segmentation of complex scenes by directing the model’s focus toward the most pertinent features and regions within an image. Self-attention effectively facilitates the comprehension of global dependencies in a scene. This enables the model to associate distant features, such as aligning lane markings throughout the image or identifying spatial relationships between a pedestrian on one side of the road and a vehicle on the opposite side. The global context is essential for precise segmentation in situations with multiple objects, including intersections or high-traffic zones.
Spatial attention, on the other hand, improves segmentation by concentrating on key spatial areas, ensuring that critical regions such as road boundaries, vehicles, and pedestrians receive more weight. By dynamically weighting spatial regions, spatial attention ensures that less relevant parts, such as the sky or building backgrounds, are deprioritized. This is particularly helpful for distinguishing objects in crowded or cluttered environments, where focusing on specific regions can substantially improve segmentation precision.
Channel attention complements these mechanisms by emphasizing the most informative feature maps produced by the network. By recalibrating feature channels, it ensures that critical details, such as road surfaces, vehicle edges, or traffic-signal elements, are enhanced while irrelevant features are suppressed. This selective emphasis strengthens the overall feature representation, leading to more precise segmentation even in challenging conditions such as poor lighting or adverse weather. This adaptive ensemble approach enables the model to generalize across diverse driving environments, from urban traffic to rural roads, handling challenges like occlusions, varying lighting, and weather conditions. By utilizing the advantages of each attention mechanism and dynamically adjusting their contributions, the Adaptive Ensemble Attention GAN provides robust and precise semantic segmentation, a critical capability for ADAS in real-world autonomous driving applications.
In open highway scenarios, self-attention may be predominant in capturing global dependencies such as lane alignment. In contrast, in densely populated urban environments, spatial attention may prioritize critical areas such as pedestrians or parked vehicles. Channel attention prioritizes the most relevant feature maps consistently across various scenarios. Equations (5)–(7) present the mathematical formulation for the AEA [30].
F = [D_i, D_l, P_i, P_l] \quad (5)
A_W = [D_{ai}, D_{al}, P_{ai}, P_{al}] \quad (6)
E_A = D_i \cdot D_{ai} + D_l \cdot D_{al} + P_i \cdot P_{ai} + P_l \cdot P_{al} \quad (7)
Here, D_i and D_l denote MobileNetV3's intermediate and final feature maps, and D_{ai} and D_{al} their corresponding attention maps; P_i and P_l denote EfficientNetB7's intermediate and final feature maps, with P_{ai} and P_{al} the corresponding attention maps, and F is the complete set of features.
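A rough sketch of Equations (5)-(7) is given below, under the assumption that all eight maps have already been projected to a common channel count and spatial resolution; the element-wise products and the dummy tensor shapes are illustrative assumptions, not the paper's released code.

```python
import torch

def ensemble_features(D_i, D_l, P_i, P_l, D_ai, D_al, P_ai, P_al):
    """Fuse MobileNetV3 (D) and EfficientNetB7 (P) feature maps with their attention maps."""
    F_cat = torch.cat([D_i, D_l, P_i, P_l], dim=1)        # Eq. (5): stacked backbone features
    A_cat = torch.cat([D_ai, D_al, P_ai, P_al], dim=1)    # Eq. (6): stacked attention maps
    # Eq. (7): attention-weighted combination (element-wise products assumed)
    E_A = D_i * D_ai + D_l * D_al + P_i * P_ai + P_l * P_al
    return F_cat, A_cat, E_A

# Dummy maps already brought to a common 128-channel, 64x64 resolution
maps = [torch.randn(1, 128, 64, 64) for _ in range(8)]
F_cat, A_cat, E_A = ensemble_features(*maps)
print(F_cat.shape, E_A.shape)   # torch.Size([1, 512, 64, 64]) torch.Size([1, 128, 64, 64])
```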

3.3. Combining Adaptive Ensemble Attention with GAN

Integrating Adaptive Ensemble Attention into a Generative Adversarial Network (GAN) enhances the segmentation performance for ADAS. The generator in the GAN is equipped with self-attention, spatial attention, and channel attention mechanisms, adaptively fused to focus on relevant features based on the scene complexity. This ensures precise and context-aware segmentation outputs, making it ideal for diverse driving scenarios.
The generator uses self-attention to capture long-range dependencies within the image, such as aligning lane markings or detecting relationships between objects at a distance. This mechanism enables the model to understand the broader spatial context, which is crucial for segmenting large-scale structures like roads or detecting vehicles in distant lanes. Spatial attention is incorporated to identify critical regions in the image, such as traffic signs, pedestrians, or vehicles. It prioritizes spatially significant areas, ensuring the model pays more attention to parts of the image that are most relevant for segmentation. Channel attention complements these mechanisms by recalibrating feature maps to emphasize important features, such as road textures or vehicle edges, and suppress irrelevant information like background noise.
The AEA mechanism dynamically combines these three types of attention based on the input image and the driving scenario. For example, in open highway scenarios, self-attention may dominate to capture global dependencies such as lane alignment. Conversely, in crowded urban settings, spatial attention takes precedence to focus on pedestrians, parked vehicles, and other objects in close proximity. Regardless of the scenario, channel attention ensures that critical features are prioritized across the network. This adaptive integration makes the generator versatile and robust in handling varying complexities in real-world driving environments.
The discriminator in the GAN evaluates the segmentation mask generated by the generator. It distinguishes between real (ground truth) and fake (generated) masks by analysing the concatenated pair of the original image and its corresponding segmentation mask. Through adversarial training, the discriminator pushes the generator to refine its outputs, ensuring that the generated masks not only match the ground truth but also appear highly realistic. This feedback loop between the generator and discriminator significantly improves the overall segmentation quality. During training, the generator optimizes a combination of losses, including the segmentation loss (e.g., cross-entropy) to ensure pixel-wise accuracy and the adversarial loss to improve the mask’s realism. The discriminator, on the other hand, minimizes its error in distinguishing real masks from generated ones. This adversarial setup encourages the generator to produce masks that are increasingly difficult for the discriminator to classify as fake. By combining the strengths of GANs with Adaptive Ensemble Attention, the system dynamically adapts to the complexities of input images. This fusion enables robust and precise semantic segmentation across diverse environments, including urban traffic, highways, and rural roads. The integration of these attention mechanisms ensures that the GAN captures both global and local dependencies, handles occlusions effectively, and adapts to challenges like varying lighting and weather conditions. This approach makes Adaptive Ensemble Attention GAN a powerful tool for advancing the capabilities of ADAS in real-world autonomous driving applications.
Let X \in \mathbb{R}^{H \times W \times C} represent the input image, where H is the height, W is the width, and C is the number of channels. The Adaptive Ensemble Attention mechanism integrates three types of attention: self-attention (SA), spatial attention (SPA), and channel attention (CA).
Self-attention captures long-range dependencies within the image. It is typically computed as:
SA = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (8)
where
  • Q, K, and V are the query, key, and value matrices, respectively, derived from the input X.
  • dk is the dimensionality of the query and key vectors.
  • The Softmax operation ensures that the attention weights are normalized.
Spatial attention focuses on the important regions in the image. It can be computed by applying a convolutional layer followed by a sigmoid activation to highlight the most relevant spatial regions:
SPA(X) = \sigma\!\left(\mathrm{Conv}_{\theta}(X)\right) \quad (9)
where:
Conv_θ is a convolutional layer with parameters θ, designed to output a spatial attention map, and σ is the sigmoid function, ensuring the attention map lies between 0 and 1.
Channel attention recalibrates the feature maps to emphasize important channels and suppress irrelevant ones. This is typically computed using global average pooling followed by a learned fully connected layer:
CA(X) = \sigma\!\left(W_2 \cdot \mathrm{ReLU}\!\left(W_1 \cdot \mathrm{GlobalAvgPool}(X)\right)\right) \quad (10)
where:
  • GlobalAvgPool(X) is a global average pooling operation over the spatial dimensions H × W.
  • W_1 and W_2 are learned weight matrices, and σ is the sigmoid activation function.
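The three branches in Equations (8)-(10) might be realized as the PyTorch modules sketched below. The reduction ratio, the 7×7 kernel, the projection dimension d_k, and the choice to multiply the input by the spatial and channel maps (so that every branch returns a feature map of the same shape, as required by Equation (11)) are implementation assumptions rather than details reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Eq. (8): Softmax(QK^T / sqrt(d_k)) V over flattened spatial positions."""
    def __init__(self, channels, d_k=64):
        super().__init__()
        self.q = nn.Conv2d(channels, d_k, 1)
        self.k = nn.Conv2d(channels, d_k, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, HW, d_k)
        k = self.k(x).flatten(2)                      # (B, d_k, HW)
        v = self.v(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

class SpatialAttention(nn.Module):
    """Eq. (9): sigmoid(Conv_theta(X)) yields a spatial map, here applied to the input."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class ChannelAttention(nn.Module):
    """Eq. (10): sigmoid(W2 . ReLU(W1 . GlobalAvgPool(X))) rescales the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = F.adaptive_avg_pool2d(x, 1).view(b, c)    # global average pooling over H x W
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))
        return x * w.view(b, c, 1, 1)
```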
The adaptive ensemble of these attention mechanisms is a weighted combination of the three attention maps. Let αSA, αSPA, and αCA be the adaptive weights assigned to self-attention, spatial attention, and channel attention, respectively. The combined attention can be written as:
AEA(X) = \alpha_{SA}\, SA(X) + \alpha_{SPA}\, SPA(X) + \alpha_{CA}\, CA(X) \quad (11)
where:
αSA + αSPA + αCA = 1 to ensure a normalized combination.
The AEA combines three attention mechanisms—self-attention (SA), spatial attention (SPA), and channel attention (CA)—through a weighted sum, where the weights (αSA, αSPA, αCA) adaptively adjust based on the input image and driving scenario. The sum of the weights equals 1, ensuring a normalized combination.
To realize the AEA mechanism described in Equation (11), a weighted summation module was implemented that combines the outputs of the three attention branches: the SA, SPA, and CA. Each branch processes the same input feature map and outputs a feature map of identical shape, ensuring that the fusion operation is dimensionally consistent.
The scalar weights αSA, αSPA, and αCA are learnable parameters initialized uniformly (i.e., αi = 1/3) and are updated during training via backpropagation. To ensure that the final attention fusion remains normalized, we apply a softmax operation across the weights:
\alpha_i = \frac{e^{\beta_i}}{\sum_j e^{\beta_j}} \quad \text{for } i \in \{SA,\ SPA,\ CA\} \quad (12)
where β i are unconstrained learnable parameters and α i are the normalized attention weights.
This fusion layer is implemented as a lightweight module and placed after the feature extraction block. The resulting attention-enhanced feature map AEA(X) is then passed to the segmentation head for prediction. This adaptive fusion enables the model to emphasize the most informative attention mechanism dynamically, based on the input context.
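A minimal sketch of the fusion layer in Equations (11) and (12) is shown below; the three branch modules are passed in (for example, the hypothetical SelfAttention, SpatialAttention, and ChannelAttention sketches above), and the unconstrained weights β are initialized to zero so that each α_i starts at 1/3, as described in the text.

```python
import torch
import torch.nn as nn

class AdaptiveEnsembleAttention(nn.Module):
    """Softmax-normalized, learnable weighting of three attention branches."""
    def __init__(self, sa_branch, spa_branch, ca_branch):
        super().__init__()
        self.branches = nn.ModuleList([sa_branch, spa_branch, ca_branch])
        self.beta = nn.Parameter(torch.zeros(3))     # beta_i = 0  =>  alpha_i = 1/3 at start

    def forward(self, x):
        alpha = torch.softmax(self.beta, dim=0)      # Eq. (12): normalized alpha_SA, alpha_SPA, alpha_CA
        return sum(a * branch(x) for a, branch in zip(alpha, self.branches))  # Eq. (11)
```

For instance, AdaptiveEnsembleAttention(SelfAttention(256), SpatialAttention(256), ChannelAttention(256)) would fuse 256-channel feature maps before they are passed to the segmentation head.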
In the proposed model, the input driving scene image from the BDD100K dataset is first fed into a Generator network that integrates an AEA mechanism. This module combines self-attention, spatial attention, and channel attention adaptively to focus on the most relevant features of the scene, such as vehicles, roads, and pedestrians. The Generator processes these features and produces a pixel-wise segmentation mask that classifies each region of the image. To ensure the segmentation output is both accurate and realistic, the model employs a GAN framework, where the Discriminator receives both the real image-ground truth mask pair and the real image-generated mask pair. It evaluates how closely the generated mask resembles the true segmentation. This adversarial setup enables the Generator to continuously refine its output based on feedback from the Discriminator. Figure 1 shows the proposed architecture. The independent application of self-attention, spatial attention, and channel attention enhances segmentation; however, their integration within an Adaptive Ensemble Attention mechanism significantly elevates segmentation performance [31]. The model integrates these mechanisms adaptively, allowing it to dynamically assess the relative significance of each attention type according to the input image and scene complexity.
Unlike prior attention-based GANs such as George et al. [10], which focus on binary road segmentation, the proposed method targets more complex multi-class segmentation in highly variable street environments. Furthermore, compared to the static attention modules in [31] and the classification-oriented ensemble attention in [30], the proposed adaptive ensemble attention module dynamically recalibrates spatial and channel-wise features during both the generation and discrimination stages, enabling context-aware feature refinement across different scene types.
The proposed generator in the AEA-GAN is designed as a hybrid architecture that combines MobileNetV3 and EfficientNetB7 to balance computational efficiency and feature richness. MobileNetV3, known for its lightweight and fast inference capabilities, is used to extract low-level and intermediate features, making it suitable for real-time applications such as autonomous driving. In contrast, EfficientNetB7, with its deeper and more expressive layers, contributes high-resolution and semantically rich features. To effectively merge the strengths of both backbones, the architecture employs an AEA module that integrates self-attention, spatial attention, and channel attention mechanisms. Self-attention captures global contextual dependencies, spatial attention highlights critical areas such as pedestrians and vehicles, and channel attention prioritizes the most informative feature maps.

3.4. Dataset

One of the biggest and most varied driving video datasets is the BDD100K dataset, which is used to test and train autonomous driving systems. It was developed to tackle a number of computer vision tasks, including tracking, motion forecasting, lane detection, semantic and instance segmentation, and object detection. The dataset consists of 100,000 high-resolution video clips, each lasting around 40 s, captured in diverse weather conditions, times of day, and geographical locations [32].
For the purpose of training and evaluation, the BDD100K dataset was divided into three subsets: 70% of the data was allocated for training, 15% for validation, and the remaining 15% for testing. This split ensured that the model was trained on a diverse range of samples while preserving a separate validation set to tune hyperparameters and monitor overfitting, and an independent test set to evaluate final performance. The dataset split was performed randomly while maintaining class distribution to ensure balanced representation across all subsets.
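Purely as an illustration of the 70/15/15 partitioning described above, a seeded random split over sample identifiers could be written as follows; the class-balancing step used in the paper is not reproduced here.

```python
import random

def split_dataset(sample_ids, seed=42):
    """Shuffle sample IDs reproducibly and split them 70/15/15 into train/val/test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```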
In addition to BDD100K, we conducted evaluations of the proposed AEA-GAN on two benchmark datasets—Cityscapes [33] and KITTI [28]—to assess the model’s generalizability. Cityscapes offers dense pixel-level annotations of urban scenes, while KITTI provides driving scenarios from highways and suburban areas. These additional tests, though limited in scale, showed that AEA-GAN maintains strong segmentation performance across varying environmental conditions and scene complexities.

3.5. Implementation Process

The implementation process of the project comprises several key stages, as illustrated in Figure 2, beginning with the input images that undergo an enhancement step to improve their quality and prepare them for further processing. The enhanced images are then passed into the Adaptive Ensemble Attention-GAN, which integrates an adaptive ensemble attention mechanism for effective feature extraction and refinement. Within this module, a training phase for layers is conducted to optimize the performance of the network. Following this, testing images are fed into the trained network, which generates segmentation results by leveraging the refined features and the adaptive attention mechanism. The segmentation results are visually analyzed and further evaluated against ground truth data to assess the accuracy and effectiveness of the proposed system. This streamlined pipeline is designed to enhance segmentation performance, making it particularly suitable for applications such as autonomous driving and complex scene understanding.

Training Setup Details

The training was conducted using the following configuration (a minimal sketch of these settings follows the list):
  • Learning Rate: Initially set to 0.001 with a step decay schedule to reduce it by a factor of 0.1 every 100 epochs.
  • Optimizer: Adam optimizer was used for both the generator and discriminator due to its efficiency in converging GANs.
  • Weight Initialization: Xavier (Glorot) initialization was used for convolutional layers to ensure stable gradients.
  • Regularization: Dropout (rate = 0.5) was applied in intermediate layers to prevent overfitting. L2 regularization (λ = 0.0001) was also used.
  • Data Augmentation: Applied random horizontal flipping, brightness variation, random cropping, and slight rotations to improve robustness and generalization.
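Assuming a standard PyTorch training loop, the listed settings might be configured roughly as follows; the generator G, discriminator D, and the crop size are placeholders, and Adam's weight_decay argument is used here to apply the stated L2 regularization.

```python
import torch
import torch.nn as nn
from torchvision import transforms

def configure_training(G, D):
    # Adam for both networks, with L2 regularization (lambda = 0.0001) via weight_decay
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3, weight_decay=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3, weight_decay=1e-4)
    # Step decay: reduce the learning rate by a factor of 0.1 every 100 epochs
    sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=100, gamma=0.1)
    sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=100, gamma=0.1)
    return opt_g, opt_d, sched_g, sched_d

def init_weights(m):
    # Xavier (Glorot) initialization for convolutional layers
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Illustrative image-level augmentation (crop size is a placeholder); for segmentation,
# the same geometric transforms must also be applied to the ground-truth masks.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomCrop(512, pad_if_needed=True),
    transforms.RandomRotation(degrees=5),
])
```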

3.6. Evaluation Metrics

In image segmentation tasks, pixel accuracy and mean F1 score are frequently used evaluation metrics [34]. The Intersection over Union (IoU) is a reliable indicator of segmentation quality since it calculates the overlap between the ground truth and the predicted segmentation by dividing their intersection by their union. Pixel accuracy calculates the proportion of correctly classified pixels across the entire image, offering a straightforward measure of overall performance. Meanwhile, the mean F1 score, which balances precision and recall, is used to assess the harmonic mean of these two metrics across all classes, making it particularly effective in scenarios with imbalanced datasets. Together, these metrics provide a comprehensive evaluation of segmentation models.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (13)
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)
F1\text{-}\mathrm{Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (16)
\mathrm{AUC\,(PPR)} = 2 \cdot \frac{FP}{FP + TN} \quad (17)
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} \quad (18)
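For reference, the per-class counts and the IoU defined above can be computed from integer label maps as in the following NumPy sketch; the flattening and class-indexing conventions are assumptions.

```python
import numpy as np

def confusion_counts(pred, target, cls):
    """TP/FP/TN/FN for one class from prediction and ground-truth label maps of equal shape."""
    p, t = (pred == cls), (target == cls)
    tp = int(np.sum(p & t))
    fp = int(np.sum(p & ~t))
    fn = int(np.sum(~p & t))
    tn = int(np.sum(~p & ~t))
    return tp, fp, tn, fn

def class_iou(pred, target, cls):
    """IoU = |A intersect B| / |A union B| for a single class."""
    tp, fp, tn, fn = confusion_counts(pred, target, cls)
    union = tp + fp + fn
    return tp / union if union > 0 else float("nan")

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, ignoring classes absent from both prediction and ground truth."""
    ious = [class_iou(pred, target, c) for c in range(num_classes)]
    ious = [v for v in ious if not np.isnan(v)]
    return float(np.mean(ious)) if ious else float("nan")
```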

4. Results

This section presents the results obtained from the experiments conducted to evaluate the performance of the proposed image segmentation model. The findings provide a comprehensive overview of the model’s capabilities and effectiveness across various evaluation criteria. The section includes an analysis of the experimental setup, visual comparisons of segmentation outputs, performance metrics displayed through graphical analysis, and a comparative study with existing models in the field.

4.1. Experimental Setup

The suggested AEA-GAN outperforms conventional attention mechanisms for ADAS tasks in terms of accuracy, robustness, and computational efficiency, based on experimental results on benchmark datasets such as BDD100K. The experiments were conducted on a system with 16 GB of GPU memory. The training process spanned 50 epochs, with a batch size of 16.

4.2. Visual Results of Segmentation Outputs—BDD100K Dataset

Qualitative results include segmentation maps that delineate objects like vehicles, pedestrians, and road signs. Attention heatmaps illustrate the mechanism’s ability to focus on relevant regions. Figure 3, provided above, demonstrates the results of a segmentation task in the proposed system. The first image represents the original input taken from the study’s dataset, showcasing a real-world urban environment. The second image shows ground truth segmentation, where different regions (such as roads, vehicles, buildings, and trees) are labeled with distinct colors to represent various semantic classes. The third image displays the output of the proposed segmentation method, which appears to align with the ground truth closely. This comparison highlights the model’s effectiveness in accurately identifying and segmenting critical objects and regions in the scene, such as road surfaces, vehicles, and background elements, which is essential for applications like autonomous driving and urban scene understanding.
The training and validation curves shown in Figure 4 and Figure 5 indicate that the proposed model learned effectively and achieved high performance. The training accuracy rapidly increased and stabilized close to 100%, while the validation accuracy consistently remained around 94–96%, demonstrating good generalization. The training loss decreased steadily to near zero, showing efficient convergence, while the validation loss initially dropped and then fluctuated slightly, which aligns with observations in prior segmentation studies, where the complexity of dense prediction tasks often leads to mild loss instability during validation [19,35]. Overall, the model exhibited fast and stable learning with minimal overfitting, highlighting the robustness and effectiveness of the Adaptive Ensemble Attention mechanism within the GAN framework.

4.3. Visual Results of Segmentation Outputs—CITYSCAPE Dataset

The visual results of segmentation outputs on the Cityscapes dataset provide a qualitative evaluation of the model’s performance in Figure 6 by identifying and classifying various objects within urban street scenes. These results offer insight into how accurately the model distinguishes between different semantic classes such as roads, pedestrians, vehicles, buildings, and vegetation. By comparing the predicted segmentation maps with the ground truth annotations, the effectiveness and reliability of the model in real-world scenarios can be visually assessed. This visual validation complements the quantitative metrics and highlights the model’s ability to preserve spatial details and boundaries.
Figure 7 and Figure 8 illustrate the training accuracy and validation loss trends for the Cityscapes dataset across 100 epochs. As shown in Figure 7, the training accuracy steadily improves, reaching close to 100%, indicating that the model effectively learns the training data. Figure 8 shows the validation loss gradually decreasing, with minor fluctuations toward the later epochs. This suggests that while the model generalizes well, there may be slight overfitting or variability in the validation performance. Overall, the training progress reflects strong learning and stable model behavior.

4.4. Visual Results of Segmentation Outputs—KITTI Dataset

The visual results of segmentation outputs on the KITTI dataset are presented in Figure 9 and provide a qualitative assessment of the model's performance in urban driving environments. This dataset, known for its real-world street-level scenes, helps evaluate how well the model identifies and segments key objects such as roads, vehicles, pedestrians, and lane markings. By visually comparing the predicted outputs with the ground truth, these results highlight the model's ability to maintain spatial accuracy and semantic consistency, reflecting its suitability for autonomous driving applications.
The training accuracy and validation loss are presented in Figure 10 and Figure 11, respectively. The plots illustrate the training progress of a model over 100 epochs for the KITTI dataset, focusing on accuracy and loss for both the training and validation sets. In the accuracy plot, the training accuracy steadily improves and converges near 100%, indicating that the model has effectively learned from the training data. The validation accuracy also rises sharply in the initial epochs and stabilizes around 90%, though with slight fluctuations, suggesting good generalization with minor instability across validation samples.
In the loss plot, training loss decreases rapidly and approaches zero, confirming strong learning efficiency. The validation loss follows a similar downward trend initially but shows occasional spikes in the later epochs, which could be due to overfitting or variability in the validation data. Overall, the model demonstrates strong learning performance, with high training accuracy and low training loss, alongside reasonably stable validation metrics, making it suitable for real-world deployment with minimal tuning.
Figure 12 illustrates the performance of the proposed hybrid model across three widely-used datasets: Cityscapes, BDD100K, and KITTI. The model demonstrates consistently high performance across all evaluation metrics, including Accuracy, Precision, Recall, F1-Score, AUC, and mIoU. On the Cityscapes dataset, the model achieved an accuracy of 98.88%, precision of 98.85%, recall of 98.86%, F1-score of 98.85%, AUC of 98.10%, and a mean Intersection over Union (mIoU) of 89.02%. For BDD100K, it showed slightly improved results with an accuracy of 98.94%, precision of 98.91%, recall of 98.93%, F1-score of 98.91%, AUC of 98.40%, and the highest mIoU of 89.46% among all datasets. On the KITTI dataset, the model maintained robust accuracy (98.79%) and F1-score (98.76%), though the mIoU was slightly lower at 88.13%. Overall, the hybrid model delivers strong and balanced performance, particularly excelling in segmentation quality on the BDD100K dataset, confirming its suitability for real-world ADAS applications.

4.5. Ablation Study on Backbone Networks

To assess the contribution of the hybrid backbone in the proposed AEA-GAN architecture, its performance was compared against two individual backbone settings: using MobileNetV3 alone and EfficientNetB7 alone. As shown in Table 1, the hybrid approach achieved a notable improvement in all evaluation metrics, with an mIOU of 89.46%, outperforming MobileNetV3 (85.21%) and EfficientNetB7 (87.38%).
The results data in Table 1 indicate that combining the fast, lightweight feature extraction of MobileNetV3 with the deep semantic representation of EfficientNetB7 provides a more robust and efficient backbone, especially beneficial for complex scene understanding in autonomous driving applications.

4.6. Ablation Study of Framework Components

To validate the effectiveness of each core component in the proposed AEA-GAN architecture, we conducted a detailed ablation study. This involved systematically modifying or removing key parts of the framework to observe their individual impact on performance. First, we removed the GAN component and trained the model solely with the segmentation loss. This version (without the adversarial discriminator) led to a noticeable drop in performance, with the mIoU decreasing from 89.46% to 85.72%, demonstrating the importance of the GAN in producing realistic and refined segmentation maps.
Next, we disabled the AEA module entirely. Without the self-attention, spatial attention, and channel attention branches, the model’s mIoU further dropped to 84.93%. This confirms that attention mechanisms play a significant role in guiding the model to focus on relevant regions and features in the input. We then evaluated the contribution of each attention mechanism by removing it one at a time. Without self-attention, the model achieved 84.66% mIoU; removing spatial attention yielded 84.52%, and without channel attention, the mIoU declined slightly further to 84.49%. These results show that all three types of attention—self, spatial, and channel—contribute meaningfully to the final performance, and their combined use yields better accuracy.
As shown in Table 2, we replaced the adaptive fusion in AEA with a standard, non-adaptive attention mechanism using fixed weights. This version achieved a slightly improved mIoU of 85.39% compared to using no attention at all, but still underperformed compared to the proposed AEA-GAN with dynamic weighting. This highlights the advantage of adaptively learning attention importance based on input complexity. Overall, this ablation study confirms that each component of the framework—GAN structure, attention mechanisms, and adaptive attention fusion—plays a vital role in achieving superior semantic segmentation performance in autonomous driving scenarios.
The numerical results of the evaluation metrics for the effect of individual attention mechanisms in the proposed method are presented in Table 3. The results demonstrate its superior performance across various indicators. The spatial attention and self-attention mechanisms both achieved an accuracy of 97.9%, precision of 97.95%, recall of 97.95%, F1-score of 97.9%, and an AUC of 96%. In comparison, the proposed method outperformed these approaches, achieving an impressive accuracy of 98.94%, precision of 98.91%, recall of 98.93%, F1-score of 98.91%, and an AUC of 98.4%. These results indicate that the proposed method effectively integrates channel and spatial features, delivering better overall performance and robustness for vision-based tasks.
To test the performance of the proposed method, the combined-attention configuration was compared against Majority Voting and Weighted Averaging, two commonly used ensemble techniques in machine learning that combine the outputs of multiple models to make a final decision or prediction. As shown in Table 4, the Majority Voting and Weighted Averaging configurations achieved similar performance, with an accuracy of 97.9%, precision and recall of 97.95%, F1-score of 97.9%, and AUC of 96%. The proposed method significantly outperformed these configurations, achieving an accuracy of 98.94%, precision of 98.91%, recall of 98.93%, F1-score of 98.91%, and AUC of 98.4%, demonstrating the superiority of the proposed approach in leveraging combined attention mechanisms for improved performance in vision tasks.

4.7. Runtime Performance and Model Complexity

The results presented in Table 5 demonstrate the trade-offs between model complexity and segmentation accuracy for different backbone configurations within the proposed AEA-GAN framework. The MobileNetV3-only variant achieved the highest FPS (38) and the lowest computational complexity (36.4 GFLOPs) with 17.8 million parameters, making it the most lightweight but also the least accurate, with an mIOU of 85.21%. In contrast, the EfficientNetB7-only variant produced improved accuracy (mIOU of 87.38%) but at the cost of significantly higher computational overhead: 143.5 GFLOPs, 60.2 million parameters, and a reduced inference speed of 20 FPS.
According to the data in Table 5, the proposed hybrid configuration balances the strengths of both networks, delivering the best segmentation performance (mIOU of 89.46%) with a moderate model size (39.5 million parameters) and acceptable inference speed of 26 FPS. All experiments were conducted on an NVIDIA T4 GPU, ensuring consistent and realistic performance evaluation. These results confirm the hybrid AEA-GAN’s suitability for real-time or near-real-time deployment in ADAS applications, offering an optimal balance between accuracy and efficiency.

5. Discussion

A comparative analysis was performed against several cutting-edge models frequently employed in autonomous driving and semantic segmentation tasks to evaluate the efficacy of the suggested segmentation model.
The obtained results, as demonstrated in Table 6, highlight a clear progression in semantic segmentation model performance over time, as measured by the mean Intersection over Union (IoU). Earlier models, such as SegNet 4D and Q-YoloP, with mIOUs of 55.2% and 61.2%, respectively, show limited capability in accurately segmenting image regions, likely due to less effective architectures and feature extraction mechanisms. More recent approaches, including HRDLNet, DLT-Net, and Yolo-P, demonstrate moderate improvements, achieving mIOUs in the 70–73% range. These models benefit from more advanced designs and deeper networks, allowing for better context understanding. Segformer, a transformer-based model, further raises the bar with an mIOU of 75.08%, reflecting the impact of attention mechanisms in capturing long-range dependencies. The IDS model and the standard U-Net show competitive performance with mIOUs of 83.63% and 84.16%, respectively, indicating the robustness of encoder-decoder architectures in segmentation tasks.
On the other hand, the reported segmentation accuracy of 98.94% on the BDD100K dataset may appear high compared to other methods. However, this metric reflects pixel-wise classification accuracy, which tends to be inflated in datasets where background pixels dominate. To address this, we also report the mean Intersection over Union (mIoU) of 89.46%, which provides a more balanced evaluation across classes. To prevent overfitting and validate generalizability:
  • We applied diverse data augmentation techniques (flipping, brightness variation, cropping).
  • Dropout and L2 weight decay were used during training, and the model employed early stopping based on validation loss trends.
  • Ablation studies and backbone comparisons were performed to confirm consistent improvements.
Furthermore, training and validation curves showed no signs of divergence, and validation metrics remained stable, confirming the robustness of the training process. These precautions collectively ensure that the high accuracy is not a result of overfitting, but rather due to the effectiveness of the AEA-GAN architecture. Notably, the proposed Adaptive Ensemble GAN significantly outperforms all existing methods, achieving an impressive mIOU of 89.46%. This suggests that the ensemble strategy and adversarial learning framework contribute to enhanced generalization and more accurate segmentation, establishing it as a state-of-the-art approach.
The performance across the Cityscapes and KITTI datasets also demonstrates the strength of the proposed AEA-GAN model. On Cityscapes, AEA-GAN achieves a remarkable mIoU of 89.02 ± 0.2, outperforming established state-of-the-art methods such as MaskFormer (84.3 mIoU) and SegFormer (84.0 mIoU). Similarly, on KITTI, AEA-GAN attains 88.13 ± 0.3 mIoU, surpassing Spherical Transformer (74.8 mIoU) and other leading models. These results highlight AEA-GAN's robustness as a hybrid method, likely due to its effective integration of adversarial training and attention mechanisms, which enables superior feature extraction and contextual understanding compared to purely transformer-based (e.g., MaskFormer) or CNN-based (e.g., HRNet + OCR) approaches. The consistency of AEA-GAN's performance, indicated by low standard deviations (±0.2–0.3), further underscores its reliability. In contrast, while transformer-based models like SegFormer and MaskFormer excel on Cityscapes, their performance gaps on KITTI (e.g., SegFormer's 75.08 mIoU adaptation) suggest challenges in generalizing to datasets with sparser annotations or varied environments. Real-time models (e.g., 2DPASS, VPLR) trade accuracy for speed, but AEA-GAN's hybrid design appears to bridge this gap, achieving high accuracy without compromising efficiency.
These advancements position AEA-GAN as a versatile solution for autonomous driving tasks, though further validation on larger datasets (e.g., BDD100K) and edge-device deployment tests is needed to assess scalability. The results also emphasize the importance of hybrid architectures in pushing the boundaries of semantic segmentation beyond the limitations of single-paradigm models.
Wiseman [44] demonstrated the challenge of misidentifying splashed water or hail as vehicles, which remains a significant limitation for vision-based driving systems. Our approach addresses this issue by employing an Adaptive Ensemble Attention mechanism that focuses on semantically consistent regions and dynamically filters out contextually irrelevant noise. Although our dataset (BDD100K) does not provide explicit splash or hail annotations, the model’s performance on scenes with rain, reflections, and occlusions indicates improved resilience to such anomalies.
Although this study demonstrates promising results, it is important to acknowledge several limitations that may guide future research. First, the model’s performance may deteriorate under challenging conditions involving noisy inputs, such as images affected by motion blur, sensor noise, or other artifacts. While the architecture performs robustly in controlled settings, it has not been comprehensively evaluated under such real-world variability. Second, the model exhibits potential limitations in detecting rare or underrepresented classes due to imbalanced training data. These minority classes are often overshadowed by more frequent categories, leading to diminished recall and reduced performance in critical applications where such instances are essential.
To address these challenges, future work could investigate robust training strategies—including adversarial data augmentation and denoising modules—to improve resilience in noisy environments. Tackling data imbalance may benefit from the integration of class reweighting techniques or few-shot learning methods to enhance the representation of rare categories. Moreover, employing self-supervised learning paradigms or transfer learning from pre-trained models could reduce the reliance on large labeled datasets and improve generalizability.

6. Conclusions

The proposed system outperforms baseline models in accuracy and robustness. The dynamic attention mechanism effectively addresses challenges such as occlusions and low-contrast regions. However, computational complexity remains a limitation, suggesting the need for optimization in future work. The results obtained in this paper point to the possibility of advancing the performance of intelligent driving systems by introducing an adaptive, context-aware attention mechanism in real-time scenarios. This work opens the way for more resilient and scalable ADAS technologies that address the demands of modern autonomous and assisted driving applications.

Author Contributions

B.S.A.J. implemented the proposed architecture, performed data preprocessing, conducted experimental evaluations, analyzed the results, and interpreted the findings. M.H. conceptualized the research idea, designed the overall methodology, and revised and finalized the version of the paper for submission. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and publication of this article.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data supporting the findings of this study are available within the manuscript and are also available at the following URL: https://github.com/Raashid2016/DAEA-GAN (accessed on 15 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, H.; Huo, S.; Zhu, M.; Gong, Y.; Xiang, Y. Machine learning-based vehicle intention trajectory recognition and prediction for autonomous driving. arXiv 2024, arXiv:2402.16036. [Google Scholar]
  2. Guan, L.; Yuan, X. Dynamic weighting and boundary-aware active domain adaptation for semantic segmentation in autonomous driving environment. IEEE Trans. Intell. Transp. Syst. 2024, 25, 18461–18471. [Google Scholar] [CrossRef]
  3. Sun, C.; Zhao, H.; Mu, L.; Xu, F.; Lu, L. Image Semantic Segmentation for Autonomous Driving Based on Improved U-Net. C-Comput. Model. Eng. Sci. 2023, 136, 787–801. [Google Scholar] [CrossRef]
  4. Khairnar, S.; Thepade, S.D.; Kolekar, S.; Gite, S.; Pradhan, B.; Alamri, A.; Patil, B.; Dahake, S.; Gaikwad, R.; Chaudhari, A. Enhancing semantic segmentation for autonomous vehicle scene understanding in indian context using modified CANet model. MethodsX 2025, 14, 103131. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, Y.; Wang, Y.; Li, Q. Lane detection based on real-time semantic segmentation for end-to-end autonomous driving under low-light conditions. Digit. Signal Process. 2024, 155, 104752. [Google Scholar] [CrossRef]
  6. Hao, W.; Wang, J.; Lu, H. A Real-Time Semantic Segmentation Method Based on Transformer for Autonomous Driving. Comput. Mater. Contin. 2024, 81, 4419–4433. [Google Scholar] [CrossRef]
  7. Unar, S.; Su, Y.; Zhao, X.; Liu, P.; Wang, Y.; Fu, X. Towards applying image retrieval approach for finding semantic locations in autonomous vehicles. Multimed. Tools Appl. 2024, 83, 20537–20558. [Google Scholar] [CrossRef]
  8. Zhang, C.; Lu, W.; Wu, J.; Ni, C.; Wang, H. SegNet network architecture for deep learning image segmentation and its integrated applications and prospects. Acad. J. Sci. Technol. 2024, 9, 224–229. [Google Scholar] [CrossRef]
  9. Mei, J.; Zhou, T.; Huang, K.; Zhang, Y.; Zhou, Y.; Wu, Y.; Fu, H. A survey on deep learning for polyp segmentation: Techniques, challenges and future trends. Vis. Intell. 2025, 3, 1. [Google Scholar] [CrossRef]
  10. George, G.V.; Hussain, M.S.; Hussain, R.; Jenicka, S. Efficient Road Segmentation Techniques with Attention-Enhanced Conditional GANs. SN Comput. Sci. 2024, 5, 176. [Google Scholar] [CrossRef]
  11. Liang, X.; Li, C.; Tian, L. Generative adversarial network for semi-supervised image captioning. Comput. Vis. Image Underst. 2024, 249, 104199. [Google Scholar] [CrossRef]
  12. Ma, X.; Hu, K.; Sun, X.; Chen, S. Adaptive Attention Module for Image Recognition Systems in Autonomous Driving. Int. J. Intell. Syst. 2024, 2024, 3934270. [Google Scholar] [CrossRef]
  13. Wang, N.; Guo, R.; Shi, C.; Wang, Z.; Zhang, H.; Lu, H.; Zheng, Z.; Chen, X. SegNet4D: Efficient Instance-Aware 4D Semantic Segmentation for LiDAR Point Cloud. IEEE Trans. Autom. Sci. Eng. 2025, 22, 15339–15350. [Google Scholar] [CrossRef]
  14. Hofmarcher, M.; Unterthiner, T.; Arjona-Medina, J.; Klambauer, G.; Hochreiter, S.; Nessler, B. Visual scene understanding for autonomous driving using semantic segmentation. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer: Cham, Switzerland, 2019; pp. 285–296. [Google Scholar]
  15. Chang, C.C.; Lin, W.C.; Wang, P.S.; Yu, S.F.; Lu, Y.C.; Lin, K.C.; Wu, K.C. Q-YOLOP: Quantization-aware you only look once for panoptic driving perception. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA; pp. 52–56. [Google Scholar]
  16. Luo, T.; Chen, Y.; Luan, T.; Cai, B.; Chen, L.; Wang, H. Ids-model: An efficient multitask model of road scene instance and drivable area segmentation for autonomous driving. IEEE Trans. Transp. Electrif. 2023, 10, 1454–1464. [Google Scholar] [CrossRef]
  17. Siam, M.; Gamal, M.; Abdel-Razek, M.; Yogamani, S.; Jagersand, M.; Zhang, H. A comparative study of real-time semantic segmentation for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 587–597. [Google Scholar]
  18. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
  19. Chen, W.; Miao, Z.; Qu, Y.; Shi, G. HRDLNet: A semantic segmentation network with high resolution representation for urban street view images. Complex Intell. Syst. 2024, 10, 7825–7844. [Google Scholar] [CrossRef]
  20. Zhou, W.; Berrio, J.S.; Worrall, S.; Nebot, E. Automated Evaluation of Semantic Segmentation Robustness for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1951–1963. [Google Scholar] [CrossRef]
  21. Qian, Y.; Dolan, J.M.; Yang, M. DLT-Net: Joint detection of drivable areas, lane lines, and traffic objects. IEEE Trans. Intell. Transp. Syst. 2019, 21, 4670–4679. [Google Scholar] [CrossRef]
  22. Aksoy, E.E.; Baci, S.; Cavdar, S. Salsanet: Fast road and vehicle segmentation in lidar point clouds for autonomous driving. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 926–932. [Google Scholar]
  23. Wu, D.; Liao, M.W.; Zhang, W.T.; Wang, X.G.; Bai, X.; Cheng, W.Q.; Liu, W.Y. Yolop: You only look once for panoptic driving perception. Mach. Intell. Res. 2022, 19, 550–562. [Google Scholar] [CrossRef]
  24. Fujiyoshi, H.; Hirakawa, T.; Yamashita, T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019, 43, 244–252. [Google Scholar] [CrossRef]
  25. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  26. Yang, J.; Guo, S.; Bocus, M.J.; Chen, Q.; Fan, R. Semantic segmentation for autonomous driving. In Autonomous Driving Perception: Fundamentals and Applications; Springer: Singapore, 2023; pp. 101–137. [Google Scholar]
  27. Yin, W.; Liu, Y.; Shen, C.; Sun, B.; van den Hengel, A. Scaling up multi-domain semantic segmentation with sentence embeddings. Int. J. Comput. Vis. 2024, 132, 4036–4051. [Google Scholar] [CrossRef]
  28. Zhu, Y.; Sapra, K.; Reda, F.A.; Shih, K.J.; Newsam, S.; Tao, A.; Catanzaro, B. Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8856–8865. [Google Scholar]
  29. Al-Ajlan, M.; Ykhlef, M. A Review of Generative Adversarial Networks for Intrusion Detection Systems: Advances, Challenges, and Future Directions. Comput. Mater. Contin. 2024, 81, 2053–2076. [Google Scholar] [CrossRef]
  30. Celik, F.; Celik, K.; Celik, A. Enhancing brain tumor classification through ensemble attention mechanism. Sci. Rep. 2024, 14, 22260. [Google Scholar] [CrossRef] [PubMed]
  31. Zhao, Q.; Liu, J.; Li, Y.; Zhang, H. Semantic segmentation with attention mechanism for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5403913. [Google Scholar] [CrossRef]
  32. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar]
  33. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  34. Krstinić, D.; Braović, M.; Šerić, L.; Božić-Štulić, D. Multi-label classifier performance evaluation with confusion matrix. Comput. Sci. Inf. Technol. 2020, 1, 1–4. [Google Scholar]
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  36. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  37. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 173–190. [Google Scholar]
  38. Lo, S.Y.; Hang, H.M.; Chan, S.W.; Lin, J.J. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  39. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  40. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539. [Google Scholar]
  41. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085. [Google Scholar]
  42. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 677–695. [Google Scholar]
  43. Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17545–17555. [Google Scholar]
  44. Wiseman, Y. Real-time monitoring of traffic congestions. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; pp. 501–505. [Google Scholar]
Figure 1. Architecture of the proposed AEA-GAN model.
Figure 2. Implementation process diagram.
Figure 3. Segmentation results of the proposed method for the BDD100K dataset.
Figure 4. Training accuracy (BDD100K dataset).
Figure 5. Validation loss (BDD100K dataset).
Figure 6. Segmentation results of the proposed method for the Cityscapes dataset.
Figure 7. Training accuracy (Cityscapes dataset).
Figure 8. Validation loss (Cityscapes dataset).
Figure 9. Segmentation results of the proposed method for the KITTI dataset.
Figure 10. Training accuracy (KITTI dataset).
Figure 11. Validation loss (KITTI dataset).
Figure 12. Comparison of the proposed method across all three datasets.
Table 1. Performance comparison of different backbone configurations.

| Backbone Architecture | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| MobileNetV3 Only | 97.65 | 97.60 | 97.62 | 97.61 | 95.1 | 85.21 |
| EfficientNetB7 Only | 98.12 | 98.05 | 98.09 | 98.07 | 96.8 | 87.38 |
| Proposed Hybrid Model | 98.94 | 98.91 | 98.93 | 98.91 | 98.4 | 89.46 |
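The mIoU and pixel-accuracy figures reported in Table 1 and the tables below are standard confusion-matrix statistics. As a point of reference, the following is a minimal NumPy sketch of how per-class IoU, mIoU, and pixel accuracy can be computed; the function names, the 19-class setting, and the random label maps are illustrative placeholders and are not taken from the paper's code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix
    from integer label maps of any (identical) shape."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_accuracy(cm):
    """Per-class IoU = TP / (TP + FP + FN); mIoU is their mean.
    Pixel accuracy = trace / total. Classes absent from both prediction
    and ground truth contribute IoU 0 here; real evaluation code usually
    excludes them from the mean."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean(), tp.sum() / np.maximum(cm.sum(), 1)

# Toy usage with random label maps (19 classes, Cityscapes-style).
pred = np.random.randint(0, 19, size=(512, 1024))
target = np.random.randint(0, 19, size=(512, 1024))
cm = confusion_matrix(pred, target, num_classes=19)
miou, acc = miou_and_accuracy(cm)
print(f"mIoU: {miou:.4f}  pixel accuracy: {acc:.4f}")
```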
Table 2. Performance of different framework components.

| Model Variant | Accuracy (%) | F1-Score (%) | AUC (%) | mIoU (%) |
|---|---|---|---|---|
| Full AEA-GAN (Proposed) | 98.94 | 98.91 | 98.4 | 89.46 |
| Without GAN (only AEA-based U-Net) | 97.83 | 97.79 | 96.7 | 85.72 |
| Without AEA (no attention mechanisms) | 97.65 | 97.58 | 96.4 | 84.93 |
| Without self-attention | 97.61 | 97.55 | 96.2 | 84.66 |
| Without spatial attention | 97.62 | 97.53 | 96.1 | 84.52 |
| Without channel attention | 97.60 | 97.51 | 96.0 | 84.49 |
| With standard attention (non-adaptive) | 97.84 | 97.77 | 96.6 | 85.39 |
Table 3. Results of the individual attention mechanisms.

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| Spatial attention | 97.9 | 97.95 | 97.95 | 97.9 | 96.0 | 85.41 |
| Self-attention | 97.9 | 97.95 | 97.95 | 97.9 | 96.0 | 86.62 |
| Proposed | 98.94 | 98.91 | 98.93 | 98.91 | 98.4 | 89.46 |
Table 4. Results of combined attention in different configurations.

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| Majority voting | 97.9 | 97.95 | 97.95 | 97.9 | 96.0 | 87.42 |
| Weighted averaging | 97.9 | 97.95 | 97.95 | 97.9 | 96.0 | 88.46 |
| Proposed hybrid | 98.94 | 98.91 | 98.93 | 98.91 | 98.4 | 89.46 |
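To make the contrast behind Tables 3 and 4 concrete, the sketch below compares a fixed weighted average of three attention branches with an input-conditioned (adaptive) gate, which is the general idea behind adaptive ensembling of attention. This is an illustrative PyTorch sketch under assumed tensor shapes, not the authors' implementation; the AdaptiveGate module and the random branch tensors are hypothetical stand-ins for the self-, spatial-, and channel-attention outputs.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Predicts per-branch mixing weights from the input feature map,
    so each attention branch's contribution varies per image."""
    def __init__(self, channels, num_branches=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_branches)

    def forward(self, x, branch_outputs):
        # branch_outputs: list of tensors shaped like x, i.e. (B, C, H, W)
        w = torch.softmax(self.fc(self.pool(x).flatten(1)), dim=1)  # (B, K)
        stacked = torch.stack(branch_outputs, dim=1)                # (B, K, C, H, W)
        return (w[:, :, None, None, None] * stacked).sum(dim=1)     # (B, C, H, W)

# Fixed weighted averaging (Table 4 baseline) vs. adaptive gating:
B, C, H, W = 2, 64, 32, 32
x = torch.randn(B, C, H, W)
branches = [torch.randn(B, C, H, W) for _ in range(3)]  # stand-ins for attention outputs

fixed = sum(0.33 * b for b in branches)   # static weights, identical for every input
adaptive = AdaptiveGate(C)(x, branches)   # weights predicted from the input itself
print(fixed.shape, adaptive.shape)        # torch.Size([2, 64, 32, 32]) twice
```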
Table 5. Runtime performance.

| Backbone Configuration | mIoU (%) | Params (M) | FLOPs (G) | Inference Time (ms) | FPS |
|---|---|---|---|---|---|
| MobileNetV3 Only | 85.21 | 17.8 | 36.4 | 6.3 | 38 |
| EfficientNetB7 Only | 87.38 | 60.2 | 143.5 | 5.0 | 20 |
| Proposed Hybrid (AEA-GAN) | 89.46 | 39.5 | 92.1 | 3.0 | 26 |
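For context on Table 5, latency, FPS, and parameter counts of this kind are typically measured with a warm-up-then-time loop such as the one sketched below. The benchmark function, the toy model, and the input resolution are placeholders and are not taken from the paper; FLOPs would additionally require an external profiler (e.g., fvcore or thop) and are omitted here.

```python
import time
import torch
import torch.nn as nn

def benchmark(model, input_shape=(1, 3, 256, 512), warmup=10, iters=100):
    """Report parameter count (millions), mean single-image latency (ms), and FPS.
    CPU-only sketch; on GPU, torch.cuda.synchronize() is needed around the timers."""
    model.eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):           # warm-up passes are excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    return params_m, latency_ms, 1000.0 / latency_ms

# Placeholder network standing in for the segmentation model under test.
toy_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 19, 1))
print(benchmark(toy_model))
```

Note that end-to-end FPS can differ from the inverse of the pure model latency when data loading, pre-processing, and post-processing are included in the measurement.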
Table 6. Performance comparison with existing studies.

| Dataset | Study | Model | mIoU (%) |
|---|---|---|---|
| BDD100K | Wang et al. (2024) [13] | SegNet4D | 55.2 |
| | Chang et al. (2023) [15] | Q-YOLOP | 61.2 |
| | Chen et al. (2024) [19] | HRDLNet | 70.4 |
| | Qian et al. (2019) [21] | DLT-Net | 71.3 |
| | Wu et al. (2022) [23] | YOLOP | 72.6 |
| | Xie et al. (2021) [25] | SegFormer | 75.08 |
| | Luo et al. (2023) [16] | IDS model | 83.63 |
| | Standard U-Net model | Standard U-Net | 84.16 ± 0.18 |
| | Proposed hybrid method | AEA-GAN | 89.46 ± 0.1 |
| Cityscapes | Cheng et al. (2021) [36] | MaskFormer | 84.3 |
| | Xie et al. (2021) [25] | SegFormer | 84.0 |
| | Yuan et al. (2020) [37] | HRNet + OCR | 81.1 |
| | Lo et al. (2019) [38] | EDANet | 67.3 |
| | Fan et al. (2021) [39] | STDC-Seg75 | 75.3 |
| | Xu et al. (2023) [40] | PIDNet-S | 78.6 |
| | Hong et al. (2021) [41] | DDRNet-23 | 80.6 |
| | Proposed hybrid method | AEA-GAN | 89.02 ± 0.2 |
| KITTI | Zhu et al. (2019) [28] | VPLR | 72.8 |
| | Yan et al. (2022) [42] | 2DPASS | 72.9 |
| | Lai et al. (2023) [43] | Spherical Transformer | 74.8 |
| | Proposed hybrid method | AEA-GAN | 88.13 ± 0.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
