Smart Cities
  • Article
  • Open Access

4 January 2026

HDRSeg-UDA: Semantic Segmentation for HDR Images with Unsupervised Domain Adaptation

1 Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 106, Taiwan
2 Department of Electrical Engineering, National Chung Cheng University, Chiayi 621, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Section Artificial Intelligence and LLM Agents for Data-Driven Decisions in Smart Cities

Highlights

What are the main findings?
  • The use of HDR images with multi-exposure feature extraction for road marking semantic segmentation enables efficient pixel-wise classification of driving images under adverse weather.
  • A comprehensive dataset specifically designed for road marking segmentation is introduced, providing a valuable resource for evaluating and improving HDR-based semantic segmentation under different illumination conditions.
What are the implications of the main findings?
  • It is feasible to modify the baseline segmentation architecture to better leverage rich features of HDR images with adversarial training and self-training to enhance driving scene understanding tasks.
  • The HDR dataset serves as a benchmark for future research in semantic segmentation under various weather conditions.

Abstract

Accurate detection and localization of traffic objects are essential for autonomous driving tasks such as path planning. While semantic segmentation is able to provide pixel-level classification, existing networks often fail under challenging conditions like nighttime or rain. In this paper, we introduce a new training framework that combines unsupervised domain adaptation with high dynamic range imaging. The proposed network uses labeled daytime images along with unlabeled nighttime HDR images. By utilizing the fine details typically lost in conventional SDR images due to dynamic range compression, and incorporating the UDA training strategy, the framework effectively trains a model that is capable of semantic segmentation across adverse weather conditions. Experiments conducted on four datasets have demonstrated substantial improvements in inference performance under nighttime and rainy scenarios. The accuracy for daytime images is also enhanced through expanded training diversity.

1. Introduction

Semantic segmentation performs pixel-level classification to achieve a precise description of object locations and boundaries in an image. By creating dense and structured scene representations, it establishes a core function for advanced perception modules in autonomous systems. When integrated into advanced driver assistance systems (ADAS), semantic segmentation enables the understanding of traffic scene elements such as lane lines, road markings, and drivable areas. Among these, the role of road marking segmentation is to identify traffic regulations on the road surface, including lane dividers, arrows, crosswalks, etc., through pixel-wise classification. Precise detection of road markings enhances the stability of ADAS functions, such as lane departure warning, lane keeping, and automatic parking. Furthermore, it provides robust ground image features for accurate localization.
However, achieving semantic segmentation in autonomous driving remains challenging due to diverse real-world factors. First, the variation in traffic regulations across countries leads to inconsistent road marking patterns, limiting model generalization across domains. Second, traffic objects such as lane lines and symbols typically appear small in images, making accurate pixel-level segmentation difficult. Third, environmental conditions, including nighttime and rain, often degrade visibility and increase recognition uncertainty. On the other hand, recent advances in computation hardware, particularly GPUs, TPUs, and NPUs, enable efficient deployment of deep neural networks on edge devices, facilitating real-time perception and inference.
Semantic segmentation has incorporated deep learning since Long et al. introduced fully convolutional networks (FCNs) for end-to-end dense prediction [1]. The subsequent architectures have further enhanced feature representation and contextual reasoning, improving segmentation robustness under complex traffic and weather conditions. Despite the recent progress, most existing investigations on semantic segmentation in traffic scenes primarily target clear and daytime conditions, while performance under adverse weather still remains limited. Current networks typically rely on supervised learning. This requires large-scale, annotated datasets, which are available for daytime images but are lacking for nighttime or rainy scenes. This scarcity arises from degraded image quality and the difficulty in consistent annotation under poor lighting conditions. As a result, model generalization across diverse weather and various illumination remains inadequate.
This paper presents a segmentation framework trained with high dynamic range (HDR) images to enhance robustness under diverse illumination conditions. HDR imaging expands the exposure dynamic range, enabling more accurate representation of both bright and dark regions in the real scene. Characterized by a color depth of 10 bits or higher, HDR images effectively mitigate distortions from overexposure, underexposure, and low-light noise. To improve cross-domain generalization, unsupervised domain adaptation (UDA) is employed to transfer knowledge from a labeled source domain to an unlabeled target domain. In this study, clear daytime images serve as the source domain, while adverse weather or nighttime images represent the target. By leveraging UDA and HDR, the network achieves consistent semantic segmentation performance across varying illumination and weather conditions, reducing the dependency on extensive manual annotation.
Building on SegFormer [2], we modify the network architecture to optimize feature utilization from HDR images. The proposed technique employs UDA to transfer knowledge from labeled daytime data to unlabeled adverse weather domains, and achieves robust segmentation results across diverse scenarios. Our HDR-based Dual-Path SegFormer is trained and validated using daytime HDR datasets to establish baseline performance. Subsequently, UDA strategies including adversarial learning, self-training, and self-training-based class mixing are incorporated to enhance cross-domain adaptation. These are employed to improve the segmentation robustness under varying weather conditions while preserving class balance and stability during training. The main contributions of this paper are as follows.
  • We develop a road marking segmentation model capable of accurate pixel-wise classification on HDR images.
  • We demonstrate the feasibility of using HDR images for semantic segmentation under adverse weather.
  • The effectiveness of the ClassMix approach for training semantic segmentation models on HDR driving images is verified.
  • We establish a new HDR driving dataset for road marking segmentation benchmarking.

2. Related Work

The related work covers three topics: semantic segmentation tasks, unsupervised domain adaptation, and HDR imaging. The discussion of semantic segmentation delves into a subtask relevant to autonomous driving: semantic segmentation of road markings. In addition to studies that directly train with HDR images, we also review multi-exposure related techniques.

2.1. Semantic Segmentation

Semantic segmentation is an essential computer vision task. While conventional image classification assigns a single label to an entire image, semantic segmentation classifies every pixel in the image. This allows object boundaries to be described in fine detail, providing both the category of each object in the scene and the associated pixel regions. Most recent semantic segmentation methods are based on deep neural networks. Early approaches mainly relied on CNNs, which enhance feature extraction through convolution operations on images [3,4]. More recently, Transformer-based architectures have become the dominant research trend. Following the demonstration by Dosovitskiy et al. [5] that a Transformer can achieve superior classification performance by splitting images into patches arranged as sequences of vectors for training, its potential in vision applications, including 3D vision technologies for structural crack damage recognition robots and automation in construction, has been widely recognized. However, the inherent requirement for significant computational resources makes Vision Transformers difficult to deploy for more complex tasks. Later variants such as Swin Transformers [6,7] are still unable to achieve real-time segmentation in ADAS systems. CACDU-Net [8] proposes a novel deep learning architecture to enhance segmentation performance in a specific scenario, skin lesion segmentation in medical images, using a double U-Net structure integrated with attention mechanisms and multi-scale dilated convolution modules. RDL-YOLO [9] integrates RepViT-A, DDC, and LDConv modules to enhance local–global interaction and dynamic perception in agricultural pest and disease detection, demonstrating stronger multi-scale adaptability and generalization capability on public datasets for agricultural pest detection.
To address these limitations, Xie et al. proposed SegFormer [2], which reduces the computational cost of ViT by modifying its backbone network into the Mix Transformer and introducing several other improvements. They demonstrated superior semantic segmentation performance on multiple datasets while maintaining excellent inference speed. Hou et al. [10] proposed a knowledge distillation approach called Inter-Region Affinity Knowledge Distillation (IntRA-KD). It decomposes road scene images into different regions represented as nodes, and then constructs an inter-region affinity graph based on similarities in feature distributions among those nodes. This enables a lighter student network to learn the more complex features extracted by the teacher network more effectively. Wu et al. introduced a multiscale attention-based dilated CNN [11]. Without increasing the number of parameters, this captures broader semantic information and effectively handles the diverse sizes and shapes of road markings. Hsiao et al. [12] proposed a multi-task model incorporating both semantic segmentation and lane detection by leveraging cross-dataset and cross-task learning, using only single or task-limited datasets.

2.2. Unsupervised Domain Adaptation

Most current semantic segmentation studies focus on clear, daytime environments. Although some researchers investigate domain adaptation for more challenging conditions [13,14,15,16], these studies still predominantly rely on SDR images. Nevertheless, the same approach can be adopted for HDR-image-based model training. In UDA, self-training involves an iterative process where a source-trained Teacher Network generates pseudo-labels for the target domain, which are used to train a Student Network. Updated weights from the Student Network are transferred to the Teacher Network, enabling progressive adaptation. However, inaccurate pseudo-labels can introduce confirmation bias, making verification mechanisms necessary.
Hoyer et al. investigated SegFormer-based self-training in several works. DAFormer [13] augments SegFormer with an Atrous Spatial Pyramid Pooling (ASPP) module [17] and an Exponential Moving Average (EMA) strategy. HRDA [14] extends DAFormer with joint high-/low-resolution processing to capture fine details, while MIC [15] further incorporates a Mask Consistency Loss. SePiCo [16] applies Class-Balance Cropping to improve performance on rare categories and shows generalizability across networks such as DeepLabv2 [17] and DAFormer. Although these models perform well on benchmarks such as Cityscapes → Dark Zurich, many of them rely on computationally heavy backbones (e.g., MiT-B5), limiting their practicality for real-time ADAS deployment.
Adversarial training, also known as domain alignment, uses a loss function to encourage a model to learn domain-invariant features. This is typically achieved using a Gradient Reversal Layer (GRL) and a Domain Discriminator. The model attempts to produce similar feature distributions across domains, while the Discriminator tries to identify the domain origin of each feature map, which forms the Min–Max optimization process reminiscent of GANs [18]. Vu et al. [19] introduced a Domain Discriminator combined with a minimum-entropy objective to mitigate low-confidence predictions. Wang et al. [20] advanced this idea through pixel-level domain discrimination, referred to as Fine-Grained Adversarial Learning. Recently, Cai et al. [21] applied SegFormer with unsupervised domain adaptation. By incorporating rare-class sampling, they present a road-marking semantic segmentation model on the multi-weather RLMD-AC dataset [22]. Their model demonstrated strong generalization across diverse weather conditions.
ClassMix is a data augmentation technique that crops and pastes object classes between images to increase data diversity and mitigate class imbalance. Introduced by Olsson et al. [23], it has been widely employed in self-training to enhance target-domain variability. Later work, such as DACS [24] and HRDA [14], extended ClassMix into Cross-Domain Mixed Sampling, further enriching data diversity by applying it to both source and target domains. In this research, we employ the core idea of adversarial training and draw inspiration from Fine-Grained Adversarial Learning. The pixel-wise domain discrimination is performed with our modified SegFormer framework.

2.3. High Dynamic Range Image

With advances in imaging technology, high dynamic range images have become much easier to obtain. By improving the image quality directly from the sensing devices, it is possible to enhance the feature extraction capabilities of deep neural networks. In ADAS-related work, many studies utilize the characteristics of HDR images, specifically their ability to preserve details in both high- and low-brightness regions, for model training. Wang et al. [25] developed an HDR-based deep learning model and training framework capable of detecting vehicle brake lights more accurately. The HDR-based approach is also adopted for traffic light recognition [26]. Kocdemir et al. [27] used HDR images to address the instability issue in conventional images, solving the recognition failures that occur in traffic scenes with strong backlighting or overexposed regions.
As HDR imagery grows in popularity in traffic object detection, researchers are beginning to explore its potential for traffic-scene semantic segmentation. Weiher employed HDR images captured in virtual environments, converted them to a realistic style, and used these to train a semantic segmentation model, demonstrating the feasibility of using synthetic HDR images for the task [28]. Huang et al. [29] used HDR images from the Cityscapes dataset to simulate multiple exposures, generating several SDR images for feature extraction and fusion. The resulting model shows strong improvements for several classes in Cityscapes. Beyond using HDR to improve feature extraction, some studies have proposed adjusting image exposure to obtain richer features from a single image. Singh et al. [30] introduced an approach that simulates multi-exposure adjustments on a single standard image. This enables the model to extract features across multiple exposure levels, improving object detection performance in dark regions.
Onzon et al. proposed a different perspective and approach compared to traditional methods [31]. Conventional methods typically fuse multiple exposure images into a single HDR image using an image signal processor (ISP) before performing object recognition. Since this fusion is usually optimized for human visual perception, it can lead to the loss of critical data for deep neural networks. To address this, Onzon et al. extract features from each exposure image separately and fuse them using the “local cross-attention fusion” method before feeding the result into the detection head for end-to-end object recognition. This method eliminates the need to synthesize a single HDR image and instead directly fuses semantic features from images with different exposures.

3. Method

To deal with semantic segmentation under adverse weather conditions with limited numbers of annotated images, this paper presents a UDA-based, end-to-end framework. The proposed network aims to mitigate domain shift between labeled daytime data and unlabeled nighttime and rainy data. More specifically, our architecture incorporates the multi-exposure feature extraction, self-training, and adversarial learning to enhance its robustness and generalization across the illumination changes and weather conditions without the additional annotation on target-domain data. As illustrated in Figure 1, the network model comprises multiple parallel operating modules to optimize domain adaptation and segmentation performance.
Figure 1. This architecture uses High Dynamic Range (HDR) imaging with multi-exposure feature extraction to reduce information loss in bad weather. The framework minimizes differences between clear daytime and challenging conditions (nighttime/rainy) by adjusting brightness and employing concurrent self-training and adversarial training in an end-to-end manner.

3.1. Multi-Exposure Feature Extraction for HDR Images

The proposed semantic segmentation network extends SegFormer to improve feature extraction robustness under varying illumination conditions. Multiple SegFormer encoders are employed in parallel to extract the multi-exposure representations from the same input image across different exposure variants. This process produces multiple sets of feature maps that are then integrated into a feature fusion sub-module designed for cross-exposure aggregation and channel compression to reduce the computation while preserving discriminative information. As shown in Figure 2, the input undergoes a series of exposure adjustments, including Gamma correction and log transform, to generate diverse exposure images. In the experiments of this paper, the Dual-Encoder uses images transformed by exponential and logarithmic methods as input, while the Triple-Encoder adds Gamma Correction. These are then processed independently by the corresponding encoders to extract multi-exposure, multi-scale feature maps for downstream fusion and segmentation.
Figure 2. Multi-exposure feature extraction. In this approach, the same image undergoes multiple image processing operations and is then fed into different encoders for feature extraction. The extracted features are then compressed and fused through a feature fusion submodule, and finally passed to a decoder for classification.
To extract both global and local image characteristics, the encoders generate multiple feature maps of varying dimensions. To efficiently merge these maps, especially identically sized maps from different encoders, both sets are fed into a feature fusion sub-module (see Figure 3). In this sub-module, feature maps with matching spatial dimensions are first concatenated and then passed through separate multi-layer perceptrons for channel compression. This step enables effective feature blending and alignment of channel dimensions across scales. The smaller feature maps are then upsampled to match the largest map, and all maps are concatenated before being forwarded to the decoder.
Figure 3. Structure of the feature fusion module. This module performs feature fusion on feature maps originating from different encoders. After fusion, the unified features are appropriately scaled and then fed into the decoder.
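To make the fusion step concrete, the sketch below shows one possible PyTorch realization of the described sub-module. It is a minimal sketch under assumptions: the channel widths follow MiT-B0 (32, 64, 160, 256), and each per-scale "MLP" is a 1 × 1 convolution that compresses the concatenated channels back to the original width; the exact layers used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuses same-scale feature maps coming from two parallel encoders.

    Assumed channel widths follow MiT-B0; the per-scale MLP compresses the
    concatenated 2C channels back to C before all scales are merged for the
    decoder, as described in the text.
    """

    def __init__(self, channels=(32, 64, 160, 256)):
        super().__init__()
        # One 1x1 "MLP" per scale: 2C -> C channel compression after concat.
        self.compress = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels
        )

    def forward(self, feats_a, feats_b):
        # feats_a / feats_b: lists of four feature maps from the two encoders,
        # ordered from the largest (1/4 resolution) to the smallest (1/32).
        fused = [
            mlp(torch.cat([fa, fb], dim=1))
            for mlp, fa, fb in zip(self.compress, feats_a, feats_b)
        ]
        # Upsample every scale to the spatial size of the largest map and
        # concatenate along channels before handing off to the decoder.
        target = fused[0].shape[2:]
        fused = [F.interpolate(f, size=target, mode="bilinear",
                               align_corners=False) for f in fused]
        return torch.cat(fused, dim=1)
```

The 2C → C compression here also illustrates the doubled input dimension of the first fusion MLP discussed later in Section 4.2.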

3.2. Source Domain Supervised Training

This module is trained using source domain data that has been extensively annotated for semantic segmentation. In the source domain, the brightness distributions of conventional images are generally less extreme, and local brightness variations are relatively simple compared to those in the target domain. Furthermore, the overall image quality remains consistently high. As a result, this module utilizes conventional images for transfer learning, consistent with common pre-training practices.
The proposed HDRSeg-UDA model is based on SegFormer. To enable the network to extract multi-exposure features in the target domain and enhance its feature extraction capability for the images with varying brightness distributions, we incorporate two SegFormer encoders. These encoders extract features from two images separately with different brightness, and then generate two sets of feature maps to feed into the fusion sub-module for integration. The training of this module utilizes a pixel-wise cross-entropy loss function given by
$$\mathcal{L}_S^{(i)} = -\sum_{m=1}^{W}\sum_{n=1}^{H}\sum_{c=1}^{C} y_S^{(i,m,n,c)} \log \tilde{y}_S^{(i,m,n,c)} = -\sum_{m=1}^{W}\sum_{n=1}^{H}\sum_{c=1}^{C} y_S^{(i,m,n,c)} \log \left[ \mathrm{Dec}_\phi\left(\mathrm{Enc}_\phi\left(x_S^{(i)}\right)\right) \right]^{(m,n,c)}$$
where $x_S^{(i)}$ represents the $i$-th image from the source domain, $\mathrm{Enc}_\phi$ and $\mathrm{Dec}_\phi$ denote the encoder and decoder of SegFormer, respectively, and $\phi$ signifies the weights trained using the source domain data. $\tilde{y}_S^{(i)}$ represents the predicted segmentation map inferred by the model, while $y_S^{(i)}$ denotes the ground truth. Finally, $H$ and $W$ represent the height and width of the image, respectively, and $C$ denotes the number of classes.
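For reference, a minimal PyTorch sketch of this supervised objective is given below; it is the standard pixel-wise cross-entropy applied to the fused dual-encoder prediction, with illustrative function and variable names rather than those of the actual implementation.

```python
import torch.nn.functional as F

def source_domain_loss(logits, labels):
    """Pixel-wise cross-entropy on labeled source-domain (daytime) images.

    logits: (B, C, H, W) decoder output after dual-encoder feature fusion.
    labels: (B, H, W) ground-truth class indices.
    The triple summation in the equation above is realized by PyTorch's
    cross_entropy, here averaged over pixels rather than summed.
    """
    return F.cross_entropy(logits, labels)
```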
To enable both encoders to extract distinct image features, the same image undergoes preprocessing to yield two versions with different brightness distributions. This requires a nonlinear adjustment of the image’s brightness values, which ensures each version emphasizes different luminance characteristics. Thus, we employ exponential and logarithmic functions in this work for nonlinear contrast stretching. The image intensity value subject to exponential contrast stretching accentuates the bright region, while the intensity value processed with a logarithmic function highlights darker feature variations. The computations for the exponential and logarithmic functions are given by
$$x_{\mathrm{exp}} = x^{\alpha}, \qquad x_{\mathrm{log}} = \frac{\log(1+\beta x)}{\max\left(\log(1+\beta x)\right)}$$
where $\alpha$ and $\beta$ are parameters that control the exponential and logarithmic function curves.
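A minimal NumPy sketch of these two contrast-stretching operations is shown below. It assumes the HDR input has been normalized to [0, 1]; the default α and β values are placeholders, not the settings used in our experiments.

```python
import numpy as np

def exposure_variants(x, alpha=2.0, beta=50.0):
    """Generate the two exposure-adjusted inputs for the dual encoders.

    x: HDR image normalized to [0, 1], shape (H, W, 3).
    x ** alpha performs the exponential contrast stretching that accentuates
    bright-region detail, while the normalized logarithm highlights darker
    feature variations, following the equations above.
    """
    x = np.clip(x, 0.0, 1.0)
    x_exp = np.power(x, alpha)
    x_log = np.log1p(beta * x) / np.max(np.log1p(beta * x))
    return x_exp, x_log
```

In the Triple-Encoder configuration mentioned in Section 3.1, a third variant produced by Gamma correction would be generated in the same manner.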

3.3. Target Domain Unsupervised Training

The proposed technique utilizes unlabeled HDR images for training. Since direct image annotations are not available, we employ a separate network model with an identical structure to perform inference on the target domain images and generate pseudo-labels. The pseudo-labels serve as supervision for the original network during its training on the target domain (see Figure 1). Hereafter, the network responsible for producing pseudo-labels in the target domain and the network undergoing domain-specific training are referred to as the Teacher and Student Networks, respectively. The generation of pseudo-labels is given by
$$\hat{y}_T^{(j,m,n,c)} = \begin{cases} 1, & c = \arg\max_{c'} \left[\mathrm{Dec}_\theta\left(\mathrm{Enc}_\theta\left(x_T^{(j)}\right)\right)\right]^{(m,n,c')} \\ 0, & c \neq \arg\max_{c'} \left[\mathrm{Dec}_\theta\left(\mathrm{Enc}_\theta\left(x_T^{(j)}\right)\right)\right]^{(m,n,c')} \end{cases}$$
where $x_T^{(j)}$ denotes the $j$-th image from the target domain, $\mathrm{Enc}_\theta$ and $\mathrm{Dec}_\theta$ represent the SegFormer encoder and decoder with the weights $\theta$ of the Teacher Network, and $\hat{y}_T$ denotes the pseudo-labels generated by the Teacher Network.
Since the Teacher Network’s inference capability on target domain images is limited, the pseudo-labels cannot completely substitute for ground truth annotations. Consequently, it is crucial to apply additional constraints to prevent confirmation bias; otherwise, the model would update its weights based on erroneous pseudo-labels. In this paper, we introduce a confidence threshold $\epsilon$ (set to 0.968 in our experiments) when computing the pixel-wise cross-entropy loss between the Student Network’s predicted map and the Teacher Network’s pseudo-labels. The threshold effectively acts as a mask for the pseudo-labels:
$$M^{(j,m,n)} = \begin{cases} 1, & \max_{c} \left[\mathrm{Dec}_\theta\left(\mathrm{Enc}_\theta\left(x_T^{(j)}\right)\right)\right]^{(m,n,c)} \geq \epsilon \\ 0, & \max_{c} \left[\mathrm{Dec}_\theta\left(\mathrm{Enc}_\theta\left(x_T^{(j)}\right)\right)\right]^{(m,n,c)} < \epsilon \end{cases}$$
If no class at a given pixel in the predicted map meets the confidence criterion, the loss at that pixel is excluded from the computation.
Finally, the loss function in the target domain is calculated by
$$\mathcal{L}_T^{(j)} = -\sum_{m=1}^{W}\sum_{n=1}^{H}\sum_{c=1}^{C} \hat{y}_T^{(j,m,n,c)} \log \tilde{y}_T^{(j,m,n,c)}, \qquad \log \tilde{y}_T^{(j,m,n,c)} = M^{(j,m,n)} \log \left[\mathrm{Dec}_\phi\left(\mathrm{Enc}_\phi\left(x_T^{(j)}\right)\right)\right]^{(m,n,c)}$$
where $x_T^{(j)}$ represents the $j$-th image from the target domain, and $\mathrm{Enc}_\phi$ and $\mathrm{Dec}_\phi$ denote the Student Network encoder and decoder with weights $\phi$. $\tilde{y}_T^{(j)}$ is the segmentation map predicted by the Student Network, while the supervision signal is derived from the pseudo-label $\hat{y}_T^{(j)}$ generated by the Teacher Network (with weights $\theta$) and the confidence threshold mask $M^{(j)}$.
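The following PyTorch sketch illustrates how pseudo-label generation, the confidence mask, and the masked cross-entropy in the equations above could be combined. Tensor shapes and the 0.968 threshold follow the paper, but the function and variable names are illustrative, not the actual implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher_logits, eps=0.968):
    """Teacher predictions -> hard pseudo-labels plus a confidence mask.

    teacher_logits: (B, C, H, W) raw scores from the Teacher Network.
    Pixels whose maximum class probability falls below eps are masked out
    so they do not contribute to the target-domain loss.
    """
    probs = torch.softmax(teacher_logits, dim=1)
    conf, pseudo = probs.max(dim=1)          # (B, H, W) each
    mask = (conf >= eps).float()
    return pseudo, mask

def target_domain_loss(student_logits, pseudo, mask):
    """Pixel-wise cross-entropy against pseudo-labels, gated by the mask."""
    ce = F.cross_entropy(student_logits, pseudo, reduction="none")  # (B, H, W)
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)
```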
To train a model with robust inference capabilities across both domains, it is crucial to share the weights $\phi$ of the Student Network with the Teacher Network. This ensures that the Teacher Network can derive more accurate and stable pseudo-labels. In self-training research, there are two common methods to update the weights $\theta$ of a Teacher Network. The classic approach is to replicate the weights of the Student Network directly to the Teacher Network, which makes the Student Network the Teacher Network for the next iteration. The other approach is to allow the weights of the Teacher Network to gradually converge towards those of the Student Network. In the latter method, the weights of both the Teacher Network and the Student Network are combined through a weighted sum to update the Teacher Network for the next iteration. We use EMA to update the entire Teacher Network (including encoders, decoders, etc.), as shown in the following formula:
$$\theta_{\mathrm{iter}} \leftarrow \alpha\, \theta_{\mathrm{iter}-1} + (1-\alpha)\, \phi_{\mathrm{iter}-1}$$
where $\alpha$ represents the exponential decay weight, and $\mathrm{iter}$ denotes the iteration count.
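A minimal sketch of this EMA update, applied to every Teacher Network parameter and buffer, is given below; the decay value of 0.999 and the variable names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta <- alpha * theta + (1 - alpha) * phi for every parameter.

    Buffers (e.g., normalization running statistics) are copied as well, so
    the whole Teacher Network, encoders and decoder included, tracks the
    Student Network as described in the text.
    """
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
    for t_b, s_b in zip(teacher.buffers(), student.buffers()):
        t_b.copy_(s_b)
```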

3.4. Domain Discrimination Training

To facilitate more effective domain adaptation, beyond the gradual alignment achieved through self-training, we utilize adversarial learning to fine-tune the model weights. Building upon the supervised learning framework, we develop a pixel-wise binary classification decoder by incorporating a domain discriminator, as depicted in Figure 4. The domain discriminator infers from which domain a given feature map, derived by the feature fusion module, originates. However, an ideal cross-domain semantic segmentation encoder should produce feature maps that are indistinguishable, regardless of their source domain. Since our domain discriminator is implemented as a binary classification decoder, the pixel-wise binary cross-entropy loss
$$\mathcal{L}_{\mathrm{Adv}}^{(i,j)} = -\log\left(1-\tilde{d}_S^{(i)}\right) - \log \tilde{d}_T^{(j)} = -\log\left(1 - \mathrm{Dis}\left(\mathrm{Enc}\left(x_S^{(i)}\right)\right)\right) - \log\left(\mathrm{Dis}\left(\mathrm{Enc}\left(x_T^{(j)}\right)\right)\right)$$
is employed, where $\mathrm{Dis}$ is the domain discriminator, and $\tilde{d}_S^{(i)}$ and $\tilde{d}_T^{(j)}$ denote the discrimination maps generated by the domain discriminator after feature extraction from the $i$-th source domain image and the $j$-th target domain image, respectively. In this loss, the source domain is labeled as zero, and the target domain is labeled as one.
Figure 4. The structure of the domain discriminator. This architecture incorporates a domain discriminator. Through an adversarial process between the semantic segmentation encoder and the domain discriminator, the encoder’s feature extraction capabilities for both source and target domain images are compelled to converge, achieving a more consistent representation across domains.
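As an illustration of this objective, the sketch below pairs a small pixel-wise domain discriminator with the binary cross-entropy loss above (source labeled 0, target labeled 1). The discriminator architecture shown is a placeholder, not the one used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDomainDiscriminator(nn.Module):
    """Maps fused feature maps to a per-pixel domain score in (0, 1)."""

    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.net(feats))   # (B, 1, H, W)

def adversarial_loss(d_source, d_target):
    """-log(1 - d_S) - log(d_T), averaged over pixels (source = 0, target = 1)."""
    loss_s = F.binary_cross_entropy(d_source, torch.zeros_like(d_source))
    loss_t = F.binary_cross_entropy(d_target, torch.ones_like(d_target))
    return loss_s + loss_t
```

In the adversarial setup described above, the encoder would be updated with reversed gradients from this loss (for example, through a Gradient Reversal Layer) so that it learns to produce domain-indistinguishable features.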
Finally, by combining the self-training and adversarial training approaches as illustrated in Figure 1, the loss function for the domain discriminator is written as
$$\mathcal{L}_{\mathrm{Adv}}^{(i,j)} = -\log\left(1-\tilde{d}_S^{(i)}\right) - \log \tilde{d}_T^{(j)} = -\log\left(1 - \mathrm{Dis}\left(\mathrm{Enc}_\phi\left(x_S^{(i)}\right)\right)\right) - \log\left(\mathrm{Dis}\left(\mathrm{Enc}_\phi\left(x_T^{(j)}\right)\right)\right)$$

4. Experimental Section

4.1. Datasets

We employed two publicly available semantic segmentation datasets, Cityscapes [32] and BDD100K [33], and three public road marking datasets: CeyMo [34], VPGNet [35], and RLMD [22] with its extension RLMD-AC [21]. A major challenge is the lack of HDR images, which are essential for the proposed method, as most datasets (except Cityscapes in clear daytime) provide only 8-bit SDR images. To address this issue, we adopt SingleHDR [36] to reconstruct 32-bit HDR images from single SDR inputs. The resulting images are then stored with 24-bit color depth.

4.1.1. Cityscapes

This is a large-scale urban dataset for semantic segmentation that also provides 16-bit native HDR images (only for clear daytime scenes, and only daytime images are used). The images in this dataset are used as the daytime training and testing set.

4.1.2. BDD100K

This is a large-scale autonomous driving dataset covering diverse scenes and weather (daytime, nighttime, rainy). Its subset, BDD10K, provides annotations for 19 semantic classes, with images manually classified into different weather conditions. All rainy and nighttime images are designated for training and testing, while daytime images are combined with Cityscapes for training. BDD100K images without annotation are used for augmentation via UDA after being converted to HDR format.

4.1.3. CeyMo

This is a road marking semantic segmentation dataset collected in Sri Lanka, covering urban, suburban, and rural roads in daytime, nighttime, and rainy conditions. It contains 11 symbolic road marking annotations.

4.1.4. VPGNet

This is a dataset for lane line and road marking perception, collected in South Korea. It provides annotations for 18 types of markings with four weather conditions consolidated into three (combining rainy and heavy rain).

4.1.5. RLMD-AC

This is a dataset collected in Taiwan, providing annotated training sets for daytime, and testing sets for various weather conditions, along with unlabeled nighttime and rainy training images. Additional nighttime HDR images are added to the RLMD-AC nighttime training and testing sets to better evaluate the generalization of our method on native HDR data.

4.1.6. Evaluation Metric

In all experiments conducted across the datasets, we employ mean Intersection over Union (mIoU) as the main evaluation metric. It is important to note that some classes in the BDD10K, CeyMo, and VPGNet datasets are not available in their nighttime or rainy subsets. For these cases, the calculation of mIoU directly excludes these unrepresented classes.
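A minimal sketch of this metric computation is shown below; excluding classes that never appear in the ground truth of the evaluated subset matches the handling described above, while the confusion-matrix formulation and variable names are our own.

```python
import numpy as np

def mean_iou(confusion, ignore_absent=True):
    """mIoU from a (C, C) confusion matrix (rows: ground truth, cols: prediction).

    Classes with no ground-truth pixels in the evaluated subset (e.g., classes
    missing from a nighttime or rainy split) are excluded from the mean.
    """
    tp = np.diag(confusion).astype(np.float64)
    gt = confusion.sum(axis=1).astype(np.float64)      # ground-truth pixels per class
    pred = confusion.sum(axis=0).astype(np.float64)    # predicted pixels per class
    union = gt + pred - tp
    present = gt > 0 if ignore_absent else np.ones_like(gt, dtype=bool)
    iou = tp[present] / np.maximum(union[present], 1.0)
    return float(iou.mean())
```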

4.2. Implementation

This work modifies the SegFormer architecture as its foundational framework. We employ two encoders with a network structure identical to MiT-B0. This dual-encoder setting allows the model to receive two input images derived from the same source but subjected to different preprocessing. In the feature fusion module (see Figure 3), our implementation modifies the input dimension of the first multi-layer perceptron within the decoder. This modification enables the processing of feature maps with twice the original number of channels. Subsequently, the number of channels in these feature maps is compressed back to the original count. Finally, the resulting model has approximately 1.93 and 0.52 times the number of parameters of SegFormer-B0 and SegFormer-B1, respectively.
We incorporate ClassMix, as discussed in Section 2.2, into our data augmentation strategy. Furthermore, the approach by Tranheden et al. [24], which extends ClassMix beyond mixing exclusively within the target domain, is also adopted. However, we observe that using low-confidence pseudo-labels generated during the early training stages with ClassMix often results in incomplete class mixing. To address this issue, our ClassMix strategy is modified to focus on mixing data from the source domain during the initial training phase. As model training stabilizes, we then shift to predominantly mixing data in the target domain.
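The sketch below shows the core ClassMix operation as described above: half of the classes present in one image are selected and the corresponding pixels, together with their labels, are pasted onto a second image. Whether the donor image comes from the source or the target domain is switched over the course of training; the names here are illustrative.

```python
import torch

def classmix(img_a, lbl_a, img_b, lbl_b):
    """Paste a random half of the classes present in image A onto image B.

    img_*: (3, H, W) tensors; lbl_*: (H, W) integer label maps
    (for the target domain, lbl_a would be the teacher's pseudo-labels).
    """
    classes = torch.unique(lbl_a)
    k = max(1, classes.numel() // 2)
    chosen = classes[torch.randperm(classes.numel())[:k]]
    # Binary mask of the pixels belonging to the chosen classes.
    mask = torch.isin(lbl_a, chosen)
    mixed_img = torch.where(mask.unsqueeze(0), img_a, img_b)
    mixed_lbl = torch.where(mask, lbl_a, lbl_b)
    return mixed_img, mixed_lbl
```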

4.3. Results

The experimental results and comparison with state-of-the-art techniques are shown in Table 1 and Figure 5. Compared to the baseline method (SegFormer [2], trained only on daytime datasets) and MIC [15], HDRSeg-UDA achieves the best mIoU except for the nighttime images in the Cityscapes and BDD100K datasets. The consistent improvements on domain changes from clear to night, rainy, and mixed scenarios validate the effectiveness of our HDR-based UDA framework for multi-weather semantic segmentation. Furthermore, the magnitude of the performance gain directly reflects the difficulty of each dataset, including factors such as the number of categories, the geometric complexity of road markings, and illumination conditions. In the BDD100K nighttime experiment, our method had a slightly lower mIoU than MIC. This is presumably because the nighttime training set is very large (32,711 unlabeled images), biasing the model updates toward the nighttime domain. We use a domain discriminator to forcibly eliminate the differences between weather conditions, which better balances the aggressiveness of updates across domains. An important finding is the substantial improvement in inference observed under adverse weather, which suppresses the typical degradation of model accuracy in such challenging conditions.
Table 1. The comparison of our HDRSeg-UDA with state-of-the-art methods. The performance is reported as mIoU in %.
Figure 5. The results for the target domain on the RLMD-AC dataset.
This successful knowledge transfer is attributed to the ability of the UDA framework to adapt semantic understanding from labeled daytime (source) images to unlabeled adverse-weather (target) domains. Moreover, the accuracy also increases on daytime datasets. This demonstrates that our proposed multi-exposure feature extraction module (i.e., Dual-Path SegFormer), designed to exploit HDR inputs, yields more robust and stable features even under normal illumination. The experiments on multiple datasets with adverse weather, including Cityscapes, BDD100K, CeyMo, VPGNet, and RLMD-AC, show that our integrated training approach leveraging self-training and adversarial training mechanisms yields a highly generalizable semantic segmentation model.

4.4. Ablation Study

To assess the contribution of each component of HDRSeg-UDA, ablation experiments are conducted on the RLMD-AC dataset to evaluate the Dual-Encoder, Triple-Encoder, Self-Training, Discriminator, and ClassMix modules, as tabulated in Table 2. Note that the Triple-Encoder model has an excessive number of parameters, making it unsuitable for autonomous driving systems; it is therefore not adopted. We further examine the impact of HDR imagery by comparing model performance when trained and tested with SDR versus HDR images. The evaluation tabulated in Table 3 for the Cityscapes and RLMD-AC datasets shows the effectiveness of each key component, including the UDA techniques (self-training and adversarial training) and the use of high-bit-depth HDR data, and highlights the overall robustness of the integrated approach across multiple image datasets.
Table 2. Contribution of each submethod in the RLMD-AC CLEAR → NIGHT task. The performance is reported as mIoU in %.
Table 3. Contribution of different bit depths in our architecture.

5. Conclusions

This paper presents a new training framework that enhances semantic segmentation for autonomous driving, particularly in challenging nighttime and rainy conditions. By incorporating HDR imagery with UDA techniques (self-training and adversarial training), our proposed HDRSeg-UDA leverages labeled daytime images and unlabeled nighttime HDR data effectively. The integration of the HDR feature representation with UDA generalization capabilities results in a robust model exhibiting excellent road marking segmentation under different scenarios. Experiments conducted on four adverse weather datasets have demonstrated substantial improvements in inference accuracy on both nighttime and rainy weather conditions, and also yield a notable boost in accuracy on daytime scenes.

Author Contributions

Conceptualization, H.-Y.L.; methodology, H.-Y.L. and M.-Y.C.; software, M.-Y.C.; validation, M.-Y.C.; formal analysis, M.-Y.C.; investigation, H.-Y.L.; resources, H.-Y.L.; data curation, M.-Y.C.; writing—original draft preparation, H.-Y.L.; writing—review and editing, H.-Y.L.; visualization, M.-Y.C.; supervision, H.-Y.L.; project administration, H.-Y.L.; funding acquisition, H.-Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Science and Technology Council of Taiwan under Grant 109-2221-E-194-037-MY3.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The code and dataset presented in the study are openly available at https://github.com/ZackChen1140/RMSeg-HDR, accessed on 23 November 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TLD: Traffic Light Detection
C2I: Car-to-Infrastructure
CNN: Convolutional Neural Network
SOTA: State-Of-The-Art
SimAM: Simple Attention Module
ECA: Efficient Channel Attention Mechanism
CIoU: Complete Intersection over Union
EIoU: Efficient Intersection over Union
HSM: Hard Sample Mining
GFLOPs: Giga Floating-point Operations Per Second

References

  1. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  2. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  3. Gidaris, S.; Komodakis, N. Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1134–1142. [Google Scholar]
  4. Tokunaga, H.; Teramoto, Y.; Yoshizawa, A.; Bise, R. Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12597–12606. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  6. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  7. Wang, J.; Liao, X.; Wang, Y.; Zeng, X.; Ren, X.; Yue, H.; Qu, W. M-SKSNet: Multi-scale spatial kernel selection for image segmentation of damaged road markings. Remote Sens. 2024, 16, 1476. [Google Scholar] [CrossRef]
  8. Hao, S.; Wu, H.; Du, C.; Zeng, X.; Ji, Z.; Zhang, X.; Ganchev, I. Cacdu-net: A novel doubleu-net based semantic segmentation model for skin lesions detection in images. IEEE Access 2023, 11, 82449–82463. [Google Scholar] [CrossRef]
  9. Zhang, X.; Li, L.; Bian, Z.; Dai, C.; Ji, Z.; Liu, J. RDL-YOLO: A Method for the Detection of Leaf Pests and Diseases in Cotton Based on YOLOv11. Agronomy 2025, 15, 1989. [Google Scholar] [CrossRef]
  10. Hou, Y.; Ma, Z.; Liu, C.; Hui, T.W.; Loy, C.C. Inter-region affinity distillation for road marking segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12486–12495. [Google Scholar]
  11. Wu, J.; Liu, W.; Maruyama, Y. Automated road-marking segmentation via a multiscale attention-based dilated convolutional neural network using the road marking dataset. Remote Sens. 2022, 14, 4508. [Google Scholar] [CrossRef]
  12. Hsiao, H.C.; Cai, Y.C.; Lin, H.Y.; Chiu, W.C.; Chan, C.T.; Wang, C.C. FuseRoad: Enhancing Lane Shape Prediction Through Semantic Knowledge Integration and Cross-Dataset Training. In Proceedings of the 2025 IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania, 22–25 June 2025; pp. 897–902. [Google Scholar]
  13. Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar]
  14. Hoyer, L.; Dai, D.; Van Gool, L. Hrda: Context-aware high-resolution domain-adaptive semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 372–391. [Google Scholar]
  15. Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked image consistency for context-enhanced domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11721–11732. [Google Scholar]
  16. Xie, B.; Li, S.; Li, M.; Liu, C.H.; Huang, G.; Wang, G. Sepico: Semantic-guided pixel contrast for domain adaptive semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9004–9021. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  18. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  19. Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2517–2526. [Google Scholar]
  20. Wang, H.; Shen, T.; Zhang, W.; Duan, L.Y.; Mei, T. Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 642–659. [Google Scholar]
  21. Cai, Y.C.; Hsiao, H.C.; Chiu, W.C.; Lin, H.Y.; Chan, C.T. RMSeg-UDA: Unsupervised Domain Adaptation for Road Marking Segmentation Under Adverse Conditions. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 13471–13477. [Google Scholar]
  22. Hsiao, H.C.; Cai, Y.C.; Lin, H.Y.; Chiu, W.C.; Chan, C.T. RLMD: A Dataset for Road Marking Segmentation. In Proceedings of the 2023 International Conference on Consumer Electronics-Taiwan (ICCE-Taiwan), PingTung, Taiwan, 17–19 July 2023; pp. 427–428. [Google Scholar]
  23. Olsson, V.; Tranheden, W.; Pinto, J.; Svensson, L. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1369–1378. [Google Scholar]
  24. Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1379–1389. [Google Scholar]
  25. Wang, J.G.; Zhou, L.; Song, Z.; Yuan, M. Real-time vehicle signal lights recognition with HDR camera. In Proceedings of the 2016 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Chengdu, China, 15–18 December 2016; pp. 355–358. [Google Scholar]
  26. Wang, J.G.; Zhou, L.B. Traffic light recognition with high dynamic range imaging and deep learning. IEEE Trans. Intell. Transp. Syst. 2018, 20, 1341–1352. [Google Scholar] [CrossRef]
  27. Kocdemir, I.H.; Akyuz, A.O.; Koz, A.; Chalmers, A.; Alatan, A.; Kalkan, S. Object detection for autonomous driving: High-dynamic range vs. low-dynamic range images. In Proceedings of the 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, 26–28 September 2022; pp. 1–5. [Google Scholar]
  28. Weiher, M. Domain Adaptation of HDR Training Data for Semantic Road Scene Segmentation by Deep Learning. 2019. Available online: https://mediatum.ub.tum.de/1525857 (accessed on 23 November 2025).
  29. Huang, T.; Song, S.; Liu, Q.; He, W.; Zhu, Q.; Hu, H. A novel multi-exposure fusion approach for enhancing visual semantic segmentation of autonomous driving. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2023, 237, 1652–1667. [Google Scholar] [CrossRef]
  30. Singh, K.; Parihar, A.S. MRN-LOD: Multi-exposure refinement network for low-light object detection. J. Vis. Commun. Image Represent. 2024, 99, 104079. [Google Scholar] [CrossRef]
  31. Onzon, E.; Bömer, M.; Mannan, F.; Heide, F. Neural exposure fusion for high-dynamic range object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17564–17573. [Google Scholar]
  32. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  33. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar]
  34. Jayasinghe, O.; Hemachandra, S.; Anhettigama, D.; Kariyawasam, S.; Rodrigo, R.; Jayasekara, P. Ceymo: See more on roads-a novel benchmark dataset for road marking detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3104–3113. [Google Scholar]
  35. Lee, S.; Kim, J.; Shin Yoon, J.; Shin, S.; Bailo, O.; Kim, N.; Lee, T.H.; Seok Hong, H.; Han, S.H.; So Kweon, I. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1947–1955. [Google Scholar]
  36. Liu, Y.L.; Lai, W.S.; Chen, Y.S.; Kao, Y.L.; Yang, M.H.; Chuang, Y.Y.; Huang, J.B. Single-image HDR reconstruction by learning to reverse the camera pipeline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1651–1660. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
