1. Introduction
High dynamic range (HDR) imaging has gained significant attention in recent years due to its ability to present images with enhanced realism compared to standard dynamic range (SDR) imaging. HDR enables a greater dynamic range of luminance, accurately representing brightness variations from the most intense highlights to the darkest shadows. The increasing maturity of display technology and consumer demand for high-quality visuals have further driven the adoption of HDR in various camera-equipped devices.
HDR imaging typically involves three primary stages: (1) image acquisition, (2) tone mapping, and (3) HDR display. Each stage has been extensively studied in separate research communities. However, many consumer-grade cameras still lack native HDR support, resulting in the widespread capture of low dynamic range (LDR) images. LDR images typically have an 8-bit color depth per channel, whereas HDR formats generally use 10 or more bits per channel to better preserve luminance information. While LDR images use 24 bits per pixel (8 bits per RGB channel), HDR images often incorporate an additional 8-bit exposure channel (E-channel) for luminance storage, yielding a 32-bit pixel representation [1].
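The 32-bit layout described above can be illustrated with a minimal decoding sketch. It assumes the Radiance RGBE convention (three 8-bit mantissas sharing one 8-bit exponent biased by 128); the text does not name a specific file format, so this convention and the function names are assumptions made only for illustration.

```python
import math

def rgbe_to_float(r, g, b, e):
    """Decode one 32-bit RGBE pixel into linear floating-point RGB.

    Assumes the Radiance RGBE convention: the shared exponent is biased by
    128, and the 8-bit mantissas are divided by 256 (hence 128 + 8).
    """
    if e == 0:
        # An exponent byte of zero encodes black.
        return (0.0, 0.0, 0.0)
    scale = math.ldexp(1.0, e - (128 + 8))  # 2 ** (e - 136)
    return (r * scale, g * scale, b * scale)

def ldr_to_float(r, g, b):
    """Decode one 24-bit LDR pixel; values are confined to [0, 1]."""
    return (r / 255.0, g / 255.0, b / 255.0)

# The shared exponent lets an RGBE pixel exceed the [0, 1] range of LDR.
bright = rgbe_to_float(128, 64, 32, 137)
white_ldr = ldr_to_float(255, 255, 255)
```

The point of the sketch is only that the fourth byte scales all three channels together, which is how a 32-bit pixel can cover a much wider luminance range than a 24-bit one.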
To address this limitation, various methods have been proposed to convert LDR images into HDR representations [2]. These methods can be broadly categorized into two approaches:
Transformation-based methods, which apply linear scaling [3] or non-linear functions [4] to enhance dynamic range.
Reconstruction-based methods, which attempt to recover lost information in saturated regions [5,6].
Although these methods produce visually plausible results, most operate with fixed algorithms that struggle to adapt to varying scene conditions. Consequently, visually unnatural artifacts may appear in overexposed regions, reducing the accuracy and perceptual quality of HDR reconstruction when viewed on HDR-enabled displays.
In recent years, artificial intelligence has advanced rapidly and has been widely applied to computational imaging tasks. Deep learning methods have been used to improve feature representation and image quality and have more recently been extended to HDR reconstruction tasks [7]. A common deep learning strategy is to extract features from differently exposed regions and learn a mapping to HDR ground truth for image reconstruction. In 2017, Eilertsen et al. proposed the HDRCNN [8] network for HDR image reconstruction, designing a hybrid dynamic range autoencoder within a CNN framework [9], which resembles a deep autoencoder architecture. Marnerides et al. introduced ExpandNet [10], which is also based on a CNN architecture. ExpandNet is characterized as a fully automatic, end-to-end model that requires no manual parameter tuning. It comprises three branches that handle different tasks according to image features, such as frequency distribution.
The central idea of FHDR [11] is the adoption of a feedback mechanism, drawing inspiration from the architectural designs of other feedback networks. After each iteration, the HDR image is produced by a reconstruction block that follows the feedback block. Chen et al. proposed HDRUNet [12], which achieved notable results in the NTIRE 2021 competition. They observed that most network models were still ineffective at handling noise and quantization errors. Their architecture consists of several distinct modules, each performing a specific function. These include a base network built on the U-Net architecture [13], which can be trained with fewer images, and a condition network that integrates spatial feature transform (SFT) layers to provide spatially varying adjustments. The condition network takes an LDR image as input and predicts the corresponding condition maps, which are then supplied to the base network.
With the rapid proliferation of mobile devices, Liu et al. proposed Mobile-HDR [14], constructing a dataset using three mobile phone cameras to collect paired LDR-HDR images in the raw image domain, covering different noise levels. They then introduced a transformer-based model with a pyramid cross-attention alignment module to aggregate highly correlated features from different exposure frames, performing joint HDR denoising and fusion.
Most of the models described above take a single LDR image as input. SingleHDR [15] also generates an HDR image from a single LDR image. It integrates the LDR imaging pipeline into the model by explicitly modeling the HDR-to-LDR transformation, including dynamic range clipping, nonlinear mapping via the camera response function, and quantization. Another class of models uses multiple images as inputs and targets dynamic scenes that contain moving objects. HDR-GAN [16] is based on the generative adversarial network (GAN) architecture [17], and its network design specifically addresses the ghosting problem in dynamic scenes. The network consists of a generator (G) and a discriminator (D) trained through adversarial learning, with the input comprising three LDR images of different exposure levels that include motion information. Recent studies on efficient visual representation learning also suggest that multi-scale feature extraction and adaptive fusion are important for balancing accuracy and computational efficiency. For example, Zhang et al. proposed a pyramid-structured multi-scale transformer for efficient semi-supervised video object segmentation with adaptive fusion [18]. Although their task differs from single-image HDR reconstruction, their use of pyramid-structured multi-scale representation and scale-adaptive fusion is conceptually related to the present study, in which ResNet50-based [19] multi-scale feature extraction, attention-guided skip fusion, and SE-based channel recalibration are used to enhance structural information while reducing computational cost. This related work further supports the motivation for exploring efficient feature fusion mechanisms in reconstruction and enhancement tasks.
Based on the above review, this study is motivated by three practical challenges in single-image HDR reconstruction. First, although deep learning-based HDR reconstruction has shown promising results, many consumer-captured images are still LDR images obtained under complex lighting conditions, such as overexposure, underexposure, noise, blur, and local contrast variation. These factors directly affect the reconstruction of saturated regions and structural details, making robust training-data preparation an important issue. However, the configuration-level effects of combined preprocessing and augmentation operations, such as unsharp masking, denoising, Gaussian blur, and brightness–contrast adjustment, remain insufficiently discussed in HDR reconstruction.
Second, practical HDR reconstruction should not only pursue image quality but also consider model complexity, training time, and hardware accessibility. Existing CNN-based models such as HDRCNN, ExpandNet, and HDRUNet have demonstrated the feasibility of HDR reconstruction, but their training requirements or backbone complexity may limit deployment on consumer-grade hardware. This issue is crucial because many real-world imaging applications, including mobile photography, multimedia display, and local image enhancement, require efficient models that can be trained or deployed without high-end computing resources.
Third, the feature extraction backbone plays an important role in balancing representation capacity and computational efficiency. A heavy backbone may improve feature representation but increase parameter count and training cost, whereas an overly lightweight backbone may reduce computational burden but weaken multi-scale structural representation. Therefore, this study investigates a practical HDRCNN-derived framework that combines composite augmentation configuration analysis, a ResNet50-based backbone with attention and SE-based feature recalibration, and hardware-aware training optimization. The goal is not to claim a fundamentally new HDR reconstruction theory, but to examine whether an efficiency-oriented reconstruction pipeline can improve structural similarity and reduce training costs under a limited but reproducible experimental setting.
In response to these issues, this work presents an engineering-oriented modification of HDRCNN. It does not claim to introduce a fundamentally new HDR reconstruction theory, nor does it provide a complete component-level ablation study.
In this paper, HDRCNN is adopted as the baseline framework because it provides a representative CNN-based architecture for single-image HDR reconstruction. We first compare several composite preprocessing and augmentation configurations to examine how training-data composition affects reconstruction quality. We then integrate mixed-precision training and cosine annealing learning-rate scheduling to improve training efficiency. Finally, we replace the original VGG16 [20] backbone with a ResNet50-based encoder enhanced with attention blocks and squeeze-and-excitation (SE) blocks. This design aims to reduce computational cost while maintaining structural reconstruction quality. Because the present evaluation is limited in scale, the results should be interpreted as practical evidence of an efficient HDRCNN-derived framework rather than as a definitive benchmark comparison across all HDR reconstruction models.
In summary, this paper makes the following contributions:
Composite augmentation configuration analysis: We conduct a configuration-level comparison of different composite preprocessing and augmentation settings for single-image HDR reconstruction under the HDRCNN framework. The best configuration achieves PSNR of 22.10 dB and SSIM of 0.7714 in the adopted experimental setting, suggesting that the composition of training data can influence reconstruction quality. However, this comparison does not isolate the independent effect of each augmentation operation, and operation-level ablation remains necessary in future work.
Hardware-aware training optimization: We integrate mixed-precision training [21] with cosine annealing and validation loss-based learning-rate adjustment into the HDRCNN pipeline. Under the adopted setting, this optimization reduces training time from 11 h 14 min to 7 h 56 min and improves validation stability. This result indicates that hardware-aware training can reduce computational burden, although its independent contribution should be further verified through controlled experiments.
Efficient backbone redesign: We replace the VGG16 encoder in HDRCNN with a ResNet50-based architecture augmented with attention blocks and SE blocks. This modification reduces the number of parameters from approximately 138 M to 25 M and shortens training time in the adopted setting. In the single-crop comparison, the ResNet50-based model improves SSIM from 0.2705 to 0.8512 compared with the VGG16-based HDRCNN setting. In the model-level comparison, the proposed model achieves the shortest training time and slightly higher PSNR than HDRUNet, whereas HDRUNet obtains a higher SSIM. These results indicate a trade-off among computational efficiency, pixel-wise fidelity, and structural similarity. Therefore, the proposed model should be viewed as a practical, efficiency-oriented alternative rather than a universally superior HDR reconstruction model.
It should be noted that the current study has several scope limitations. The evaluation is conducted on the SI-HDR dataset with a limited testing setting, and the comparison with HDRUNet and ExpandNet follows the adopted model-specific configurations rather than a fully unified benchmark protocol. In addition, the augmentation experiment compares composite configurations rather than isolating each individual preprocessing operation, and the evaluation relies on PSNR and SSIM without including HDR-specific or learned perceptual metrics such as HDR-VDP-2, TMQI, or LPIPS. These limitations are explicitly discussed in the experimental analysis and conclusion, and they define the directions for future work.
This paper is organized as follows. Section 2 introduces the preliminary concepts and background knowledge related to HDR image reconstruction and the key techniques referenced in this study. Section 3 presents the proposed scheme, detailing the overall framework, implementation strategies, and optimization methods. Section 4 provides experimental results and discussions, including performance evaluations and comparative analyses of different models and preprocessing techniques. Finally, Section 5 concludes the paper and outlines potential directions for future work.
3. The Proposed Scheme
The proposed scheme consists of three engineering-oriented phases: (1) composite dataset preprocessing and augmentation, (2) hardware-aware training optimization, and (3) backbone replacement with attention-based feature recalibration. The architecture diagram is shown in Figure 1.
It should be emphasized that the purpose of this section is to describe a practical HDRCNN-derived reconstruction pipeline rather than to introduce a fundamentally new HDR reconstruction theory. The augmentation settings are designed as composite configurations for configuration-level comparison, not as a complete operation-level ablation study. Similarly, the backbone and training optimization choices are motivated by practical considerations of training efficiency, representation capacity, and deployability on consumer-grade hardware.
3.1. Dataset Preprocessing and Data Augmentation
Data preprocessing and augmentation can influence feature learning and reconstruction quality in deep learning-based HDR reconstruction. The SI-HDR dataset used in this study contains high-quality HDR scenes with exposure variations; however, the available training data remain limited in scale. Therefore, composite augmentation configurations are used to increase the diversity of edge sharpness, noise characteristics, blur level, brightness, and contrast.
The purpose of this experiment is not to isolate the independent contribution of each augmentation operation. Instead, we compare several composite augmentation configurations to examine whether different training-data compositions are associated with changes in HDR reconstruction quality. The results should therefore not be read as a complete ablation study.
3.1.1. Unsharp Masking
The unsharp mask (USM) [24], originally derived from silver halide photography, works by blurring the image, subtracting the blurred version from the original to obtain an edge mask, and adding a scaled copy of that mask back to the original to enhance edge sharpness. In this study, the radius parameter is set to 10 pixels (defining the extent of the blur used for edge detection), and the amount parameter is set to 0.5 to sharpen edges moderately while limiting noise amplification.
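The unsharp-masking operation can be sketched on a 1D signal using the common formulation sharpened = original + amount × (original − blurred). This is a minimal illustration, not GIMP's implementation: the Gaussian kernel, the mapping sigma = radius / 2, the edge handling, and all function names are assumptions. Only the radius (10) and amount (0.5) come from the text.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, radius, sigma):
    """Gaussian blur with edge replication (indices clamped to the signal)."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        blurred.append(acc)
    return blurred

def unsharp_mask_1d(signal, radius=10, amount=0.5, sigma=None):
    """Add a scaled (original - blurred) edge mask back to the original."""
    if sigma is None:
        sigma = radius / 2.0  # assumed radius-to-sigma mapping
    blurred = blur_1d(signal, radius, sigma)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]

# A step edge: flat regions stay unchanged, the edge gains over/undershoot.
step = [0.2] * 20 + [0.8] * 20
sharpened = unsharp_mask_1d(step)
```

On the step signal, samples far from the edge are left untouched, while samples next to the edge overshoot on the bright side and undershoot on the dark side, which is the local contrast boost perceived as sharpening.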
3.1.2. Denoising
Inspired by HDRUNet, we apply denoising as part of the data augmentation. We use the GEGL (Generic Graphics Library) denoise filter [25] in GIMP, which reduces random noise caused by image sensors or compression. The strength parameter is set to 5, balancing noise reduction and the preservation of image clarity.
3.1.3. Gaussian Blur
We apply the Gaussian blur filter [26] in GIMP to soften images. The blur parameters are set to 2.5 for both the horizontal (X) and vertical (Y) directions, meaning each pixel’s color value is smoothed based on the surrounding pixels within a 2.5-pixel radius. This operation smooths edges and reduces abrupt intensity transitions.
3.1.4. Brightness–Contrast Adjustment
Brightness and contrast adjustment is important for image visibility. We use GIMP to adjust brightness and contrast, using a different adjustment range for each format. For PNG images, the range is from −127 to 127, while for HDR images it is from −0.5 to 0.5. Scaling the adjustment values proportionally (a ratio of 254:1 between the two ranges) keeps the visual effect consistent across formats.
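The 254:1 proportional mapping between the two adjustment ranges can be written as a short conversion. The sketch below assumes a simple linear scaling between the ranges given in the text; the function and constant names are hypothetical.

```python
PNG_RANGE = (-127.0, 127.0)  # adjustment range stated for PNG images
HDR_RANGE = (-0.5, 0.5)      # adjustment range stated for HDR images

def png_offset_to_hdr(offset_png):
    """Map a PNG-domain adjustment value to the HDR domain (254:1 ratio)."""
    if not PNG_RANGE[0] <= offset_png <= PNG_RANGE[1]:
        raise ValueError("offset outside the PNG adjustment range")
    span_png = PNG_RANGE[1] - PNG_RANGE[0]  # 254
    span_hdr = HDR_RANGE[1] - HDR_RANGE[0]  # 1
    return offset_png * span_hdr / span_png

# The range end points map onto each other, and zero stays zero.
hdr_max = png_offset_to_hdr(127)
hdr_min = png_offset_to_hdr(-127)
```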
These techniques are intended to improve the model’s training effectiveness and generalization ability. Based on the above preprocessing methods, we prepared three datasets with different preprocessing ratios for comparison.
Dataset 1: 100% original.
Dataset 2: 20% original, 20% unsharp masking, 30% denoising, and 30% blurring.
Dataset 3: 15% original, 15% unsharp masking, 20% denoising, 20% blurring, and 30% brightness and contrast adjustment.
The ratios in the three dataset configurations were selected as practical exploratory settings rather than theoretically optimized proportions. Dataset 1 serves as the unaugmented baseline. Dataset 2 introduces edge enhancement, denoising, and blur to simulate common variations in image sharpness and noise. Dataset 3 further includes brightness–contrast adjustment because luminance variation is central to HDR reconstruction. Since multiple operations and proportions are changed simultaneously, the results should be interpreted as configuration-level evidence rather than as proof of the marginal contribution of any single augmentation operation.
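The three dataset compositions can be expressed as a configuration table together with a helper that converts percentages into image counts. The percentages are taken from the text. The largest-remainder rounding rule, the key names, and the image counts used in the example are assumptions for illustration, since the exact training-set size is not stated here.

```python
DATASET_CONFIGS = {
    "dataset_1": {"original": 100},
    "dataset_2": {"original": 20, "unsharp_mask": 20,
                  "denoise": 30, "blur": 30},
    "dataset_3": {"original": 15, "unsharp_mask": 15, "denoise": 20,
                  "blur": 20, "brightness_contrast": 30},
}

def allocate(num_images, percentages):
    """Split num_images across operations by percentage.

    Uses largest-remainder rounding so the counts always sum to num_images.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    exact = {name: num_images * p / 100 for name, p in percentages.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = num_images - sum(counts.values())
    # Give the remaining images to the largest fractional parts first.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Hypothetical training-set size of 200 images.
counts_d3 = allocate(200, DATASET_CONFIGS["dataset_3"])
```

Keeping the ratios in one table makes it straightforward to add single-operation configurations later, which is what an operation-level ablation would require.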
3.2. Training Code Optimization
During training, HDR reconstruction networks require repeated convolutional operations and large-scale floating-point computation. To improve practical training efficiency, mixed-precision training is integrated into the HDRCNN pipeline. In this setting, FP16 computation is used for selected forward and backward operations to reduce memory usage and computational cost, while FP32 precision is retained for numerically sensitive operations such as gradient accumulation and weight updates. This design is intended to benefit from FP16 acceleration on modern GPUs while maintaining training stability.
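The reason for retaining higher precision in weight updates can be seen in a small numerical experiment. The sketch below uses only the standard library to round values to IEEE 754 half precision. The update size, the step count, and the use of a Python float as a stand-in for the FP32 master copy are assumptions for illustration; it does not reproduce the actual training code.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

update = 1e-4  # hypothetical small weight update (learning rate * gradient)
steps = 100

# (a) Weight stored and updated purely in FP16: near 1.0 the FP16 spacing
# is about 0.001, so each small update is rounded away.
w_fp16_only = to_fp16(1.0)
for _ in range(steps):
    w_fp16_only = to_fp16(w_fp16_only + to_fp16(update))

# (b) Higher-precision master weight (a Python float stands in for FP32):
# the FP16 update values are accumulated without being lost.
w_master = 1.0
for _ in range(steps):
    w_master += to_fp16(update)
w_fp16_view = to_fp16(w_master)  # FP16 copy used for the next forward pass
```

In case (a) the weight never moves from 1.0, whereas in case (b) the master copy accumulates the updates and its FP16 view eventually changes. This is the failure mode that keeping weight updates in FP32 is meant to avoid.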
In addition, cosine annealing learning-rate scheduling is adopted to gradually reduce the learning rate during training, allowing larger updates in the early stage and more stable refinement in the later stage. A validation loss-based learning-rate reduction is also used when the validation loss does not improve for consecutive epochs. These training choices are motivated by practical convergence stability and computational efficiency. However, the present study evaluates them as part of the overall optimized pipeline; their independent effects are not isolated and should be examined in future controlled experiments.
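The two learning-rate mechanisms can be sketched in a framework-independent form. The initial rate (0.01), the 200 epochs, the patience of 10 epochs, and the reduction to 10% are the settings reported in the experimental section. The minimum rate of zero, the multiplicative way the two mechanisms are combined, and all names are assumptions.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, base_lr=0.01, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

class PlateauScale:
    """Multiply the rate by `factor` after `patience` epochs without a
    new best validation loss."""

    def __init__(self, patience=10, factor=0.1):
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0
        self.scale = 1.0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.scale *= self.factor
                self.bad_epochs = 0
        return self.scale

# Assumed combination: effective rate = cosine schedule * plateau scale.
plateau = PlateauScale()
scales = [plateau.step(1.0) for _ in range(11)]  # loss never improves
effective_lr = cosine_annealing_lr(10) * scales[-1]
```

The first call records the best loss; the following ten calls see no improvement, so the scale drops to 0.1 on the eleventh call.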
3.3. Model Architecture Replacement
The original HDRCNN model adopts a VGG16-based feature extraction architecture. Although VGG16 provides a straightforward and widely used convolutional backbone, it contains a large number of parameters and may increase computational cost. Therefore, this study replaces the VGG16 encoder with a ResNet50-based encoder as a practical compromise between representation capacity and parameter efficiency.
ResNet50 is selected because its residual connections facilitate gradient propagation and help mitigate degradation in deeper networks. Compared with VGG16, ResNet50 provides deeper hierarchical feature extraction with fewer parameters. Compared with lighter alternatives such as ResNet18 or MobileNet, ResNet50 may preserve stronger multi-scale representation capacity, which is important for reconstructing structural details in HDR images. Nevertheless, this study does not claim that ResNet50 is the optimal backbone among all possible lightweight models; a systematic comparison with ResNet18, MobileNet, and other efficient backbones remains for future work (Table 1).
The modified architecture further incorporates attention blocks, SE blocks, and skip-connection-based feature fusion. Attention blocks are inserted at decoder–encoder fusion stages because these stages combine high-level semantic features from the decoder with spatial details from the encoder. The attention mechanism is therefore used to emphasize spatially relevant features before fusion. SE blocks are placed after decoder convolutional blocks to recalibrate channel-wise responses after feature aggregation, allowing the network to adjust the relative importance of different feature channels. Finally, channel-wise concatenation is used to fuse decoder features with encoder skip-connection outputs, preserving multi-scale spatial information that is important for HDR reconstruction.
These design choices are motivated by common architectural principles in image restoration and enhancement networks. However, the present experiments evaluate the ResNet50, attention block, and SE block combination as an integrated architecture. The independent contribution of each component is not isolated in this study and should be examined through future component-level ablation.
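The channel recalibration performed by an SE block follows a squeeze, excitation, and scale pattern, sketched below on nested lists. This is a generic illustration rather than the layer configuration used in this study: the weights are hand-picked, biases are omitted, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_recalibrate(feature_maps, w_reduce, w_expand):
    """Apply squeeze-and-excitation to C feature maps of size H x W.

    feature_maps: list of C channels, each a list of rows.
    w_reduce: (C // r) x C weights of the bottleneck layer (no bias).
    w_expand: C x (C // r) weights of the expansion layer (no bias).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
             for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

# Two 2 x 2 channels with hand-picked weights (illustrative only).
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[3.0, 3.0], [3.0, 3.0]]]
out, gates = se_recalibrate(features,
                            w_reduce=[[1.0, 0.0]],
                            w_expand=[[2.0], [-2.0]])
```

With these weights the first channel receives a gate above 0.5 and the second a gate below 0.5, showing how the block can emphasize some channels and suppress others.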
The modified HDRCNN architecture is summarized in Table 2. It consists of a ResNet50-based encoder, decoder blocks with SE-based channel recalibration, attention-guided skip fusion, and a final convolutional reconstruction layer. Compared with the original VGG16-based HDRCNN, this design reduces the number of parameters and introduces feature recalibration mechanisms. The purpose is to improve structural feature preservation while maintaining practical training efficiency.
4. Experimental Results and Discussion
This section presents the quantitative and qualitative results obtained from the proposed preprocessing, training optimization, and architecture modification strategies. The SI-HDR dataset [27] is adopted as the experimental dataset. It contains 183 HDR images captured using a Canon 5D Mark III camera (Canon Inc., Tokyo, Japan) and includes natural landscapes, urban scenes, and day/night variations. Multiple exposure levels of RAW images were merged into HDR images using an estimator that accounts for photon noise, and the dataset provides camera response function information for simulating the camera imaging pipeline.
Following the HDRCNN-based experimental setting adopted in this study, five images are randomly selected for testing. We acknowledge that this is a limited testing setting and does not constitute a full benchmark evaluation. Therefore, the reported quantitative results should be interpreted as preliminary evidence under the adopted experimental protocol rather than as statistically comprehensive performance estimates. Future work should evaluate the model on a full test split and additional standard HDR benchmarks, reporting mean and standard deviation across a larger testing set.
To improve the clarity and reproducibility of the experimental setup, the main training settings are summarized here. Unless otherwise specified, Dataset 3 is used as the main training dataset for the optimized HDRCNN and the proposed ResNet50-based model. For these experiments, the batch size is set to 16, the number of training epochs is set to 200, and the initial learning rate is set to 0.01. The optimized training pipeline applies mixed-precision training, cosine annealing learning-rate scheduling, and validation loss-based learning-rate reduction. Specifically, the learning rate is further reduced to 10% of its current value when the validation loss does not improve for 10 consecutive epochs. The image input and output resolution of the model is fixed at 256 × 256 pixels. PSNR and SSIM are used as the quantitative evaluation metrics. For the comparison models, HDRUNet is trained using a batch size of 4, 200 epochs, and a learning rate of 0.01, while ExpandNet is trained using a batch size of 12, 10,000 epochs, and a learning rate of 0.00007, following their documented model-specific settings. Therefore, these settings should be interpreted as model-specific training configurations rather than a fully unified benchmark protocol.
HDRCNN is used as the baseline model, while ExpandNet and HDRUNet are included as reference models for comparison. The image quality is evaluated using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), defined in Equation (2) and Equation (3), respectively. PSNR mainly reflects pixel-wise reconstruction fidelity, whereas SSIM evaluates luminance, contrast, and structural similarity. Therefore, these two metrics capture different aspects of HDR reconstruction quality. We acknowledge that PSNR and SSIM alone do not fully represent perceptual HDR quality. HDR-specific and learned perceptual metrics, such as HDR-VDP-2, TMQI, and LPIPS, are not included in the current evaluation. Consequently, the results should be interpreted as a distortion and structural-similarity analysis rather than as a complete perceptual HDR quality assessment.
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_I^{2}}{\mathrm{MSE}}\right) \qquad (2)$$
$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha}\,[c(x, y)]^{\beta}\,[s(x, y)]^{\gamma} \qquad (3)$$
In Equation (2), $MAX_I$ represents the maximum possible pixel value of the image, and MSE denotes the mean squared error between the reference and reconstructed images. In Equation (3), $l(x, y)$, $c(x, y)$, and $s(x, y)$ denote the luminance, contrast, and structure components, respectively, computed from two image windows $x$ and $y$ of equal size extracted from corresponding locations in the compared images. The parameters α, β, and γ are positive constants that adjust the relative importance of each component.
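Both metrics can be sketched in a few lines, assuming their standard definitions. The PSNR function follows the maximum-value and MSE description given for Equation (2). The SSIM function is a simplified single-window version with α = β = γ = 1 and the commonly used stabilizing constants K1 = 0.01 and K2 = 0.03; these choices, and the use of flat pixel lists instead of sliding windows, are assumptions and not the evaluation code of this study.

```python
import math

def psnr(reference, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    mse = sum((a - b) ** 2
              for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def ssim_single_window(x, y, max_value=1.0, k1=0.01, k2=0.03):
    """SSIM over one window, with alpha = beta = gamma = 1."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

reference = [0.0, 0.25, 0.5, 0.75]
shifted = [v + 0.1 for v in reference]  # uniform brightness offset
flipped = list(reversed(reference))     # structure reversed

psnr_shifted = psnr(reference, shifted)
ssim_shifted = ssim_single_window(reference, shifted)
ssim_flipped = ssim_single_window(reference, flipped)
```

The uniform offset lowers PSNR but keeps SSIM high because the structure is preserved, while the reversed signal yields a low SSIM. This illustrates why the two metrics capture different aspects of reconstruction quality.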
4.1. Experimental Results of Dataset Preprocessing and Data Augmentation
This experiment compares the configuration-level effects of different composite preprocessing strategies on HDRCNN training. Three dataset configurations are evaluated (Table 3): Dataset 1 serves as the unaugmented baseline, Dataset 2 introduces unsharp masking, denoising, and Gaussian blur, and Dataset 3 further incorporates brightness–contrast adjustment. Each configuration was trained three times under identical hyperparameter settings, and the best-performing run, determined by minimum validation loss, is reported for training loss, validation loss, PSNR, and SSIM.
It should be emphasized that this experiment is a composite configuration comparison rather than a fully controlled operation-level ablation study. Because multiple augmentation operations and ratios are changed simultaneously, the independent contribution of each augmentation technique cannot be isolated from the present results. Therefore, the following analysis focuses on the overall behavior of each dataset configuration rather than making causal claims about individual preprocessing operations.
The results show that Dataset 3 achieves the best PSNR and SSIM among the three tested configurations and also obtains lower training and validation losses. Notably, the computational efficiency, as measured by total training duration, remains comparable across all three dataset configurations, indicating that the quality improvements were achieved without additional computational overhead.
The experimental results presented in Table 3 show that the choice of preprocessing configuration has a marked effect on HDRCNN performance. The following observations are made at the configuration level and should not be interpreted as isolated effects of individual augmentation operations.
Unsharp Masking (USM): USM is intended to enhance edge information in the training data. Comparing Dataset 2 and Dataset 3, the latter achieves a gain of 3.70 dB in PSNR and 0.1809 in SSIM. However, because multiple augmentation proportions differ simultaneously between these two configurations, the present results do not isolate the individual contribution of USM. Instead, they suggest that a lower USM proportion, when combined with other augmentation adjustments, is associated with improved reconstruction quality under the current experimental setting.
Denoising: Denoising augmentation is expected to improve robustness to noise in real-world LDR inputs. Although Dataset 3 contains a smaller proportion of denoised samples than Dataset 2, it achieves a 50.5% reduction in validation loss and a 3.70 dB improvement in PSNR. These results suggest that denoising can be effective as part of a broader augmentation configuration, but its standalone contribution cannot be isolated from the present experiments.
Blur: Gaussian blur is used to simulate real-world capture imperfections such as motion blur and optical aberrations. Compared with Dataset 2, Dataset 3 uses a lower blur proportion yet achieves a 42.6% lower training loss and a 0.1809 higher SSIM. This indicates that the augmentation composition in Dataset 3 is more effective overall, although the current results are insufficient to determine the marginal effect of blur alone.
Brightness and Contrast Adjustment: Dataset 3, which includes brightness–contrast adjustment as part of its composite configuration, achieves the best PSNR and SSIM among the three settings. Relative to Dataset 1, Dataset 3 improves PSNR from 19.05 dB to 22.10 dB and SSIM from 0.6444 to 0.7714 without increasing training time. Relative to Dataset 2, it also improves PSNR and SSIM. These results suggest that the overall Dataset 3 composition is more suitable for the adopted HDRCNN training setting. However, because brightness–contrast adjustment is introduced together with changes in the ratios of other operations, its independent effect cannot be confirmed from the present experiment.
Furthermore, the experimental results suggest that the composite configuration used in Dataset 3 is more effective than the other tested configurations under the adopted setting.
Training and Validation Loss: Dataset 3 achieved the lowest training loss (0.0159) and validation loss (0.0196), indicating that this configuration enables more effective model optimization and better generalization to unseen data. Notably, Dataset 3 achieved a substantial reduction in validation loss, outperforming Dataset 1 and Dataset 2 by approximately 91% and 51%, respectively.
Image Quality Metrics: The results for Dataset 3 are further supported by image quality metrics, with the highest PSNR (22.10 dB) and SSIM (0.7714). Compared to Dataset 1 (PSNR: 19.05 dB, SSIM: 0.6444), Dataset 3 provides substantial improvements in both objective quality measures.
Computational Efficiency and Comparative Summary: Dataset 3 achieves better reconstruction results among the tested configurations while maintaining the same training time as Dataset 1 (7 h 41 min), suggesting that the quality improvements are associated with a more effective augmentation configuration rather than increased computational cost.
Table 4 summarizes the relative performance changes in each dataset with respect to the baseline, providing a consolidated comparison of the augmentation configurations.
As shown in Table 4, Dataset 3 reduces training loss by 79.5% and validation loss by 90.7% relative to the baseline Dataset 1, while improving PSNR by 3.05 dB and SSIM by 0.127. Importantly, these gains are achieved without any increase in training time. These results indicate that the augmentation configuration adopted in Dataset 3 is more effective overall, although the isolated contribution of each individual preprocessing operation requires further ablation study.
4.2. Experimental Evaluation of Training Optimization
Dataset 3 was employed as the baseline training dataset with standardized hyperparameters: batch size of 16, 200 training epochs, and an initial learning rate of 0.01. We implemented two key optimization strategies to enhance computational efficiency while preserving model performance:
Dynamic Learning Rate Scheduling: The standard fixed learning rate approach was replaced with a cosine annealing schedule that gradually decreased from an initial value of 0.01 throughout the training process. Additionally, the learning rate was further reduced to 10% of its current value whenever the validation loss failed to decrease significantly over 10 consecutive epochs. This adaptive strategy improved convergence stability, as reflected by the smoother validation loss curve shown in Figure 2.
Mixed-Precision Training: By leveraging 16-bit floating-point operations for forward passes while maintaining 32-bit precision for gradient accumulation and weight updates, we substantially reduced memory requirements and computational overhead. This technique was particularly effective for the memory-intensive encoder layers while preserving numerical stability.
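The learning-rate schedule described above can be sketched in plain Python as follows. This is a minimal illustration, not the training code used in this study: the names `cosine_lr`, `PlateauScale`, and `learning_rate` are hypothetical, the way the two mechanisms are combined (a plateau factor multiplied onto the cosine value) is an assumption, and `min_delta` is an assumed threshold for what counts as a significant decrease.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.01, min_lr=0.0):
    """Cosine-annealed learning rate for a zero-based epoch index."""
    progress = epoch / max(1, total_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class PlateauScale:
    """Multiplicative factor that shrinks when the validation loss stalls."""

    def __init__(self, factor=0.1, patience=10, min_delta=1e-4):
        self.factor = factor          # reduce to 10% of the current value
        self.patience = patience      # number of stalled epochs tolerated
        self.min_delta = min_delta    # assumed meaning of "significantly"
        self.best = float("inf")
        self.stalled_epochs = 0
        self.scale = 1.0

    def update(self, val_loss):
        """Record one epoch's validation loss and return the current scale."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stalled_epochs = 0
        else:
            self.stalled_epochs += 1
            if self.stalled_epochs >= self.patience:
                self.scale *= self.factor
                self.stalled_epochs = 0
        return self.scale


def learning_rate(epoch, total_epochs, plateau):
    """Combine both mechanisms: cosine value times the plateau factor."""
    return cosine_lr(epoch, total_epochs) * plateau.scale
```

In a training loop, `plateau.update(val_loss)` would be called once per epoch after validation, and `learning_rate(epoch, 200, plateau)` would supply the rate for the next epoch.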
The results in Table 5 and Figure 2 indicate that the optimized training pipeline improved training efficiency and convergence behavior under the adopted setting.
Training Time Reduction: Overall training time decreased from 11 h 14 min to 7 h 56 min, a reduction of approximately 29.4%.
Convergence Quality: The optimized implementation achieved a slightly lower final training loss (0.0159 vs. 0.0162) and a substantially lower validation loss (0.0196 vs. 0.0408), a reduction of about 52% that suggests improved generalization.
Training Stability: The loss curves clearly illustrate that the optimized version achieved faster initial convergence (note the steeper decline in the first 25 epochs) and eliminated the validation loss fluctuations present in the unoptimized version, suggesting more stable gradient updates throughout training.
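The relative changes quoted in this list can be recomputed from the reported values with a few lines of Python. The helper names `to_minutes` and `percent_reduction` are hypothetical and are used only for this check.

```python
def to_minutes(hours, minutes):
    """Convert a duration given as hours and minutes into minutes."""
    return hours * 60 + minutes


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Durations and losses as reported for the unoptimized and optimized runs.
time_reduction = percent_reduction(to_minutes(11, 14), to_minutes(7, 56))
val_loss_reduction = percent_reduction(0.0408, 0.0196)

print(f"Training time reduction: {time_reduction:.1f}%")
print(f"Validation loss reduction: {val_loss_reduction:.1f}%")
```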
The reduced training time and lower validation loss indicate that the optimized training pipeline is beneficial under the adopted HDRCNN setting. However, because mixed-precision training, cosine annealing, and validation loss-based learning-rate reduction are applied together, the present experiment does not isolate the independent contribution of each optimization component. Therefore, the results should be interpreted as evidence for the effectiveness of the overall training optimization pipeline rather than as component-level ablation evidence.
4.3. Changing Model Architecture
Using Dataset 3 as the training dataset, identical hyperparameters were maintained across experiments: batch size of 16, 200 epochs, and a learning rate of 0.01. Both the original VGG16 and proposed ResNet50-based models incorporated dynamic learning rate adjustment and mixed-precision training. The training time and final loss values after architectural replacement with ResNet50 are presented in Table 6, and the corresponding training and validation loss curves are shown in Figure 3. The experimental results in Table 6 and Figure 3 show that the ResNet50-based model reduces the number of parameters and shortens the training time under the adopted setting. The model also achieves higher PSNR and SSIM than the VGG16-based HDRCNN setting. However, Table 6 and Figure 3 also show that the ResNet50-based model has higher final training and validation losses than the VGG16-based model. This discrepancy indicates that the training loss and the evaluation metrics capture different aspects of reconstruction quality and should be interpreted carefully.
The apparent discrepancy between higher loss values and improved PSNR/SSIM can be explained by the difference between the optimization objective and the evaluation metrics. The training loss directly reflects the adopted pixel-level reconstruction objective, whereas SSIM emphasizes luminance, contrast, and structural similarity. The ResNet50-based architecture, together with attention-guided feature fusion and SE-based channel recalibration, may preserve structural patterns more effectively even when the final loss value is higher. Therefore, a higher loss does not necessarily imply inferior structural reconstruction quality.
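A toy example makes this divergence concrete. The sketch below is a simplified illustration, not the evaluation code used in this study: `global_ssim` computes SSIM over a single window covering the whole signal with population statistics, whereas standard SSIM averages over local windows, and the three four-pixel signals are made-up values chosen only to expose the effect.

```python
import math


def mse(ref, img):
    """Mean squared error between two equally sized pixel lists."""
    return sum((r - i) ** 2 for r, i in zip(ref, img)) / len(ref)


def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB (undefined for identical inputs)."""
    return 10.0 * math.log10(peak ** 2 / mse(ref, img))


def global_ssim(ref, img, peak=1.0):
    """Single-window SSIM; a simplification of the usual local-window form."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    n = len(ref)
    mu_r, mu_i = sum(ref) / n, sum(img) / n
    var_r = sum((r - mu_r) ** 2 for r in ref) / n
    var_i = sum((i - mu_i) ** 2 for i in img) / n
    cov = sum((r - mu_r) * (i - mu_i) for r, i in zip(ref, img)) / n
    luminance = (2 * mu_r * mu_i + c1) / (mu_r ** 2 + mu_i ** 2 + c1)
    structure = (2 * cov + c2) / (var_r + var_i + c2)
    return luminance * structure


reference = [0.1, 0.5, 0.1, 0.5]      # alternating pattern (the "structure")
flat = [0.3, 0.3, 0.3, 0.3]           # correct mean, structure lost
shifted = [0.45, 0.85, 0.45, 0.85]    # structure kept, brightness offset

for name, candidate in (("flat", flat), ("shifted", shifted)):
    print(name, round(psnr(reference, candidate), 2),
          round(global_ssim(reference, candidate), 3))
```

In this constructed case, the flat candidate has the smaller pixel-wise error and therefore the higher PSNR, while the shifted candidate keeps the pattern and therefore scores the higher SSIM.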
Nevertheless, this divergence also reveals a limitation of the current loss design. The adopted loss function may not be fully aligned with perceptual or structural quality metrics. Future work should consider SSIM-aware, perceptual, or HDR-specific loss functions to better align the training objective with the desired reconstruction quality.
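One possible form of such an SSIM-aware objective is sketched below. It is a hypothetical example and not the loss used in this study: $\hat{H}$ and $H$ denote the reconstructed and ground-truth HDR images, $\mathcal{L}_{\text{pix}}$ stands for the existing pixel-level reconstruction loss, and the weight $\alpha$ is an assumed hyperparameter that would need to be tuned.

```latex
\mathcal{L}_{\text{total}}(\hat{H}, H)
  = \alpha \, \mathcal{L}_{\text{pix}}(\hat{H}, H)
  + (1 - \alpha) \, \bigl( 1 - \mathrm{SSIM}(\hat{H}, H) \bigr),
\qquad 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ recovers the purely pixel-level objective, while smaller values give more weight to structural similarity.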
In this experiment, the ResNet50-based architecture shows practical advantages while maintaining identical hyperparameters. Training time was reduced by approximately 39% (from 7 h 56 min to 4 h 51 min), which is consistent with ResNet50 containing substantially fewer parameters (~25 M) than VGG16 (~138 M), a reduction in model size of roughly 82%.
The ResNet50-based model achieved higher PSNR and SSIM than the VGG16-based HDRCNN setting under the adopted protocol:
PSNR improved from 13.01 dB to 16.08 dB (3.07 dB gain).
SSIM increased from 0.2705 to 0.8512.
The gap between training and validation loss is narrower (0.0446/0.0437) than for VGG16 (0.0159/0.0196), which may indicate better generalization, although both ResNet50 loss values are higher in absolute terms.
The observed improvement in PSNR and SSIM may be associated with the following architectural characteristics of ResNet50:
Residual connections that facilitate gradient flow during backpropagation;
Deeper hierarchical feature extraction that may support multi-scale representation;
Attention-guided skip fusion and SE-based recalibration that may help preserve structural information.
These interpretations provide possible explanations for the observed results. However, because the ResNet50 backbone, attention blocks, and SE blocks are evaluated as an integrated architecture, the individual contribution of each component cannot be isolated in the present experiment.
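As a rough illustration of the SE-based channel recalibration referred to above, the following pure-Python sketch applies the squeeze, excitation, and scale steps to a small feature map. It is not the implementation used in this study: the function name `se_recalibrate`, the nested-list layout, the example weights, and the omission of bias terms are all assumptions made for readability.

```python
import math


def se_recalibrate(feature_map, w_reduce, w_expand):
    """Apply squeeze-and-excitation to a feature map stored as [C][H][W] lists.

    w_reduce has shape [C // r][C] and w_expand has shape [C][C // r];
    both are hypothetical weights, and bias terms are omitted.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
        for weights in w_reduce
    ]
    # Excitation, step 2: expand back to C channels and apply a sigmoid gate.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
        for weights in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Two 2x2 channels and a bottleneck of size one (reduction ratio r = 2).
features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, gates = se_recalibrate(features, [[0.5, 0.5]], [[1.0], [-1.0]])
print(gates)
```

Each gate lies between 0 and 1, so the block can only attenuate channels relative to one another; in a trained network the weights determine which channels are emphasized.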
As shown in Figure 4, the ResNet50-based model shows improved PSNR and SSIM under the adopted setting, particularly in high dynamic range areas, edge preservation, and color fidelity across different illumination levels.
This architectural modification suggests that network design choices can affect HDR reconstruction performance beyond parameter tuning and may provide an efficient solution for improving reconstruction quality while requiring fewer computational resources.
4.4. Comparison with Other Model Architectures
Dataset 3 was also used to train HDRUNet and ExpandNet as reference models. HDRUNet and ExpandNet were trained using their documented model-specific hyperparameter settings rather than a fully unified training protocol. Specifically, HDRUNet used a batch size of 4, 200 epochs, and a learning rate of 0.01, while ExpandNet used a batch size of 12, 10,000 epochs, and a learning rate of 0.00007.
This comparison is intended to provide a practical reference for training cost and reconstruction tendencies under commonly used or documented model-specific settings. It should not be interpreted as a fully controlled same-dataset, same-epoch, same-resolution benchmark comparison. A rigorous fair comparison would require retraining all models under a unified protocol, full test split, and additional perceptual/HDR-specific metrics, which is left for future work.
Figure 5 presents the visual reconstruction results corresponding to the quantitative metrics reported in Table 7. The qualitative comparison in Figure 5 should also be interpreted cautiously. HDRUNet and ExpandNet may enhance background regions more strongly in some cases, which can be advantageous when background visibility is prioritized. However, stronger background enhancement may also affect foreground contrast or local structural consistency. The proposed model tends to preserve structural similarity more effectively according to SSIM, but this does not imply that it is visually preferable in all regions or all application scenarios. This observation further supports the need for additional perceptual and HDR-specific evaluation metrics in future work.
The primary objective in this study was to compare the modified HDRCNN architecture with reference models under documented model-specific settings as reported in their respective publications. This approach is intended to let each reference model operate near its documented performance rather than under potentially suboptimal unified hyperparameters. The training time differs considerably: HDRUNet required over three days, whereas the proposed model converged in under five hours under the adopted comparison protocol. Because batch size and other settings also differ between the models, this disparity reflects both architecture and training configuration; it is nonetheless consistent with the lower training cost targeted by our modifications.
Table 7 shows a clear trade-off among the models compared. HDRUNet achieves the highest SSIM, indicating stronger structural similarity under the adopted metric. In contrast, the proposed model achieves the shortest training time and slightly higher PSNR than HDRUNet. This result suggests that the proposed model provides an efficiency-oriented alternative, but it should not be interpreted as structurally superior to HDRUNet or as universally superior to HDRUNet or ExpandNet. Rather, it prioritizes reduced training cost while retaining competitive pixel-wise fidelity.
The exceptionally high epoch count for ExpandNet follows the original authors’ implementation, where convergence was reported to require significantly more iterations due to the network’s architecture and extremely small learning rate (0.00007). This configuration is necessary to achieve the performance reported in the original publication. Our experiment with reduced epoch counts for ExpandNet resulted in substantially degraded reconstruction quality, which supports the need for extended training despite its computational cost. The stark difference in training requirements between models (10,000 epochs for ExpandNet versus 200 epochs for HDRCNN) further emphasizes our contribution: developing an HDR reconstruction architecture with reduced computational demands under the adopted comparison protocol, thus expanding practical applicability in resource-constrained environments.
The observed divergence, lower SSIM (0.5856 vs. 0.661 for HDRUNet) but higher PSNR (21.72 dB vs. 21.25 dB), indicates that under the adopted protocol the proposed model is marginally stronger in pixel-wise accuracy and weaker in structural similarity than HDRUNet. Several factors are relevant to interpreting this pattern.
The modified HDRCNN architecture was designed to support structural preservation through multi-scale feature fusion, attention mechanisms, and channel-wise recalibration. However, it is trained with a pixel-level reconstruction objective, and pixel-wise error is exactly what PSNR measures. The SSIM gap relative to HDRUNet shows that these architectural components alone do not guarantee higher structural similarity in this comparison.
SSIM evaluates luminance, contrast, and structural patterns and generally correlates more strongly with human visual perception than PSNR. In HDR reconstruction, such perceptual fidelity is particularly important, since the primary objective is to produce images that faithfully represent the full dynamic range as perceived by human observers. The lower SSIM value is therefore a limitation of the proposed model in this comparison, and the 0.47 dB PSNR advantage should not be read as evidence of better perceptual quality. This caveat is especially relevant in HDR reconstruction, where tone mapping can significantly alter individual pixel values while maintaining, or even enhancing, perceptual quality.
Furthermore, this metric divergence is consistent with prior work on the trade-off between distortion measures (such as PSNR) and perceptual quality in image restoration, as discussed by Zhang et al. [28] and Blau and Michaeli [29]. However, because this study evaluates only PSNR and SSIM, additional HDR-specific and learned perceptual metrics are still required to support stronger perceptual-quality claims. In practical applications of HDR imaging, such as computational photography, multimedia display, and visual content creation, structural similarity captured by SSIM can provide complementary information to pixel-wise reconstruction accuracy measured by PSNR. Whether the proposed design leads to more natural tone mapping and detail preservation in high-contrast regions, which are critical aspects of HDR visualization, remains to be verified with such metrics.
To further illustrate the trade-off between reconstruction quality and training efficiency, Figure 6 visualizes the relationship between training time, PSNR, and SSIM for HDRUNet, ExpandNet, and the proposed model. The proposed model requires the shortest training time and achieves the highest PSNR among the compared models, whereas HDRUNet obtains the highest SSIM but requires substantially longer training time. This result suggests that the proposed model provides a computationally efficient alternative, while HDRUNet remains stronger in terms of structural similarity under the adopted comparison protocol.
4.5. Limitations of the Experimental Evaluation
The experimental results should be interpreted with several limitations in mind. First, the testing setting is limited to five randomly selected images from the SI-HDR dataset; therefore, the results do not represent a full benchmark evaluation. Second, the comparison with HDRUNet and ExpandNet follows model-specific training configurations rather than a fully unified comparison protocol. Third, the evaluation uses PSNR and SSIM only. Although these metrics are useful for measuring pixel-wise fidelity and structural similarity, they do not fully capture perceptual HDR quality. Future work should include HDR-VDP-2, TMQI, LPIPS, and other perceptual or HDR-specific metrics. Fourth, the augmentation experiment compares composite configurations rather than isolating the independent effect of each augmentation operation. Fifth, the proposed ResNet50 backbone, attention blocks, SE blocks, mixed-precision training, and learning-rate scheduling are evaluated as an integrated pipeline; their individual contributions require future component-level ablation studies.
These limitations do not invalidate the observed efficiency and SSIM improvements under the adopted setting, but they restrict the generalizability of the conclusions. Therefore, the present results should be viewed as practical evidence for an efficient HDRCNN-derived framework rather than as definitive benchmark evidence across all HDR reconstruction models.
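Because the test setting contains only five images, reporting the spread alongside the mean would make the uncertainty explicit. The sketch below shows one way to do this with the Python standard library. The function name `summarize` is hypothetical, and the scores are placeholder values chosen for illustration, not measurements from this study.

```python
import statistics


def summarize(scores):
    """Return mean, sample standard deviation, and standard error of the mean."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)        # sample standard deviation (n - 1)
    sem = std / len(scores) ** 0.5        # standard error of the mean
    return mean, std, sem


# Placeholder per-image PSNR values in dB; NOT results from the paper.
placeholder_psnr = [20.0, 21.0, 22.0, 23.0, 24.0]
mean, std, sem = summarize(placeholder_psnr)
print(f"PSNR: {mean:.2f} +/- {std:.2f} dB (n={len(placeholder_psnr)}, SEM {sem:.2f} dB)")
```

With so few samples the standard error stays large relative to sub-decibel differences between models, which is one reason a full test split is preferable.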
5. Conclusions and Future Work
In this paper, we presented an engineering-oriented HDR reconstruction framework derived from HDRCNN to improve training efficiency and structural reconstruction quality under practical hardware constraints. The proposed framework combines composite data augmentation configurations, mixed-precision training with cosine annealing learning-rate scheduling, and replacement of the original VGG16 backbone with a ResNet50-based encoder enhanced by attention blocks and squeeze-and-excitation (SE) blocks. The experimental results show that the best composite augmentation configuration improved PSNR and SSIM compared with the unaugmented baseline under the adopted HDRCNN setting. The optimized training pipeline reduced training time, and the ResNet50-based model further reduced the number of parameters and shortened training compared with the VGG16-based HDRCNN. Under the adopted comparison protocol, the proposed model achieved the shortest training time and slightly higher PSNR than HDRUNet, while HDRUNet retained a higher SSIM. This indicates a trade-off among computational efficiency, pixel-wise fidelity, and structural similarity.
However, the present study should be interpreted as a practical preliminary evaluation rather than a definitive benchmark comparison. Several limitations remain. First, the evaluation was conducted on the SI-HDR dataset with a limited five-image testing setting; future work should use a full test split and additional standard HDR benchmarks with mean and standard deviation reporting. Second, the comparison with HDRUNet and ExpandNet followed model-specific training settings rather than a fully unified protocol. A fairer comparison should use the same data split, resolution, epoch setting, and evaluation procedure. Third, the current evaluation used only PSNR and SSIM, which do not fully capture perceptual HDR quality. Future work should include HDR-VDP-2, TMQI, LPIPS, and other perceptual or HDR-specific metrics. Fourth, the augmentation experiment compared composite configurations rather than isolating the independent effect of each operation. Therefore, operation-level ablation studies are needed to clarify the individual roles of unsharp masking, denoising, Gaussian blur, and brightness–contrast adjustment. Finally, the ResNet50 backbone, attention blocks, SE blocks, mixed-precision training, and learning-rate scheduling were evaluated as an integrated pipeline. Future work should conduct component-level ablation studies and compare additional lightweight backbones such as ResNet18 and MobileNet. The current model is also limited to a fixed 256 × 256 output resolution, so future research should support variable and higher-resolution reconstruction. Overall, the proposed framework provides practical evidence that an HDRCNN-derived architecture can improve training efficiency and structural similarity on consumer-grade hardware, but broader benchmark validation and more complete ablation studies are still required.