Efficient HDR Image Reconstruction: A ResNet Approach with Enhanced Data Augmentation

He, Ting-Wei; Chen, Pei-Chi; Chen, Tzung-Her

doi:10.3390/electronics15122595

by Ting-Wei He, Pei-Chi Chen and Tzung-Her Chen *

Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City 60004, Taiwan

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2595; https://doi.org/10.3390/electronics15122595

Submission received: 14 April 2026 / Revised: 8 June 2026 / Accepted: 10 June 2026 / Published: 12 June 2026

(This article belongs to the Special Issue Computer Vision and Image Processing in Machine Learning)

Abstract

High dynamic range (HDR) image reconstruction from a single low dynamic range (LDR) input remains an important problem for computational photography, particularly when practical deployment on consumer-grade hardware is considered. With the increasing availability of hardware supporting HDR, public demand for capturing and viewing HDR images has grown significantly. Recent research has explored deep learning-based approaches to reconstruct HDR images from LDR inputs by extracting regional pixel features or leveraging the camera response function (CRF) for model training. Many of these approaches employ Convolutional Neural Network (CNN) architectures and utilize skip connections to preserve learned information. Nevertheless, the configuration-level effects of data augmentation in HDR reconstruction remain insufficiently discussed. Existing CNN-based approaches, such as HDRCNN, HDRUNet, and ExpandNet, have demonstrated promising reconstruction ability, but they may involve a heavy backbone architecture, a long training time, or a limited discussion of how preprocessing configurations affect reconstruction performance. This study presents an engineering-oriented HDR reconstruction framework derived from HDRCNN, focusing on practical efficiency, structural fidelity, and training feasibility. The proposed framework introduces three modifications: (1) a configuration-level comparison of composite data augmentation settings, including unsharp masking, denoising, Gaussian blur, and brightness–contrast adjustment; (2) the replacement of the original VGG16 backbone with a ResNet50-based encoder enhanced with attention blocks and squeeze-and-excitation (SE) blocks for improved multi-scale feature extraction and channel-wise recalibration; and (3) the integration of mixed-precision training with cosine annealing learning-rate scheduling to reduce computational cost.
Experimental results on the SI-HDR dataset show that the best composite augmentation configuration improves PSNR from 19.05 dB to 22.10 dB and SSIM from 0.6444 to 0.7714 without increasing the training time. Compared with the original VGG16-based HDRCNN setting, the ResNet50-based model reduces training time while improving SSIM from 0.2705 to 0.8512. Under the adopted comparison protocol, the proposed model achieves the shortest training time and slightly higher PSNR than HDRUNet, while HDRUNet retains a higher SSIM. This indicates a trade-off among pixel-wise fidelity, structural similarity, and computational efficiency. The current evaluation is limited by a small test setting, composite rather than operation-level augmentation analysis, and the use of PSNR and SSIM only; therefore, future work should include full benchmark evaluation, additional perceptual/HDR-specific metrics, and controlled component-level ablation studies.

Keywords:

HDR image reconstruction; deep learning; data augmentation; ResNet

1. Introduction

High dynamic range (HDR) imaging has gained significant attention in recent years due to its ability to present images with enhanced realism compared to standard dynamic range (SDR) imaging. HDR enables a greater dynamic range of luminance, accurately representing brightness variations from the most intense highlights to the darkest shadows. The increasing maturity of display technology and consumer demand for high-quality visuals have further driven the adoption of HDR in various camera-equipped devices.

HDR imaging typically involves three primary stages: (1) image acquisition, (2) tone mapping, and (3) HDR display. Each stage has been extensively studied in separate research communities. However, many consumer-grade cameras still lack native HDR support, resulting in the widespread capture of low dynamic range (LDR) images. LDR images typically have an 8-bit color depth, whereas HDR formats generally exceed 10 bits to better preserve luminance information. While LDR images use 24 bits per pixel (8-bit per RGB channel), HDR images often incorporate an additional 8-bit exposure channel (E-channel) for luminance storage, yielding 32-bit pixel representation [1].

To address this limitation, various methods have been proposed to convert LDR images into HDR representations [2]. These methods can be broadly categorized into two approaches:

Transformation-based methods, which apply linear scaling [3] or non-linear functions [4] to enhance dynamic range.
Reconstruction-based methods, which attempt to recover lost information in saturated regions [5,6].

Although these methods produce visually plausible results, most operate with fixed algorithms that struggle to adapt to varying scene conditions. Consequently, visually unnatural artifacts may appear in overexposed regions, reducing the accuracy and perceptual quality of HDR reconstruction when viewed on HDR-enabled displays.

In recent years, artificial intelligence has advanced rapidly and has been widely applied to computational imaging tasks. In the imaging domain, deep learning methods have been applied to improve feature representation and image quality and have more recently been extended to HDR reconstruction tasks [7]. A common deep learning strategy is to extract features from differently exposed regions and learn a mapping to HDR ground truth for image reconstruction. In 2017, Eilertsen et al. proposed the HDRCNN [8] network for HDR image reconstruction, which uses a hybrid dynamic range autoencoder built on a CNN framework [9]. Marnerides et al. introduced ExpandNet [10], which is also based on a CNN architecture. ExpandNet is characterized as a fully automatic, end-to-end model that requires no manually tuned parameters. It comprises three branches that handle different tasks according to image features, such as frequency distribution.

The central idea of FHDR [11] is a feedback mechanism inspired by the architectural designs of other feedback networks. After each iteration, a reconstruction block that follows the feedback mechanism reconstructs the HDR image. Chen et al. proposed HDRUNet [12], which achieved notable results in the NTIRE 2021 competition. They observed that most network models were still ineffective at handling noise and quantization errors. Their architecture consists of several sub-networks, each performing a specific function. These include a base network built on the U-Net architecture of Ronneberger et al. [13], which can be trained with fewer images, and a condition network that integrates spatial feature transform (SFT) layers to provide spatial information adjustments. The condition network takes an LDR image as input and predicts the corresponding conditional maps, which are then used as feature inputs for the base network.

With the rapid proliferation of mobile devices, Liu et al. proposed Mobile-HDR [14], constructing a dataset using three mobile phone cameras to collect paired LDR-HDR images in the raw image domain, covering different noise levels. They then introduced a transformer-based model with a pyramid cross-attention alignment module to aggregate highly correlated features from different exposure frames, performing joint HDR denoising and fusion.

Most of the models described above take a single LDR image as input, whereas another class of models takes multiple images as input and targets dynamic scenes containing moving objects. Among the single-image methods, SingleHDR [15] integrates the imaging pipeline of the LDR image into the model and simulates the HDR-to-LDR transformation, including dynamic range clipping, nonlinear mapping via the camera response function, and quantization. Among the multi-image methods, HDR-GAN [16] is based on the GAN (generative adversarial network) architecture [17], and its network design specifically addresses the ghosting problem in dynamic scenes. Through adversarial learning, it directly optimizes parameters and final results. The network consists of a generator (G) and a discriminator (D), with the input comprising three LDR images of different exposure levels that include motion information. Recent studies on efficient visual representation learning also suggest that multi-scale feature extraction and adaptive fusion are important for balancing accuracy and computational efficiency. For example, Zhang et al. proposed a pyramid-structured multi-scale transformer for efficient semi-supervised video object segmentation with adaptive fusion [18]. Although their task differs from single-image HDR reconstruction, their use of pyramid-structured multi-scale representation and scale-adaptive fusion is conceptually related to the present study, in which ResNet50-based [19] multi-scale feature extraction, attention-guided skip fusion, and SE-based channel recalibration are used to enhance structural information while reducing computational cost. This related work further supports the motivation for exploring efficient feature fusion mechanisms in reconstruction and enhancement tasks.

Based on the above review, this study is motivated by three practical challenges in single-image HDR reconstruction. First, although deep learning-based HDR reconstruction has shown promising results, many consumer-captured images are still LDR images obtained under complex lighting conditions, such as overexposure, underexposure, noise, blur, and local contrast variation. These factors directly affect the reconstruction of saturated regions and structural details, making robust training-data preparation an important issue. However, the configuration-level effects of combined preprocessing and augmentation operations, such as unsharp masking, denoising, Gaussian blur, and brightness–contrast adjustment, remain insufficiently discussed in HDR reconstruction.

Second, practical HDR reconstruction should not only pursue image quality but also consider model complexity, training time, and hardware accessibility. Existing CNN-based models such as HDRCNN, ExpandNet, and HDRUNet have demonstrated the feasibility of HDR reconstruction, but their training requirements or backbone complexity may limit deployment on consumer-grade hardware. This issue is crucial because many real-world imaging applications, including mobile photography, multimedia display, and local image enhancement, require efficient models that can be trained or deployed without high-end computing resources.

Third, the feature extraction backbone plays an important role in balancing representation capacity and computational efficiency. A heavy backbone may improve feature representation but increase parameter count and training cost, whereas an overly lightweight backbone may reduce computational burden but weaken multi-scale structural representation. Therefore, this study investigates a practical HDRCNN-derived framework that combines composite augmentation configuration analysis, a ResNet50-based backbone with attention and SE-based feature recalibration, and hardware-aware training optimization. The goal is not to claim a fundamentally new HDR reconstruction theory, but to examine whether an efficiency-oriented reconstruction pipeline can improve structural similarity and reduce training costs under a limited but reproducible experimental setting.

In response to these issues, this work presents an engineering-oriented modification of HDRCNN and evaluates it as an integrated pipeline. It does not provide a complete component-level ablation study, so the contribution of each individual modification is not isolated.

In this paper, HDRCNN is adopted as the baseline framework because it provides a representative CNN-based architecture for single-image HDR reconstruction. We first compare several composite preprocessing and augmentation configurations to examine how training-data composition affects reconstruction quality. We then integrate mixed-precision training and cosine annealing learning-rate scheduling to improve training efficiency. Finally, we replace the original VGG16 [20] backbone with a ResNet50-based encoder enhanced with attention blocks and squeeze-and-excitation (SE) blocks. This design aims to reduce computational cost while maintaining structural reconstruction quality. Because the present evaluation is limited in scale, the results should be interpreted as practical evidence of an efficient HDRCNN-derived framework rather than as a definitive benchmark comparison across all HDR reconstruction models.

In summary, this paper makes the following contributions:

Composite augmentation configuration analysis: We conduct a configuration-level comparison of different composite preprocessing and augmentation settings for single-image HDR reconstruction under the HDRCNN framework. The best configuration achieves PSNR of 22.10 dB and SSIM of 0.7714 in the adopted experimental setting, suggesting that the composition of training data can influence reconstruction quality. However, this comparison does not isolate the independent effect of each augmentation operation, and operation-level ablation remains necessary in future work.
Hardware-aware training optimization: We integrate mixed-precision training [21] with cosine annealing and validation loss-based learning-rate adjustment into the HDRCNN pipeline. Under the adopted setting, this optimization reduces training time from 11 h 14 min to 7 h 56 min and improves validation stability. This result indicates that hardware-aware training can reduce computational burden, although its independent contribution should be further verified through controlled experiments.
Efficient backbone redesign: We replace the VGG16 encoder in HDRCNN with a ResNet50-based architecture augmented with attention blocks and SE blocks. This modification reduces the number of parameters from approximately 138 M to 25 M and shortens training time in the adopted setting. In the single-crop comparison, the ResNet50-based model improves SSIM from 0.2705 to 0.8512 compared with the VGG16-based HDRCNN setting. In the model-level comparison, the proposed model achieves the shortest training time and slightly higher PSNR than HDRUNet, whereas HDRUNet obtains a higher SSIM. These results indicate a trade-off among computational efficiency, pixel-wise fidelity, and structural similarity. Therefore, the proposed model should be viewed as a practical, efficiency-oriented alternative rather than a universally superior HDR reconstruction model.

It should be noted that the current study has several scope limitations. The evaluation is conducted on the SI-HDR dataset with a limited testing setting, and the comparison with HDRUNet and ExpandNet follows the adopted model-specific configurations rather than a fully unified benchmark protocol. In addition, the augmentation experiment compares composite configurations rather than isolating each individual preprocessing operation, and the evaluation relies on PSNR and SSIM without including HDR-specific or learned perceptual metrics such as HDR-VDP-2, TMQI, or LPIPS. These limitations are explicitly discussed in the experimental analysis and conclusion, and they define the directions for future work.

This paper is organized as follows. Section 2 introduces the preliminary concepts and background knowledge related to HDR image reconstruction and the key techniques referenced in this study. Section 3 presents the proposed scheme, detailing the overall framework, implementation strategies, and optimization methods. Section 4 provides experimental results and discussions, including performance evaluations and comparative analyses of different models and preprocessing techniques. Finally, Section 5 concludes the paper and outlines potential directions for future work.

2. Background and Preliminaries

In this section, we provide a brief overview of HDR technology, inverse tone mapping, and Eilertsen et al.’s scheme [8] with an analysis of its image formulation and the mechanism underlying HDRCNN.

2.1. HDR Image and Inverse Tone Mapping

The concept of high dynamic range (HDR) imaging can be traced back to 1997, when Debevec and Malik [22] introduced a pioneering method for generating HDR radiance maps using multiple low dynamic range (LDR) images captured at varying exposure levels. In their approach, multiple exposures—ranging from 30 s to 1/1000 s—were used to estimate the camera response function and construct radiance maps. By applying inverse nonlinear functions to recover scene irradiance and merging the data, they successfully created HDR images that preserved a wider range of luminance details than any single exposure could capture.

Following the introduction of HDR imaging, many researchers developed algorithms to reconstruct HDR images through various mapping strategies. Notably, in 2006, Banterle et al. proposed an inverse tone mapping method [23], which takes a single low dynamic range image (LDRI) as input. Their approach introduced the inverse tone mapping operator (iTMO), which expands the LDRI based on its luminance values. Additionally, by applying median cut sampling and estimating the light source density from the LDRI, they built an expand map and used it as an interpolation weight to linearly blend the expanded LDRI with the original LDRI, ultimately producing the final HDR image.

2.2. HDRCNN Architecture and Image Formation

In Eilertsen et al.’s method [8], an LDR image is used as the input to an encoder network to transform it into a compact feature representation of the scene. This encoded representation is then mapped back to the logarithmic domain, where an HDR decoder network reconstructs the HDR image, compensating for lost details in overexposed or underexposed regions. Moreover, a skip-connection mechanism is implemented between the LDR encoder and the HDR decoder to preserve and fully utilize high-resolution image details during reconstruction. During training, the dataset is created by sampling from a large collection of existing HDR images.

The image formation model used in HDRCNN is summarized in Equation (1). In this equation, $H_{i,c}$ denotes the reconstructed HDR value at spatial location $i$ and color channel $c$, while $D_{i,c}$ denotes the corresponding input LDR pixel. The function $f^{-1}(\cdot)$ represents the inverse camera response function, which maps the LDR pixel value back to the linear intensity domain. The term $y_{i,c}$ is the CNN prediction in the logarithmic HDR domain, and $\exp(y_{i,c})$ converts this prediction back to the HDR intensity domain. The blending factor $\alpha_i$ is a spatially defined weighting factor that controls the contribution of the linearized LDR value and the CNN-predicted HDR value at location $i$. Specifically, $\alpha_i$ is defined as a linear ramp that begins when the normalized pixel intensity exceeds the saturation threshold $\tau = 0.95$. This $\alpha_i$ should not be confused with the exponent parameter $\alpha$ used later in the SSIM formulation. This formulation allows the network to focus on reconstructing saturated or under-represented regions while preserving reliable information from the original LDR input.

$$H_{i,c} = (1 - \alpha_i)\, f^{-1}(D_{i,c}) + \alpha_i \exp(y_{i,c}) \qquad (1)$$
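The blending in Equation (1) can be illustrated with a minimal NumPy sketch. It is not the HDRCNN implementation: the function names are hypothetical, the inverse camera response is replaced by a simple gamma curve as a stand-in, and the ramp is assumed to be driven by the per-pixel maximum over color channels of the normalized LDR input.

```python
import numpy as np

TAU = 0.95  # saturation threshold stated in the text


def blend_mask(ldr, tau=TAU):
    """Linear ramp alpha_i: 0 below tau, rising to 1 at full saturation.

    Assumption: the ramp uses the per-pixel maximum over color channels.
    """
    peak = ldr.max(axis=-1, keepdims=True)  # shape (H, W, 1)
    return np.clip((peak - tau) / (1.0 - tau), 0.0, 1.0)


def inverse_crf(ldr, gamma=2.0):
    """Hypothetical stand-in for f^{-1}: a plain gamma curve."""
    return np.power(ldr, gamma)


def reconstruct(ldr, log_pred):
    """Equation (1): H = (1 - alpha) * f^{-1}(D) + alpha * exp(y)."""
    alpha = blend_mask(ldr)
    return (1.0 - alpha) * inverse_crf(ldr) + alpha * np.exp(log_pred)


# Two pixels: one well exposed, one fully saturated.
ldr = np.array([[[0.50, 0.40, 0.30], [1.00, 1.00, 1.00]]])
log_pred = np.full_like(ldr, np.log(4.0))  # pretend the CNN predicts 4.0 everywhere
hdr = reconstruct(ldr, log_pred)
```

In this example the well-exposed pixel keeps its linearized LDR value, because its blending factor is zero, while the saturated pixel is taken entirely from the network prediction.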

Although HDRCNN provides a useful baseline for single-image HDR reconstruction, its original implementation relies on a VGG16-based feature extractor, which contains a large number of parameters and may increase computational cost. This motivates the exploration of more efficient backbone architectures. In this study, ResNet50 is selected as a practical compromise between feature representation capacity and parameter efficiency. Compared with VGG16, ResNet50 uses residual connections to improve gradient propagation and reduce degradation in deeper networks. Compared with lighter alternatives such as ResNet18 or MobileNet, ResNet50 provides stronger multi-scale feature representation, which is beneficial for preserving structural information in HDR reconstruction. Nevertheless, a complete comparison with lightweight backbones remains outside the scope of the present work and is identified as future work.

3. The Proposed Scheme

The proposed scheme consists of three engineering-oriented phases: (1) composite dataset preprocessing and augmentation, (2) hardware-aware training optimization, and (3) backbone replacement with attention-based feature recalibration. The architecture diagram is shown in Figure 1.

It should be emphasized that the purpose of this section is to describe a practical HDRCNN-derived reconstruction pipeline rather than to introduce a fundamentally new HDR reconstruction theory. The augmentation settings are designed as composite configurations for configuration-level comparison, not as a complete operation-level ablation study. Similarly, the backbone and training optimization choices are motivated by practical considerations of training efficiency, representation capacity, and deployability on consumer-grade hardware.

3.1. Dataset Preprocessing and Data Augmentation

Data preprocessing and augmentation can influence feature learning and reconstruction quality in deep learning-based HDR reconstruction. The SI-HDR dataset used in this study contains high-quality HDR scenes with exposure variations; however, the available training data remain limited in scale. Therefore, composite augmentation configurations are used to increase the diversity of edge sharpness, noise characteristics, blur level, brightness, and contrast.

The purpose of this experiment is not to isolate the independent contribution of each augmentation operation. Instead, we compare several composite augmentation configurations to examine whether different training-data compositions are associated with changes in HDR reconstruction quality. The results should therefore not be read as a complete ablation study.

3.1.1. Unsharp Masking

The unsharp mask (USM) [24], originally derived from silver halide photography, blurs the image and subtracts the blurred version from the original; the resulting difference is scaled and added back to the original to enhance edge sharpness. In this study, the radius parameter is set to 10 pixels (defining the edge detection range), and the amount parameter is set to 0.5 to sharpen edges while limiting noise amplification.
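As a rough illustration of the operation, the sketch below implements unsharp masking on a single-channel image with NumPy. It is not the filter used in the study: it assumes the radius maps directly to the Gaussian standard deviation, which is tool-specific, and the helper names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    half = int(np.ceil(3 * sigma))
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with edge replication."""
    kernel = gaussian_kernel(sigma)
    half = len(kernel) // 2
    padded = np.pad(img, half, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def unsharp_mask(img, radius=10.0, amount=0.5):
    """sharpened = original + amount * (original - blurred), clipped to [0, 1].

    Assumption: `radius` is used as the Gaussian sigma.
    """
    blurred = gaussian_blur(img, sigma=radius)
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)


# A vertical step edge: dark on the left, bright on the right.
image = np.full((8, 64), 0.25)
image[:, 32:] = 0.75
sharpened = unsharp_mask(image, radius=10.0, amount=0.5)
```

On the step edge, the dark side is pushed darker and the bright side brighter close to the boundary, while flat regions far from the edge are left unchanged.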

3.1.2. Denoising

Inspired by HDRUNet, we apply denoising as part of the data augmentation. We use the GEGL (Generic Graphics Library) denoise filter [25] in GIMP, which effectively reduces random noise caused by image sensors or compression. The strength parameter is set to 5, balancing noise reduction and the preservation of image clarity.

3.1.3. Gaussian Blur

We apply the Gaussian blur filter [26] in GIMP to soften images. The blur parameters are set to 2.5 for both the horizontal (X) and vertical (Y) directions, meaning each pixel’s color value is smoothed based on the surrounding pixels within a 2.5-pixel radius. This operation smooths edges and reduces abrupt intensity transitions.

3.1.4. Brightness–Contrast Adjustment

Brightness and contrast adjustment are crucial for enhancing image visibility. We use GIMP to adjust brightness and contrast, considering different dynamic ranges for different formats. For PNG images, the range is from −127 to 127, while for HDR images it is from −0.5 to 0.5. A proportional adjustment (ratio of 254:1) ensures consistent visual effects across formats.
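The 254:1 ratio follows from the widths of the two ranges: 127 − (−127) = 254 for PNG and 0.5 − (−0.5) = 1 for HDR. A small sketch of the conversion, assuming the proportional adjustment is a plain linear rescaling (the function name is hypothetical):

```python
PNG_RANGE = (-127.0, 127.0)  # adjustment range for PNG images (from the text)
HDR_RANGE = (-0.5, 0.5)      # adjustment range for HDR images (from the text)


def png_to_hdr_adjustment(png_value):
    """Rescale a PNG brightness/contrast setting to the HDR range.

    Assumption: the mapping is linear and both ranges are centered on zero.
    """
    span_png = PNG_RANGE[1] - PNG_RANGE[0]  # 254
    span_hdr = HDR_RANGE[1] - HDR_RANGE[0]  # 1
    return png_value * span_hdr / span_png


hdr_setting = png_to_hdr_adjustment(127.0)
```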

These techniques are intended to improve the model’s training effectiveness and generalization ability. Based on the above preprocessing methods, we prepared three datasets with different preprocessing ratios for comparison.

Dataset 1: 100% original.
Dataset 2: 20% original, 20% unsharp masking, 30% denoising, and 30% blurring.
Dataset 3: 15% original, 15% unsharp masking, 20% denoising, 20% blurring, and 30% brightness and contrast adjustment.

The ratios in the three dataset configurations were selected as practical exploratory settings rather than theoretically optimized proportions. Dataset 1 serves as the unaugmented baseline. Dataset 2 introduces edge enhancement, denoising, and blur to simulate common variations in image sharpness and noise. Dataset 3 further includes brightness–contrast adjustment because luminance variation is central to HDR reconstruction. Since multiple operations and proportions are changed simultaneously, the results should be interpreted as configuration-level evidence rather than as proof of the marginal contribution of any single augmentation operation.
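One way to realize these compositions is sketched below. It assumes that each training image receives exactly one treatment and that the percentages refer to shares of the training images; the text does not state either point, and the identifiers and function name are hypothetical.

```python
import random
from collections import Counter

# Shares taken from the three dataset definitions above.
DATASET_CONFIGS = {
    "dataset_1": {"original": 1.00},
    "dataset_2": {"original": 0.20, "unsharp": 0.20,
                  "denoise": 0.30, "blur": 0.30},
    "dataset_3": {"original": 0.15, "unsharp": 0.15, "denoise": 0.20,
                  "blur": 0.20, "brightness_contrast": 0.30},
}


def assign_operations(image_ids, ratios, seed=0):
    """Shuffle the images, then hand out operations in proportion to `ratios`."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    assignment, start, cumulative = {}, 0, 0.0
    for operation, share in ratios.items():
        cumulative += share
        end = round(cumulative * len(ids))  # cumulative cut avoids drift
        for image_id in ids[start:end]:
            assignment[image_id] = operation
        start = end
    return assignment


image_ids = [f"img_{i:03d}" for i in range(100)]  # hypothetical identifiers
plan = assign_operations(image_ids, DATASET_CONFIGS["dataset_3"])
counts = Counter(plan.values())
```

Using cumulative cut points rather than rounding each share separately keeps the group sizes summing to the number of images.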

3.2. Training Code Optimization

During training, HDR reconstruction networks require repeated convolutional operations and large-scale floating-point computation. To improve practical training efficiency, mixed-precision training is integrated into the HDRCNN pipeline. In this setting, FP16 computation is used for selected forward and backward operations to reduce memory usage and computational cost, while FP32 precision is retained for numerically sensitive operations such as gradient accumulation and weight updates. This design is intended to benefit from FP16 acceleration on modern GPUs while maintaining training stability.
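The reason for keeping weight updates in FP32 can be seen in a small numerical sketch. This is only an illustration with NumPy scalars, not the training code of this study; the weight and update values are arbitrary.

```python
import numpy as np

weight = 1.0
update = 1e-4  # a small learning-rate-scaled gradient step (arbitrary value)

# Pure FP16: representable values just below 1.0 are about 4.9e-4 apart,
# so each result rounds back to 1.0 and the weight never moves.
w16 = np.float16(weight)
for _ in range(100):
    w16 = np.float16(w16 - np.float16(update))

# FP32 master copy: the same FP16 updates accumulate as expected.
w32 = np.float32(weight)
for _ in range(100):
    w32 = np.float32(w32 - np.float32(np.float16(update)))
```

After 100 steps the FP16 weight is still 1.0, whereas the FP32 copy has moved to roughly 0.99, which is why mixed-precision schemes typically keep an FP32 copy of the weights.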

In addition, cosine annealing learning-rate scheduling is adopted to gradually reduce the learning rate during training, allowing larger updates in the early stage and more stable refinement in the later stage. A validation loss-based learning-rate reduction is also used when the validation loss does not improve for consecutive epochs. These training choices are motivated by practical convergence stability and computational efficiency. However, the present study evaluates them as part of the overall optimized pipeline; their independent effects are not isolated and should be examined in future controlled experiments.
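The two scheduling rules can be sketched in plain Python as follows. The initial learning rate (0.01), the number of epochs (200), the patience (10 epochs), and the reduction to 10% are the values reported in the experimental settings of Section 4; the minimum learning rate of zero and the way the two rules are combined (by multiplication) are assumptions, and the names are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_max=0.01, lr_min=0.0):
    """Cosine annealing: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term


class PlateauScale:
    """Shrinks a multiplier by `factor` after `patience` epochs without improvement."""

    def __init__(self, patience=10, factor=0.1):
        self.patience = patience
        self.factor = factor
        self.best = math.inf
        self.bad_epochs = 0
        self.scale = 1.0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.scale *= self.factor
                self.bad_epochs = 0
        return self.scale


plateau = PlateauScale()
plateau.step(1.0)          # first epoch sets the best validation loss
for _ in range(10):        # ten epochs with no improvement
    plateau.step(1.0)
lr_mid = cosine_lr(100) * plateau.scale  # assumed combination of both rules
```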

3.3. Model Architecture Replacement

The original HDRCNN model adopts a VGG16-based feature extraction architecture. Although VGG16 provides a straightforward and widely used convolutional backbone, it contains a large number of parameters and may increase computational cost. Therefore, this study replaces the VGG16 encoder with a ResNet50-based encoder as a practical compromise between representation capacity and parameter efficiency.

ResNet50 is selected because its residual connections facilitate gradient propagation and help mitigate degradation in deeper networks. Compared with VGG16, ResNet50 provides deeper hierarchical feature extraction with fewer parameters. Compared with lighter alternatives such as ResNet18 or MobileNet, ResNet50 may preserve stronger multi-scale representation capacity, which is important for reconstructing structural details in HDR images. Nevertheless, this study does not claim that ResNet50 is the optimal backbone among all possible lightweight models; a systematic comparison with ResNet18, MobileNet, and other efficient backbones remains for future work (Table 1).

The modified architecture further incorporates attention blocks, SE blocks, and skip-connection-based feature fusion. Attention blocks are inserted at decoder–encoder fusion stages because these stages combine high-level semantic features from the decoder with spatial details from the encoder. The attention mechanism is therefore used to emphasize spatially relevant features before fusion. SE blocks are placed after decoder convolutional blocks to recalibrate channel-wise responses after feature aggregation, allowing the network to adjust the relative importance of different feature channels. Finally, channel-wise concatenation is used to fuse decoder features with encoder skip-connection outputs, preserving multi-scale spatial information that is important for HDR reconstruction.
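The fusion and recalibration steps can be sketched with NumPy on random feature maps. This is a schematic illustration under several assumptions, not the architecture of this study: the spatial attention gate is a hypothetical stand-in (the text does not specify its form), the channel sizes and reduction ratio are arbitrary, the weights are random rather than learned, and the decoder convolution between concatenation and the SE block is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(decoder_feat, encoder_feat):
    """Hypothetical gate: one weight in (0, 1) per spatial location."""
    return sigmoid((decoder_feat + encoder_feat).mean(axis=0, keepdims=True))


def fuse_skip(decoder_feat, encoder_feat, gate):
    """Weight the encoder skip features, then concatenate along channels."""
    return np.concatenate([decoder_feat, encoder_feat * gate], axis=0)


def se_block(features, w_reduce, w_expand):
    """Squeeze-and-excitation on a (C, H, W) feature map."""
    squeezed = features.mean(axis=(1, 2))           # squeeze: global average pool
    hidden = np.maximum(w_reduce @ squeezed, 0.0)   # bottleneck + ReLU
    gates = sigmoid(w_expand @ hidden)              # one weight per channel
    return features * gates[:, None, None], gates


rng = np.random.default_rng(0)
channels, reduction = 8, 4                          # arbitrary sizes
encoder_feat = rng.standard_normal((channels, 4, 4))
decoder_feat = rng.standard_normal((channels, 4, 4))

gate = spatial_attention(decoder_feat, encoder_feat)
fused = fuse_skip(decoder_feat, encoder_feat, gate)  # 2 * channels after concat
w_reduce = rng.standard_normal((2 * channels // reduction, 2 * channels))
w_expand = rng.standard_normal((2 * channels, 2 * channels // reduction))
recalibrated, gates = se_block(fused, w_reduce, w_expand)
```

The sketch only shows the data flow: spatial weighting happens before concatenation, and channel weighting happens after the features have been aggregated.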

These design choices are motivated by common architectural principles in image restoration and enhancement networks. However, the present experiments evaluate the ResNet50, attention block, and SE block combination as an integrated architecture. The independent contribution of each component is not isolated in this study and should be examined through future component-level ablation.

The modified HDRCNN architecture is summarized in Table 2. It consists of a ResNet50-based encoder, decoder blocks with SE-based channel recalibration, attention-guided skip fusion, and a final convolutional reconstruction layer. Compared with the original VGG16-based HDRCNN, this design reduces the number of parameters and introduces feature recalibration mechanisms. The purpose is to improve structural feature preservation while maintaining practical training efficiency.

4. Experimental Results and Discussion

This section presents the quantitative and qualitative results obtained from the proposed preprocessing, training optimization, and architecture modification strategies. The SI-HDR dataset [27] is adopted as the experimental dataset. It contains 183 HDR images captured using a Canon 5D Mark III camera (Canon Inc., Tokyo, Japan) and includes natural landscapes, urban scenes, and day/night variations. Multiple exposure levels of RAW images were merged into HDR images using an estimator that accounts for photon noise, and the dataset provides camera response function information for simulating the camera imaging pipeline.

Following the HDRCNN-based experimental setting adopted in this study, five images are randomly selected for testing. We acknowledge that this is a limited testing setting and does not constitute a full benchmark evaluation. Therefore, the reported quantitative results should be interpreted as preliminary evidence under the adopted experimental protocol rather than as statistically comprehensive performance estimates. Future work should evaluate the model on a full test split and additional standard HDR benchmarks, reporting mean and standard deviation across a larger testing set.

To improve the clarity and reproducibility of the experimental setup, the main training settings are summarized here. Unless otherwise specified, Dataset 3 is used as the main training dataset for the optimized HDRCNN and the proposed ResNet50-based model. For these experiments, the batch size is set to 16, the number of training epochs is set to 200, and the initial learning rate is set to 0.01. The optimized training pipeline applies mixed-precision training, cosine annealing learning-rate scheduling, and validation loss-based learning-rate reduction. Specifically, the learning rate is further reduced to 10% of its current value when the validation loss does not improve for 10 consecutive epochs. The image input and output resolution of the model is fixed at 256 × 256 pixels. PSNR and SSIM are used as the quantitative evaluation metrics. For the comparison models, HDRUNet is trained using a batch size of 4, 200 epochs, and a learning rate of 0.01, while ExpandNet is trained using a batch size of 12, 10,000 epochs, and a learning rate of 0.00007, following their documented model-specific settings. Therefore, these settings should be interpreted as model-specific training configurations rather than a fully unified benchmark protocol.

HDRCNN is used as the baseline model, while ExpandNet and HDRUNet are included as reference models for comparison. The image quality is evaluated using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), defined in Equation (2) and Equation (3), respectively. PSNR mainly reflects pixel-wise reconstruction fidelity, whereas SSIM evaluates luminance, contrast, and structural similarity. Therefore, these two metrics capture different aspects of HDR reconstruction quality. We acknowledge that PSNR and SSIM alone do not fully represent perceptual HDR quality. HDR-specific and learned perceptual metrics, such as HDR-VDP-2, TMQI, and LPIPS, are not included in the current evaluation. Consequently, the results should be interpreted as a distortion and structural-similarity analysis rather than as a complete perceptual HDR quality assessment.

\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_{I}^{2}}{\mathrm{MSE}} \right) = 20 \log_{10} \left( \frac{\mathrm{MAX}_{I}}{\sqrt{\mathrm{MSE}}} \right)

(2)

\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}

(3)

In Equation (2), MAX_I represents the maximum possible pixel value of the image, and MSE denotes the mean squared error between the reference and reconstructed images. In Equation (3), l, c, and s denote the luminance, contrast, and structure components, respectively, computed from two image windows of equal size extracted from corresponding locations in the compared images. The parameters α, β, and γ are positive constants that adjust the relative importance of each component.
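The two metrics can be sketched for a single pair of windows as follows. This is an illustrative sketch, not the evaluation code used in this study. It assumes pixel values normalized to [0, 1], population statistics over one window (practical implementations usually average over many sliding, often Gaussian-weighted, windows), and the commonly used stabilizing constants k1 = 0.01 and k2 = 0.03 with C3 = C2/2. The function names are hypothetical.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def psnr(reference, estimate, max_value=1.0):
    """Equation (2) for two equal-length lists of pixel values."""
    mse = _mean([(r - e) ** 2 for r, e in zip(reference, estimate)])
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def ssim_window(x, y, max_value=1.0, alpha=1.0, beta=1.0, gamma=1.0,
                k1=0.01, k2=0.03):
    """Equation (3) for one pair of windows given as flat pixel lists."""
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = _mean(x), _mean(y)
    var_x = _mean([(v - mu_x) ** 2 for v in x])
    var_y = _mean([(v - mu_y) ** 2 for v in y])
    cov_xy = _mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    # The structure term can be negative; exponents other than 1 then need care.
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return (luminance ** alpha) * (contrast ** beta) * (structure ** gamma)
```

With α = β = γ = 1, the product of the three components collapses to the familiar two-factor SSIM expression, and a window compared with itself yields a value of 1.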

4.1. Experimental Results of Dataset Preprocessing and Data Augmentation

This experiment compares the configuration-level effects of different composite preprocessing strategies on HDRCNN training. Three dataset configurations are evaluated (Table 3): Dataset 1 serves as the unaugmented baseline, Dataset 2 introduces unsharp masking, denoising, and Gaussian blur, and Dataset 3 further incorporates brightness–contrast adjustment. Each configuration was trained three times under identical hyperparameter settings, and the best-performing run, determined by minimum validation loss, is reported for training loss, validation loss, PSNR, and SSIM.

It should be emphasized that this experiment is a composite configuration comparison rather than a fully controlled operation-level ablation study. Because multiple augmentation operations and ratios are changed simultaneously, the independent contribution of each augmentation technique cannot be isolated from the present results. Therefore, the following analysis focuses on the overall behavior of each dataset configuration rather than making causal claims about individual preprocessing operations.

The results show that Dataset 3 achieves the best PSNR and SSIM among the three tested configurations and also obtains lower training and validation losses. Notably, the computational efficiency, as measured by total training duration, remains comparable across all three dataset configurations, indicating that the quality improvements were achieved without additional computational overhead.

The experimental results presented in Table 3 indicate that the choice of preprocessing configuration has a substantial effect on HDRCNN performance. The following observations are made at the configuration level and should not be interpreted as isolated effects of individual augmentation operations.

Unsharp Masking (USM): USM is intended to enhance edge information in the training data. Comparing Dataset 2 and Dataset 3, the latter achieves a gain of 3.70 dB in PSNR and 0.1809 in SSIM. However, because multiple augmentation proportions differ simultaneously between these two configurations, the present results do not isolate the individual contribution of USM. Instead, they suggest that a lower USM proportion, when combined with other augmentation adjustments, is associated with improved reconstruction quality under the current experimental setting.
Denoising: Denoising augmentation is expected to improve robustness to noise in real-world LDR inputs. Although Dataset 3 contains a smaller proportion of denoised samples than Dataset 2, it achieves a 50.5% reduction in validation loss and a 3.70 dB improvement in PSNR. These results suggest that denoising can be effective as part of a broader augmentation configuration, but its standalone contribution cannot be isolated from the present experiments.
Blur: Gaussian blur is used to simulate real-world capture imperfections such as motion blur and optical aberrations. Compared with Dataset 2, Dataset 3 uses a lower blur proportion yet achieves a 42.6% lower training loss and a 0.1809 higher SSIM. This indicates that the augmentation composition in Dataset 3 is more effective overall, although the current results are insufficient to determine the marginal effect of blur alone.
Brightness and Contrast Adjustment: Dataset 3, which includes brightness–contrast adjustment as part of its composite configuration, achieves the best PSNR and SSIM among the three settings. Relative to Dataset 1, Dataset 3 improves PSNR from 19.05 dB to 22.10 dB and SSIM from 0.6444 to 0.7714 without increasing training time. Relative to Dataset 2, it also improves PSNR and SSIM. These results suggest that the overall Dataset 3 composition is more suitable for the adopted HDRCNN training setting. However, because brightness–contrast adjustment is introduced together with changes in the ratios of other operations, its independent effect cannot be confirmed from the present experiment.
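To make the operations named in this list concrete, the following sketch applies Gaussian blur, unsharp masking, and brightness–contrast adjustment to a small grayscale image stored as a list of rows with values in [0, 1]. It is a simplified illustration under stated assumptions, not the preprocessing code used to build Datasets 2 and 3: the 3 × 3 kernel, the `amount`, `brightness`, and `contrast` parameters, and all function names are hypothetical, and denoising is omitted.

```python
def clamp(value, lo=0.0, hi=1.0):
    return max(lo, min(hi, value))

# 3 x 3 binomial approximation of a Gaussian kernel; the weights sum to 16.
GAUSS_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

def gaussian_blur(img):
    """Blur a 2D list of floats, replicating border pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += GAUSS_3X3[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out

def unsharp_mask(img, amount=1.0):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = gaussian_blur(img)
    return [[clamp(p + amount * (p - b)) for p, b in zip(row, blurred_row)]
            for row, blurred_row in zip(img, blurred)]

def brightness_contrast(img, brightness=0.0, contrast=1.0):
    """Scale contrast around mid-gray (0.5), then shift brightness."""
    return [[clamp(contrast * (p - 0.5) + 0.5 + brightness) for p in row]
            for row in img]
```

On a vertical step edge such as `[[0.2, 0.2, 0.8, 0.8]] * 4`, `unsharp_mask` pushes the pixels next to the edge below 0.2 and above 0.8, which is the edge-enhancement effect the USM item refers to.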

Furthermore, the experimental results suggest that the composite configuration used in Dataset 3 is more effective than the other tested configurations under the adopted setting.

Training and Validation Loss: Dataset 3 achieved the lowest training loss (0.0159) and validation loss (0.0196), indicating that this configuration enables more effective model optimization and better generalization to unseen data. Notably, Dataset 3 achieved a substantial reduction in validation loss, outperforming Dataset 1 and Dataset 2 by approximately 91% and 51%, respectively.
Image Quality Metrics: The results for Dataset 3 are further supported by image quality metrics, with the highest PSNR (22.10 dB) and SSIM (0.7714). Compared to Dataset 1 (PSNR: 19.05 dB, SSIM: 0.6444), Dataset 3 provides substantial improvements in both objective quality measures.
Computational Efficiency and Comparative Summary: Dataset 3 achieves better reconstruction results among the tested configurations while maintaining the same training time as Dataset 1 (7 h 41 min), suggesting that the quality improvements are associated with a more effective augmentation configuration rather than increased computational cost. Table 4 summarizes the relative performance changes in each dataset with respect to the baseline, providing a consolidated comparison of the augmentation configurations.

As shown in Table 4, Dataset 3 reduces training loss by 79.5% and validation loss by 90.7% relative to the baseline Dataset 1, while improving PSNR by 3.05 dB and SSIM by 0.127. Importantly, these gains are achieved without any increase in training time. These results indicate that the augmentation configuration adopted in Dataset 3 is more effective overall, although the isolated contribution of each individual preprocessing operation requires further ablation study.

4.2. Experimental Evaluation of Training Optimization

Dataset 3 was employed as the baseline training dataset with standardized hyperparameters: batch size of 16, 200 training epochs, and an initial learning rate of 0.01. We implemented two key optimization strategies to enhance computational efficiency while preserving model performance:

Dynamic Learning Rate Scheduling: The standard fixed learning rate approach was replaced with a cosine annealing schedule that gradually decreased from an initial value of 0.01 throughout the training process. Additionally, the learning rate was further reduced to 10% of its current value whenever the validation loss failed to decrease significantly over 10 consecutive epochs. This adaptive strategy improved convergence stability, as reflected by the smoother validation loss curve shown in Figure 2.
Mixed-Precision Training: By leveraging 16-bit floating-point operations for forward passes while maintaining 32-bit precision for gradient accumulation and weight updates, we substantially reduced memory requirements and computational overhead. This technique was particularly effective for the memory-intensive encoder layers while preserving numerical stability.
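The reason for keeping weight updates in higher precision can be illustrated numerically with the standard library alone. The sketch below rounds values to IEEE 754 half precision through `struct` and uses Python's double-precision floats as a stand-in for the 32-bit master copy. The loss-scaling step is a technique commonly paired with mixed-precision training [21]; whether it was used in this study is not stated, so it is shown only as an assumption. The specific values (1e-4, 1e-8, 1024) are illustrative.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# 1) A small update is lost when the weight itself is kept in FP16,
#    because the spacing of FP16 values near 1.0 is about 9.8e-4.
weight, update = 1.0, 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
master_result = weight + update  # higher-precision master copy keeps the update

# 2) A tiny gradient underflows to zero in FP16 unless it is scaled up first.
grad = 1e-8
loss_scale = 1024.0  # hypothetical scale factor
underflowed = to_fp16(grad)
rescued = to_fp16(grad * loss_scale) / loss_scale  # unscale in higher precision
```

Here `fp16_result` stays at 1.0 while `master_result` retains the update, and `underflowed` is zero while `rescued` recovers the gradient to within about one percent.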

The results in Table 5 and Figure 2 indicate that the optimized training pipeline improved training efficiency and convergence behavior under the adopted setting.

Training Time Reduction: Overall training time decreased from 11 h 14 min to 7 h 56 min, a reduction of approximately 29%.
Convergence Quality: The optimized implementation achieved a slightly lower final training loss (0.0159 vs. 0.0162) and a substantially lower validation loss (0.0196 vs. 0.0408, a reduction of approximately 52%), which suggests improved generalization.
Training Stability: The loss curves in Figure 2 indicate that the optimized version converged faster initially (note the steeper decline in the first 25 epochs) and largely suppressed the validation loss fluctuations present in the unoptimized version, suggesting more stable gradient updates throughout training.

The reduced training time and lower validation loss indicate that the optimized training pipeline is beneficial under the adopted HDRCNN setting. However, because mixed-precision training, cosine annealing, and validation loss-based learning-rate reduction are applied together, the present experiment does not isolate the independent contribution of each optimization component. Therefore, the results should be interpreted as evidence for the effectiveness of the overall training optimization pipeline rather than as component-level ablation evidence.
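For transparency, the relative changes quoted for Table 5 can be recomputed from the values stated in the text. The helper names below are hypothetical, and the inputs are the training times and validation losses reported above.

```python
def to_minutes(hours, minutes):
    """Convert a duration given as hours and minutes to minutes."""
    return 60 * hours + minutes

def relative_reduction(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return 100.0 * (before - after) / before

# Training time: 11 h 14 min before optimization, 7 h 56 min after.
time_saving = relative_reduction(to_minutes(11, 14), to_minutes(7, 56))

# Validation loss: 0.0408 before optimization, 0.0196 after.
val_loss_saving = relative_reduction(0.0408, 0.0196)
```

These evaluate to roughly 29% for training time and 52% for validation loss.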

4.3. Changing Model Architecture

Using Dataset 3 as the training dataset, identical hyperparameters were maintained across experiments: batch size of 16, 200 epochs, and a learning rate of 0.01. Both the original VGG16 and proposed ResNet50-based models incorporated dynamic learning rate adjustment and mixed-precision training. The training time and final loss values after architectural replacement with ResNet50 are presented in Table 6, and the corresponding training and validation loss curves are shown in Figure 3. The experimental results in Table 6 and Figure 3 show that the ResNet50-based model reduces the number of parameters and shortens the training time under the adopted setting. The model also achieves higher PSNR and SSIM than the VGG16-based HDRCNN setting. However, Table 6 and Figure 3 also show that the ResNet50-based model has higher final training and validation losses than the VGG16-based model. This discrepancy indicates that the training loss and the evaluation metrics capture different aspects of reconstruction quality and should be interpreted carefully.

The apparent discrepancy between higher loss values and improved PSNR/SSIM can be explained by the difference between the optimization objective and the evaluation metrics. The training loss directly reflects the adopted pixel-level reconstruction objective, whereas SSIM emphasizes luminance, contrast, and structural similarity. The ResNet50-based architecture, together with attention-guided feature fusion and SE-based channel recalibration, may preserve structural patterns more effectively even when the final loss value is higher. Therefore, a higher loss does not necessarily imply inferior structural reconstruction quality.

Nevertheless, this divergence also reveals a limitation of the current loss design. The adopted loss function may not be fully aligned with perceptual or structural quality metrics. Future work should consider SSIM-aware, perceptual, or HDR-specific loss functions to better align the training objective with the desired reconstruction quality.
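A small numerical example illustrates why a pixel-level objective and SSIM can disagree. It assumes a mean-squared-error style objective for illustration only (the loss used in this study is defined elsewhere in the paper) and a single-window SSIM with α = β = γ = 1 and the commonly used constants. The signal values and names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def mse(x, y):
    return mean([(a - b) ** 2 for a, b in zip(x, y)])

def ssim_global(x, y, max_value=1.0):
    """Single-window SSIM with unit exponents (two-factor form)."""
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean([(v - mu_x) ** 2 for v in x])
    var_y = mean([(v - mu_y) ** 2 for v in y])
    cov = mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

reference = [0.4, 0.4, 0.6, 0.6]
# Distortion A: uniform brightness shift (structure unchanged).
shifted = [v + 0.1 for v in reference]
# Distortion B: alternating error of the same magnitude (structure disturbed).
jittered = [v + d for v, d in zip(reference, [0.1, -0.1, 0.1, -0.1])]

mse_shifted, mse_jittered = mse(reference, shifted), mse(reference, jittered)
ssim_shifted = ssim_global(reference, shifted)
ssim_jittered = ssim_global(reference, jittered)
```

Both distortions have the same mean squared error (0.01, hence the same PSNR), yet the single-window SSIM is roughly 0.98 for the uniform shift and roughly 0.68 for the alternating error. A model can therefore differ in pixel-level loss and in SSIM in opposite directions.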

In this experiment, the ResNet50-based architecture shows practical advantages while maintaining identical hyperparameters. Training time was reduced by approximately 39% (from 7 h 56 min to 4 h 51 min), which is consistent with ResNet50 containing substantially fewer parameters (~25 M) than VGG16 (~138 M), a reduction of more than 80% in model size.

The ResNet50-based model achieved higher PSNR and SSIM than the VGG16-based HDRCNN setting under the adopted protocol:

PSNR improved from 13.01 dB to 16.08 dB (3.07 dB gain).
SSIM increased dramatically from 0.2705 to 0.8512.
The narrower gap between training and validation loss (0.0446/0.0437) compared with VGG16 (0.0159/0.0196) may indicate better generalization, although the absolute loss values of the ResNet50-based model are higher.

The observed improvement in PSNR and SSIM may be associated with the following architectural characteristics of ResNet50:

Residual connections that facilitate gradient flow during backpropagation;
Deeper hierarchical feature extraction that may support multi-scale representation;
Attention-guided skip fusion and SE-based recalibration that may help preserve structural information.

These interpretations provide possible explanations for the observed results. However, because the ResNet50 backbone, attention blocks, and SE blocks are evaluated as an integrated architecture, the individual contribution of each component cannot be isolated in the present experiment.
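The channel-wise recalibration performed by a squeeze-and-excitation block can be sketched without any framework. The example below follows the generic squeeze, excitation, and scale steps; it is not the block as integrated in the proposed encoder, whose placement and reduction ratio are not detailed in this section. Biases are omitted, and the weight matrices and function name are hypothetical.

```python
import math

def se_recalibrate(feature_maps, w_reduce, w_expand):
    """Apply squeeze-and-excitation to C feature maps (each an H x W list).

    w_reduce has shape (C/r) x C and w_expand has shape C x (C/r).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

# Two 2 x 2 channels, reduced to one hidden unit (illustrative weights).
maps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = se_recalibrate(maps, w_reduce=[[1.0, 1.0]],
                               w_expand=[[0.0], [1.0]])
```

In this toy configuration the first channel is attenuated by a gate of 0.5, while the second channel receives a gate close to 1, showing how the block reweights channels according to their pooled responses.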

As shown in Figure 4, the improved PSNR and SSIM of the ResNet50-based model under the adopted setting are accompanied by visible differences in high dynamic range areas, edge preservation, and color fidelity across different illumination levels.

This architectural modification suggests that network design choices can affect HDR reconstruction performance beyond parameter tuning and may provide an efficient solution for improving reconstruction quality while requiring fewer computational resources.

4.4. Comparison with Other Model Architectures

Dataset 3 was also used to train HDRUNet and ExpandNet as reference models. HDRUNet and ExpandNet were trained using their documented model-specific hyperparameter settings rather than a fully unified training protocol. Specifically, HDRUNet used a batch size of 4, 200 epochs, and a learning rate of 0.01, while ExpandNet used a batch size of 12, 10,000 epochs, and a learning rate of 0.00007.

This comparison is intended to provide a practical reference for training cost and reconstruction tendencies under commonly used or documented model-specific settings. It should not be interpreted as a fully controlled same-dataset, same-epoch, same-resolution benchmark comparison. A rigorous fair comparison would require retraining all models under a unified protocol, full test split, and additional perceptual/HDR-specific metrics, which is left for future work.

Figure 5 presents the visual reconstruction results corresponding to the quantitative metrics reported in Table 7. The qualitative comparison in Figure 5 should also be interpreted cautiously. HDRUNet and ExpandNet may enhance background regions more strongly in some cases, which can be advantageous when background visibility is prioritized. However, stronger background enhancement may also affect foreground contrast or local structural consistency. The proposed model tends to preserve structural similarity more effectively according to SSIM, but this does not imply that it is visually preferable in all regions or all application scenarios. This observation further supports the need for additional perceptual and HDR-specific evaluation metrics in future work.

The primary objective of this comparison was to evaluate the modified HDRCNN architecture against reference models trained with the model-specific settings documented in their respective publications. This approach allows each reference model to be assessed under its documented configuration rather than under potentially suboptimal unified hyperparameters. The training time differs considerably: HDRUNet required over three days, whereas the proposed model converged in under five hours under the adopted comparison protocol. This disparity is consistent with the efficiency gains targeted by our modifications, although the differing training settings may also contribute to it. Table 7 shows a clear trade-off among the models compared. HDRUNet achieves the highest SSIM, indicating stronger structural similarity under the adopted metric. In contrast, the proposed model achieves the shortest training time and a slightly higher PSNR than HDRUNet. Therefore, the proposed model should not be interpreted as structurally or universally superior to HDRUNet or ExpandNet. Rather, it represents an efficiency-oriented alternative that prioritizes reduced training cost while retaining competitive pixel-wise fidelity.

The exceptionally high epoch count for ExpandNet follows the original authors’ implementation, where convergence was demonstrated to require significantly more iterations due to the network’s unique architecture and extremely small learning rate (0.00007). This configuration is necessary to achieve the reported performance in the original publication. Our experiments with reduced epoch counts for ExpandNet resulted in substantially degraded reconstruction quality, which supports the need for extended training despite its computational cost. The stark difference in training requirements between models (10,000 epochs for ExpandNet versus 200 epochs for the HDRCNN-based models) further emphasizes our contribution: developing an HDR reconstruction architecture with reduced computational demands under the adopted comparison protocol, thus expanding practical applicability in resource-constrained environments.

The observed divergence, namely a lower SSIM (0.5856 vs. 0.661 for HDRUNet) but a slightly higher PSNR (21.72 dB vs. 21.25 dB), indicates that under the adopted protocol the proposed model is marginally stronger in pixel-wise accuracy and weaker in structural similarity than HDRUNet. Several considerations are relevant when interpreting this pattern.

The modified HDRCNN architecture is intended to support structural preservation through multi-scale feature fusion, attention mechanisms, and channel-wise recalibration. Relative to the VGG16-based HDRCNN, this intent is reflected in the higher SSIM reported in Section 4.3. Relative to HDRUNet, however, the SSIM result shows that the design does not yet match HDRUNet in structural similarity.

SSIM is generally regarded as correlating more closely with human visual perception than PSNR because it evaluates structural similarity, contrast, and luminance patterns. In HDR reconstruction, such perceptual fidelity is particularly important, since the primary objective is to produce images that faithfully represent the full dynamic range as perceived by human observers. The lower SSIM of the proposed model should therefore be regarded as a limitation relative to HDRUNet rather than as a perceptual advantage. At the same time, small differences in either metric should be read with caution in HDR reconstruction, where tone mapping can substantially alter individual pixel values while maintaining, or even enhancing, perceptual quality.

Furthermore, this metric divergence is consistent with prior discussions on the trade-off between distortion-based and perceptual quality measures in image restoration. For instance, Blau and Michaeli [29] established a theoretical foundation for the trade-off between distortion measures (such as PSNR) and perceptual quality, and Zhang et al. [28] showed that learned deep-feature metrics can agree with human judgments more closely than PSNR or SSIM. However, because this study evaluates only PSNR and SSIM, additional HDR-specific and learned perceptual metrics are still required to support stronger perceptual-quality claims. In practical applications of HDR imaging, such as computational photography, multimedia display, and visual content creation, structural similarity captured by SSIM can provide complementary information to pixel-wise reconstruction accuracy measured by PSNR. Whether the attention and recalibration modules of the proposed model lead to more natural tone mapping and detail preservation in high-contrast regions remains to be verified with such metrics.

To further illustrate the trade-off between reconstruction quality and training efficiency, Figure 6 visualizes the relationship between training time, PSNR, and SSIM for HDRUNet, ExpandNet, and the proposed model. The proposed model requires the shortest training time and achieves the highest PSNR among the compared models, whereas HDRUNet obtains the highest SSIM but requires substantially longer training time. This result suggests that the proposed model provides a computationally efficient alternative, while HDRUNet remains stronger in terms of structural similarity under the adopted comparison protocol.

4.5. Limitations of the Experimental Evaluation

The experimental results should be interpreted with several limitations in mind. First, the testing setting is limited to five randomly selected images from the SI-HDR dataset; therefore, the results do not represent a full benchmark evaluation. Second, the comparison with HDRUNet and ExpandNet follows model-specific training configurations rather than a fully unified comparison protocol. Third, the evaluation uses PSNR and SSIM only. Although these metrics are useful for measuring pixel-wise fidelity and structural similarity, they do not fully capture perceptual HDR quality. Future work should include HDR-VDP-2, TMQI, LPIPS, and other perceptual or HDR-specific metrics. Fourth, the augmentation experiment compares composite configurations rather than isolating the independent effect of each augmentation operation. Fifth, the proposed ResNet50 backbone, attention blocks, SE blocks, mixed-precision training, and learning-rate scheduling are evaluated as an integrated pipeline; their individual contributions require future component-level ablation studies.

These limitations do not invalidate the observed efficiency and SSIM improvements under the adopted setting, but they restrict the generalizability of the conclusions. Therefore, the present results should be viewed as practical evidence for an efficient HDRCNN-derived framework rather than as definitive benchmark evidence across all HDR reconstruction models.

5. Conclusions and Future Work

In this paper, we presented an engineering-oriented HDR reconstruction framework derived from HDRCNN to improve training efficiency and structural reconstruction quality under practical hardware constraints. The proposed framework combines composite data augmentation configurations, mixed-precision training with cosine annealing learning-rate scheduling, and replacement of the original VGG16 backbone with a ResNet50-based encoder enhanced by attention blocks and squeeze-and-excitation (SE) blocks. The experimental results show that the best composite augmentation configuration improved PSNR and SSIM compared with the unaugmented baseline under the adopted HDRCNN setting. The optimized training pipeline reduced training time, and the ResNet50-based model further reduced the number of parameters and shortened training compared with the VGG16-based HDRCNN. Under the adopted comparison protocol, the proposed model achieved the shortest training time and slightly higher PSNR than HDRUNet, while HDRUNet retained a higher SSIM. This indicates a trade-off among computational efficiency, pixel-wise fidelity, and structural similarity. However, the present study should be interpreted as a practical preliminary evaluation rather than a definitive benchmark comparison.

Several limitations remain. First, the evaluation was conducted on the SI-HDR dataset with a limited five-image testing setting; future work should use a full test split and additional standard HDR benchmarks with mean and standard deviation reporting. Second, the comparison with HDRUNet and ExpandNet followed model-specific training settings rather than a fully unified protocol. A fairer comparison should use the same data split, resolution, epoch setting, and evaluation procedure. Third, the current evaluation used only PSNR and SSIM, which do not fully capture perceptual HDR quality. Future work should include HDR-VDP-2, TMQI, LPIPS, and other perceptual or HDR-specific metrics. Fourth, the augmentation experiment compared composite configurations rather than isolating the independent effect of each operation. Therefore, operation-level ablation studies are needed to clarify the individual roles of unsharp masking, denoising, Gaussian blur, and brightness–contrast adjustment. Finally, the ResNet50 backbone, attention blocks, SE blocks, mixed-precision training, and learning-rate scheduling were evaluated as an integrated pipeline. Future work should conduct component-level ablation studies and compare additional lightweight backbones such as ResNet18 and MobileNet. The current model is also limited to a fixed 256 × 256 output resolution, so future research should support variable and higher-resolution reconstruction.

Overall, the proposed framework provides practical evidence that an HDRCNN-derived architecture can improve training efficiency and structural similarity relative to the HDRCNN baseline on consumer-grade hardware, but broader benchmark validation and more complete ablation studies are still required.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by T.-W.H., P.-C.C. and T.-H.C. The first draft of the manuscript was written by T.-W.H. and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Science and Technology of Taiwan under grants MOST 113-2813-C-415-009-E and NSTC 114-2221-E-415-013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This study does not involve human participants.

Data Availability Statement

The SI-HDR dataset used in this study is publicly available and can be accessed through the SI-HDR benchmark project [27]. The project page is available at https://www.cl.cam.ac.uk/research/rainbow/projects/sihdr_benchmark/ (accessed on 9 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ward, G. Real pixels. In Graphics Gems II; Morgan Kaufmann: San Francisco, CA, USA, 1991; Volume 2, pp. 80–83.
2. Banterle, F.; Artusi, A.; Debattista, K.; Chalmers, A. Advanced High Dynamic Range Imaging; AK Peters/CRC Press: Boca Raton, FL, USA, 2017.
3. Akyüz, A.O.; Fleming, R.; Riecke, B.E.; Reinhard, E.; Bülthoff, H.H. Do HDR displays support LDR content? A psychophysical evaluation. ACM Trans. Graph. 2007, 26, 38.
4. Masia, B.; Agustin, S.; Fleming, R.W.; Sorkine, O.; Gutierrez, D. Evaluation of reverse tone mapping through varying exposure conditions. In Proceedings of the ACM SIGGRAPH Asia 2009, Yokohama, Japan, 16–19 December 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 1–8.
5. Banterle, F.; Ledda, P.; Debattista, K.; Bloj, M.; Artusi, A.; Chalmers, A. A psychophysical evaluation of inverse tone mapping techniques. Comput. Graph. Forum 2009, 28, 13–25.
6. Meylan, L.; Daly, S.; Süsstrunk, S. The reproduction of specular highlights on high dynamic range displays. In Proceedings of the IS&T/SID 14th Color Imaging Conference (CIC), Scottsdale, AZ, USA, 6–10 November 2006; Society of Imaging Science and Technology: Springfield, VA, USA, 2006; pp. 333–338.
7. Wang, L.; Yoon, K.-J. Deep learning for HDR imaging: State-of-the-art and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8874–8895.
8. Eilertsen, G.; Kronander, J.; Denes, G.; Mantiuk, R.K.; Unger, J. HDR image reconstruction from a single exposure using deep CNNs. ACM Trans. Graph. 2017, 36, 1–15.
9. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
10. Marnerides, D.; Bashford-Rogers, T.; Hatchett, J.; Debattista, K. ExpandNet: A deep convolutional neural network for high dynamic range expansion from low dynamic range content. Comput. Graph. Forum 2018, 37, 37–49.
11. Khan, Z.; Khanna, M.; Raman, S. FHDR: HDR image reconstruction from a single LDR image using feedback network. In Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, ON, Canada, 11–14 November 2019; IEEE: New York, NY, USA, 2019; pp. 1–5.
12. Chen, X.; Liu, Y.; Zhang, Z.; Qiao, Y.; Dong, C. HDRUNet: Single image HDR reconstruction with denoising and dequantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 354–363.
13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
14. Liu, S.; Zhang, X.; Sun, L.; Liang, Z.; Zeng, H.; Zhang, L. Joint HDR denoising and fusion: A real-world mobile HDR image dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 13966–13975.
15. Liu, Y.-L.; Lai, W.-S.; Chen, Y.-S.; Kao, Y.-L.; Yang, M.-H.; Chuang, Y.-Y.; Huang, J.-B. Single-image HDR reconstruction by learning to reverse the camera pipeline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1651–1660.
16. Niu, Y.; Wu, J.; Liu, W.; Guo, W.; Lau, R.W. HDR-GAN: HDR image reconstruction from multi-exposed LDR images with large motions. IEEE Trans. Image Process. 2021, 30, 3885–3896.
17. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
18. Zhang, Y.; Yu, P.; Xiao, Y.; Wang, S. Pyramid-structured multi-scale transformer for efficient semi-supervised video object segmentation with adaptive fusion. Pattern Recognit. Lett. 2025, 194, 48–54.
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778.
20. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
21. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740.
22. Debevec, P.E.; Malik, J. Recovering high dynamic range radiance maps from photographs. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, CA, USA, 3–8 August 1997; ACM Press/Addison-Wesley Publishing Co.: New York, NY, USA, 1997.
23. Banterle, F.; Ledda, P.; Debattista, K.; Chalmers, A. Inverse tone mapping. In Proceedings of the 4th International Conference on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia (GRAPHITE), Kuala Lumpur, Malaysia, 29 November–2 December 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 349–356.
24. GIMP Documentation. Sharpen (Unsharp Mask). Available online: https://docs.gimp.org/en/gimp-filter-unsharp-mask.html (accessed on 9 March 2025).
25. GIMP Documentation. Noise Reduction. Available online: https://docs.gimp.org/en/gimp-filter-noise-reduction.html (accessed on 9 March 2025).
26. GIMP Documentation. Gaussian Blur. Available online: https://docs.gimp.org/2.10/en/gimp-filter-gaussian-blur.html (accessed on 9 March 2025).
27. Hanji, P.; Mantiuk, R.; Eilertsen, G.; Hajisharif, S.; Unger, J. SI-HDR: Dataset for Comparison of Single-Image High Dynamic Range Reconstruction Methods; University of Cambridge: Cambridge, UK, 2022.
28. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 586–595.
29. Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 6228–6237.

Figure 1. The architecture diagram of the proposed scheme.

Figure 2. Training and validation loss curves of HDRCNN before and after code optimization using a logarithmic y-axis.

Figure 3. Training and validation loss curves of the VGG16-based HDRCNN and the proposed ResNet50-based model using a shared logarithmic y-axis.

Figure 4. (a) Comparison of HDR reconstruction results: the original VGG16-based architecture (top) versus the proposed ResNet50-based model (bottom), illustrating the differences in PSNR and SSIM between the two settings. The tone-mapping operator uses the same settings as HDRCNN, and all results are shown at an exposure of EV +1.0. (b) Zoomed-in comparison of reconstructed images for different scenes using the proposed ResNet50-based model. Red boxes indicate regions of interest, and the corresponding enlarged views are shown on the right.

Figure 5. Visual comparison of HDR reconstruction results for the modified HDRCNN, HDRUNet, and ExpandNet models, corresponding to the quantitative metrics presented in Table 7.

Figure 6. Performance trade-off among HDRUNet, ExpandNet, and the proposed model under the adopted comparison protocol. The x-axis represents training time, the left y-axis represents PSNR, and the right y-axis represents SSIM. The proposed model achieves the shortest training time and the highest PSNR, whereas HDRUNet obtains the highest SSIM.

Table 1. Structural comparison between VGG16 and ResNet50 architectures.

Model	VGG16	ResNet50
Number of Parameters	About 138 Million	About 25 Million
Network Layers	16 Layers	50 Layers

Table 2. Detailed network layer configuration after modifying the HDRCNN model.

	Layer/Block	Input Channels	Output Channels	Operations
Encoder	FirstConv (ResNet50)	3	64	7 × 7 Conv, stride = 2, BN, ReLU
	MaxPool	64	64	3 × 3 MaxPool, stride = 2
	Encoder1 (Layer1)	64	256	ResNet50 Layer1 (3 blocks)
	Encoder2 (Layer2)	256	512	ResNet50 Layer2 (4 blocks)
	Encoder3 (Layer3)	512	1024	ResNet50 Layer3 (6 blocks)
	Encoder4 (Layer4)	1024	2048	ResNet50 Layer4 (3 blocks)
Decoder	Decoder4	2048	1024	2 × (3 × 3 Conv, BN, ReLU), SEBlock
	Upsample + Attention4	1024 (g), 1024 (x)	1024	Bilinear (×2), Attention (F_int = 512)
	Concat + Upsample	1024 + 1024	2048	Concatenate with x3
	Decoder3	2048	512	2 × (3 × 3 Conv, BN, ReLU), SEBlock
	Upsample + Attention3	512 (g), 512 (x)	512	Bilinear (×2), Attention (F_int = 256)
	Concat + Upsample	512 + 512	1024	Concatenate with x2
	Decoder2	1024	256	2 × (3 × 3 Conv, BN, ReLU), SEBlock
	Upsample + Attention2	256 (g), 256 (x)	256	Bilinear (×2), Attention (F_int = 128)
	Concat + Upsample	256 + 256	512	Concatenate with x1
	Decoder1	512	64	2 × (3 × 3 Conv, BN, ReLU), SEBlock
	Upsample + Attention1	64 (g), 64 (x)	64	Bilinear (×2), Attention (F_int = 32)
	Concat + Upsample	64 + 64	128	Concatenate with x0
	Decoder0	128	64	2 × (3 × 3 Conv, BN, ReLU), SEBlock
	FinalConv	64	3	1 × 1 Conv
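
The channel counts in Table 2 can be cross-checked with a few lines of bookkeeping. The sketch below is illustrative only and is not the authors' code: it assumes that the skip inputs in the "Concatenate with" rows are the encoder features (here named x0 for the FirstConv output and x1 to x3 for Encoder1 to Encoder3), and the names `SKIP_CHANNELS`, `DECODER_STAGES`, and `trace_decoder` are hypothetical.

```python
# Channel bookkeeping for the decoder rows of Table 2 (illustrative only).
# Assumption: x0 = FirstConv output, x1..x3 = Encoder1..Encoder3 outputs.
SKIP_CHANNELS = {"x0": 64, "x1": 256, "x2": 512, "x3": 1024}

# (decoder block, input channels, output channels, skip feature it is fused with)
DECODER_STAGES = [
    ("Decoder4", 2048, 1024, "x3"),
    ("Decoder3", 2048, 512, "x2"),
    ("Decoder2", 1024, 256, "x1"),
    ("Decoder1", 512, 64, "x0"),
]


def trace_decoder(bottleneck_channels=2048):
    """Return (stage, channels) pairs, checking each row against the next."""
    trace = []
    channels = bottleneck_channels
    for name, c_in, c_out, skip in DECODER_STAGES:
        # The block must accept what the previous stage produced.
        assert channels == c_in, f"{name}: expected {c_in}, got {channels}"
        # Table 2 lists equal channel counts for g (decoder) and x (skip).
        assert c_out == SKIP_CHANNELS[skip], f"{name}: g/x channel mismatch"
        # Concatenating the gated skip with the decoder feature adds the widths.
        channels = c_out + SKIP_CHANNELS[skip]
        trace.append((name, c_out))
        trace.append((f"Concat({skip})", channels))
    # Decoder0 maps the last concatenation (128) to 64; FinalConv maps 64 to 3.
    assert channels == 128
    trace.append(("Decoder0", 64))
    trace.append(("FinalConv", 3))
    return trace


if __name__ == "__main__":
    for stage, channels in trace_decoder():
        print(f"{stage:<12} {channels}")
```

Each decoder block halves or quarters the concatenated width so that its output matches the channel count of the next skip feature, which is why every concatenation in the table doubles the width.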

Table 3. Results of data preprocessing ratios for training HDRCNN.

Dataset	Processing Ratio	Training Time	Training Loss	Validation Loss	PSNR (dB)	SSIM
1	100% Original	7 h 41 min	0.0774	0.2118	19.05	0.6444
2	20% Original, 20% Unsharp Masking, 30% Denoising, 30% Blur	7 h 34 min	0.0277	0.0396	18.40	0.5905
3	15% Original, 15% Unsharp Masking, 20% Denoising, 20% Blur, 30% Brightness and Contrast	7 h 41 min	0.0159	0.0196	22.10	0.7714

Table 4. Summary of relative performance changes in augmentation configurations with respect to dataset 1 (Baseline).

Metric	Dataset 1 (Baseline)	Dataset 2	Dataset 3	Δ (Dataset 3 vs. Dataset 1)
Training Loss	0.0774	0.0277	0.0159	−79.5%
Validation Loss	0.2118	0.0396	0.0196	−90.7%
PSNR (dB)	19.05	18.40	22.10	+3.05 dB
SSIM	0.6444	0.5905	0.7714	+0.127
Training Time	7 h 41 min	7 h 34 min	7 h 41 min	0 min (0.0%)
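
The Δ column of Table 4 can be reproduced directly from the tabulated values. The following is a minimal sketch rather than the authors' evaluation script: losses are compared as relative changes, PSNR and SSIM as absolute differences. The helper names are hypothetical, and the last function assumes PSNR is computed as 10·log10(MAX²/MSE) with the same peak value for both datasets.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def mse_ratio_from_psnr_gain(gain_db):
    """MSE ratio implied by a PSNR gain, assuming a fixed peak value."""
    return 10.0 ** (-gain_db / 10.0)


# Values copied from Table 4 (Dataset 1 is the baseline, Dataset 3 the best setting).
train_delta = relative_change(0.0774, 0.0159)
val_delta = relative_change(0.2118, 0.0196)
psnr_gain = 22.10 - 19.05
ssim_gain = 0.7714 - 0.6444

if __name__ == "__main__":
    print(f"Training loss:   {train_delta:+.1f}%")
    print(f"Validation loss: {val_delta:+.1f}%")
    print(f"PSNR:            {psnr_gain:+.2f} dB")
    print(f"SSIM:            {ssim_gain:+.3f}")
    print(f"Implied MSE ratio: {mse_ratio_from_psnr_gain(psnr_gain):.3f}")
```

Under the stated PSNR assumption, a gain of about 3 dB corresponds to roughly halving the mean squared error, which gives a more intuitive reading of the PSNR row.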

Table 5. Comparison of HDRCNN performance before and after code optimization.

	Before Optimization	With Optimization
Training Time	11 h 14 min	7 h 56 min
Final Training Loss	0.0162	0.0159
Final Validation Loss	0.0408	0.0196

Table 6. Comparison of HDRCNN performance before and after model modification.

	VGG16	ResNet50
Training Time	7 h 56 min	4 h 51 min
Final Training Loss	0.0159	0.0446
Final Validation Loss	0.0196	0.0437

Table 7. Performance comparison between modified HDRCNN, HDRUNet, and ExpandNet.

Model	Training Time	PSNR	SSIM
HDRUNet	3 days 14 h 32 min	21.25 dB	0.661
ExpandNet	19 h 21 min	12.26 dB	0.3022
Ours	4 h 51 min	21.72 dB	0.5856
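
To make the training-time column of Table 7 easier to compare, the durations can be converted to minutes and expressed relative to the proposed model. This is a small sketch under the assumption that the times are directly comparable under the adopted protocol; `to_minutes` and the dictionary names are hypothetical helpers, not part of the published code.

```python
import re

# Minutes per unit for duration strings such as "3 days 14 h 32 min".
UNIT_MINUTES = {"day": 1440, "days": 1440, "h": 60, "min": 1}


def to_minutes(text):
    """Convert a duration string from the tables to whole minutes."""
    parts = re.findall(r"(\d+)\s*(days?|h|min)", text)
    return sum(int(number) * UNIT_MINUTES[unit] for number, unit in parts)


# Training times copied from Table 7.
TRAINING_TIMES = {
    "HDRUNet": "3 days 14 h 32 min",
    "ExpandNet": "19 h 21 min",
    "Ours": "4 h 51 min",
}

ours = to_minutes(TRAINING_TIMES["Ours"])
# How many times longer each model trains compared with the proposed one.
time_ratio = {name: to_minutes(t) / ours for name, t in TRAINING_TIMES.items()}

if __name__ == "__main__":
    for name, ratio in time_ratio.items():
        print(f"{name:<10} {to_minutes(TRAINING_TIMES[name]):>5} min  x{ratio:.2f}")
```

With the tabulated values, HDRUNet trains for roughly 18 times as long as the proposed model and ExpandNet for roughly 4 times as long.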

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

