Multi-Scale Attention-Guided Non-Local Network for HDR Image Reconstruction

High-dynamic-range (HDR) image reconstruction methods are designed to fuse multiple low-dynamic-range (LDR) images captured with different exposure values into a single HDR image. Recent CNN-based methods mostly perform local attention- or alignment-based fusion of multiple LDR images to create HDR contents. Relying on a single attention mechanism or alignment alone fails to compensate for ghosting artifacts, which can arise in the synthesized HDR images due to object motion or camera movement across the different LDR inputs. In this study, we propose a multi-scale attention-guided non-local network called MSANLnet for efficient HDR image reconstruction. To mitigate the ghosting artifacts, the proposed MSANLnet performs implicit alignment of LDR image features with multi-scale spatial attention modules and then reconstructs pixel intensity values using long-range dependencies through non-local means-based fusion. These modules adaptively select useful information that is not damaged by an object's movement or unfavorable lighting conditions for image pixel fusion. Quantitative evaluations against several current state-of-the-art methods show that the proposed approach achieves higher performance than the existing methods. Moreover, comparative visual results show the effectiveness of the proposed method in restoring saturated information from the original input images and in mitigating ghosting artifacts caused by large object movements. Ablation studies show the effectiveness of the proposed method, architectural choices, and modules for efficient HDR reconstruction.


Introduction
High-dynamic-range (HDR) imaging techniques extend the range of brightness in an image such that a photograph taken by a camera becomes as similar as possible to the scene as observed by the naked eye. HDR images have been used in a wide range of applications such as photography, film, and games [1,2], as they contain more information from a given scene and can provide a better visual experience. However, because camera image sensors generally have a narrow dynamic range, capturing scenes with widely varying brightness using a single exposure value is considered relatively difficult [3,4]. Recent advances in imaging technology have enabled the acquisition of HDR images with a single-sensor camera, but such specialized cameras are often prohibitively expensive for general consumers. Therefore, the primary approach to HDR imaging is to capture multiple low-dynamic-range (LDR) images in quick succession with different exposure values and then select or fuse the best pixels or regions to reconstruct a single HDR image [5][6][7].
In this study, we propose a deep neural network, called MSANLnet, that uses a multi-scale attention mechanism and the non-local means technique to effectively alleviate the ghosting artifact problem in HDR image reconstruction. The proposed MSANLnet has two distinctive parts: (a) multi-scale attention for object motion and saturation mitigation, and (b) non-local means-based fusion. Existing spatial-wise attention only captures important regions at a single scale, which is not effective for distinguishing both object movements and saturated regions when producing HDR contents. In the proposed method, LDR features are aligned based on multi-scale spatial attention modules. During this process, spatial attention is performed at each feature scale to expand the receptive field and progressively correct and fuse important information, such as large movements and saturation, which is found to be effective in producing high-quality HDR content. In addition, the non-local means module looks at the whole image and fuses only the global contents correlated with the reference image. Hence, the multi-scale attention mechanism can effectively capture the local movements of objects, and the non-local means-based fusion can reduce ghosting artifacts at a global level by looking at the whole image, which effectively mitigates ghosting artifacts for large object motions and global shifts such as camera translation.
This two-stage design effectively selects, for each input frame with a different exposure value, useful information that is not altered by motion, and then fuses the images with reduced locality to suppress noise and mitigate the ghosting artifact problem. We validated that the proposed method performs better than the existing methods, quantitatively and qualitatively, through experiments on a publicly available dataset that is widely used for HDR image restoration. We summarize our contributions as follows.

• We propose MSANLnet, a multi-scale attention-guided non-local network for HDR image reconstruction, that extracts important features from the LDR features using multi-scale spatial attention and adaptively fuses the contextual features to obtain HDR images.
• We show that the multi-scale spatial attention, combined with the non-local means-based fusion, can effectively alleviate the "ghosting artifact" and produce aesthetic HDR images.
• Our proposed method outperforms the existing methods in both qualitative and quantitative measures, validating the efficacy of the attention modules, non-local means-based fusion, and architectural choices.
The remainder of this study is organized as follows. Related work is discussed in Section 2. The proposed MSANLnet is presented along with detailed descriptions of each module in Section 3. The experimental setup and results are presented in Section 4. Finally, Section 5 presents our conclusions and suggests some possible avenues for further research.

Related Work
First proposed by Madden [40] and Mann [41] and popularized by Debevec and Malik [42] for digital imaging, multi-exposure-fusion-based HDR reconstruction methods can be classified into two major classes: (i) traditional approaches, and (ii) deep learning-based approaches. For brevity, we restrict our related work to multi-exposure-fusion-based approaches.
Rejection-based methods try to find the pixels/regions with motion and select only consistent pixels/regions using a reference image or the static contents of the LDR images [5,8,12]; i.e., these methods reject the moving contents. Rejection mechanisms are typically implemented via patch-wise comparison [9], illumination consistency and linear relationships among pixels of the LDR images [16,43,44], thresholding [5,11,12], background probability maps [45], super-pixel comparison [15], etc. Such approaches are fast; however, they are limited to producing LDR contents in the dynamic regions.
These methods can produce plausible results for static scenes and small motions; however, they perform poorly in recovering saturated pixels.
Patch-based optimization methods typically synthesize LDR patches based on the structure of reference patches by finding dense correspondences among the LDR images, and then fuse the synthesized LDR images to produce HDR contents [24,25,47]. These methods can produce high-quality HDR content but suffer from computational complexity. Though most traditional methods produce plausible HDR contents for static scenes, they suffer from poor synthesis/misalignment of LDR contents in dynamic scenes with large motions and from difficulty recovering saturated pixels/regions from the input LDR images.
Deep learning-based methods are capable of synthesizing novel HDR contents based on the input LDR images, making them suitable for complex dynamic scenes. Early methods employed CNN-based techniques, e.g., optical flow estimation [28,29], multi-scale feature extraction [38], etc., to compensate for ghosting artifacts. Subsequently, attention-based alignment methods have been used to remove ghosting artifacts and recover saturated regions [26,27,39,48].
Recently, generative adversarial networks (GANs) have been employed to supervise the HDR reconstruction process, which can effectively correct large motions with higher efficiency [82]. Despite the success of deep learning-based approaches, removing ghosting artifacts in HDR images for large motions is still an active area of research.

Overview of MSANLnet
The MSANLnet takes three LDR tensors as inputs. Each tensor consists of an LDR image and the gamma-corrected HDR-domain image of the corresponding LDR image. First, the three LDR images in the input set (i.e., L = {L_1, L_2, L_3}, where L_i ∈ R^(H×W×3), i = 1, 2, 3) are arranged according to their exposure lengths t_i, i.e., short exposure (L_1), intermediate exposure (L_2), and long exposure (L_3). Then, a mapping process converts the input LDR image set L into the HDR-domain image set H = {H_1, H_2, H_3}, H_i ∈ R^(H×W×3). Here, gamma correction (γ = 2.2, following [26]) is used for the mapping process as follows:

H_i = L_i^γ / t_i,   i = 1, 2, 3.

Then, an input tensor X_i for each L_i is obtained by concatenating the gamma-corrected H_i and the corresponding L_i along the channel dimension, i.e., X_i ∈ R^(H×W×6). Among these tensors, X_2, with the intermediate exposure length, is selected as the reference tensor. The set of input tensors X is then fed to MSANLnet to produce an HDR image.
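The input preparation described above can be sketched as follows; the exposure times below are illustrative placeholders, not values from the dataset:

```python
import numpy as np

def prepare_input_tensor(ldr, t, gamma=2.2):
    """Map an LDR image into the HDR domain via gamma correction,
    H_i = L_i^gamma / t_i, then concatenate the LDR image and its
    HDR-domain counterpart along the channel axis (H x W x 6)."""
    hdr = np.power(ldr, gamma) / t
    return np.concatenate([ldr, hdr], axis=-1)

# Three LDR images ordered by exposure length (short, intermediate, long);
# the exposure times here are hypothetical examples.
ldrs = [np.random.rand(256, 256, 3).astype(np.float32) for _ in range(3)]
exposures = [1.0, 4.0, 16.0]
X = [prepare_input_tensor(L, t) for L, t in zip(ldrs, exposures)]
# X[1] (intermediate exposure) serves as the reference tensor X_2.
```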
We define the process of generating the final HDR image H_f by fusing the three tensors, X = {X_1, X_2, X_3}, as follows:

H_f = f(X_1, X_2, X_3; θ),

where f represents the proposed MSANLnet and θ denotes the parameters of the network. The proposed MSANLnet model uses a CNN-based encoder-decoder structure and is largely composed of multi-scale spatial attention modules and non-local means-based fusion modules. Figure 2 shows an overview of the proposed network. As shown in Figure 2, the encoder extracts spatial attention features at each scale of the LDR tensor X_i through the multi-scale spatial attention module (please refer to Section 3.2 and Supplementary Materials). The multi-scale spatial attention features contribute to compensating for motions and adaptively recovering feature values in the saturated regions. The decoder uses transposed convolution blocks to restore the reduced resolution to the original resolution while concatenating spatial attention features and reference image features. Finally, the HDR image is output through a non-local means-based fusion mechanism (please refer to Section 3.3 and Supplementary Materials), which considers the global relationships among feature values and thus effectively removes ghosting artifacts for larger motions during HDR image reconstruction.

Multi-Scale Spatial Attention Module
The proposed approach is designed to effectively identify areas to be used in images with long exposure values and images with short exposure values. Specifically, a multiscale attention mechanism is devised in the encoder to implicitly align the two images with the reference image, as shown in Figure 2.
First, the three 6-channel input tensors X_1, X_2, and X_3 are each encoded at multiple spatial scales via convolutional layers to extract features. During this process, max pooling is applied to halve the spatial size of the feature map extracted at each step relative to the previous step. Thus, the receptive field is enlarged to capture larger foreground movements in the multi-scale attention module. Note that whenever we halve the spatial size of a feature map, we double its number of channels. For example, in our experiment, we used 64 channels for the first scale and 128 channels for the second scale.
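A minimal PyTorch sketch of this two-scale encoder is shown below; the 3 × 3 kernel size and the ReLU activations are assumptions, as the exact layer configuration is not specified in the text:

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Two-scale feature encoder sketch: each max-pooling step halves the
    spatial size and doubles the channel count (64 -> 128), enlarging the
    receptive field for the subsequent attention modules."""
    def __init__(self, in_ch=6, base_ch=64, n_scales=2):
        super().__init__()
        self.stages = nn.ModuleList()
        ch_in = in_ch
        for s in range(n_scales):
            ch_out = base_ch * (2 ** s)       # 64, 128, ...
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch_in, ch_out, 3, padding=1),
                nn.ReLU(inplace=True)))
            ch_in = ch_out
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            if i > 0:
                x = self.pool(x)              # halve spatial size per scale
            x = stage(x)
            feats.append(x)
        return feats  # e.g., [B x 64 x H x W, B x 128 x H/2 x W/2]
```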
As shown in Figure 2, a spatial attention block is executed at each scale to extract the spatial attention maps of the non-reference images. A spatial attention block is shown in Figure 3. In the spatial attention block, the input is the concatenation of the non-reference image features X_i (i ≠ 2) and the reference image features X_2. Then, the spatial attention map at each scale s is extracted as follows:

A_i^(s) = g(X_i^(s), X_2^(s)),   i ≠ 2,   s = 1, ..., N,

where N denotes the number of spatial scales and g(·) is the spatial attention map extractor. The attention map extractor consists of two consecutive convolutions followed by a sigmoid function, as used in a previous study [26]. Note that, in our experiment, we used N = 2 (i.e., two scales, as depicted in Figure 2). Each element of the spatial attention map A_i^(s) lies in the range [0, 1], and the spatial size of the generated A_i^(s) matches that of the corresponding feature map. Finally, an element-wise multiplication between the input feature X_i^(s) and the extracted attention map yields the attended feature:

X'_i^(s) = A_i^(s) ⊙ X_i^(s),

where ⊙ denotes element-wise multiplication. As seen in Figure 4, the two spatial attention maps focus on important regions for implicit alignment of the input LDR images to the reference image. As the network is provided with explicit multi-scale contextual information, it can compensate for large foreground motions and effectively mitigate ghosting artifacts. Moreover, as the spatial attention maps represent the differences among the LDR images with varying exposures, the network implicitly learns to recover saturated pixel values.
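The per-scale attention block can be sketched in PyTorch as follows; the kernel sizes and the ReLU between the two convolutions are assumptions, since the text specifies only two consecutive convolutions and a sigmoid:

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Per-scale spatial attention sketch (after AHDR [26]): concatenate
    non-reference and reference features, pass them through two convolutions
    and a sigmoid to obtain an attention map in [0, 1], then gate the
    non-reference features element-wise with that map."""
    def __init__(self, ch):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.Sigmoid())

    def forward(self, x_i, x_ref):
        a = self.g(torch.cat([x_i, x_ref], dim=1))  # A_i^(s), values in [0, 1]
        return x_i * a                              # element-wise gating
```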

Non-Local Means-Based Fusion
In some local areas of the image feature, sufficient information is not available due to occlusion or saturation caused by the movement of objects [26]. Therefore, in image fusion networks, useful information must be extracted over a large receptive field. However, a single CNN convolution filter covers only a limited receptive field. Previous studies have shown the importance of short- and long-range dependencies among pixel/feature values for various computer vision tasks [54,83-86]. Non-local means is a method of calculating long-range dependencies and restoring pixels through weighted averages based on these dependencies, which is intuitively beneficial for removing ghosting artifacts. To this end, we applied a non-local block to image fusion so that global information is utilized and locality is reduced to effectively alleviate ghosting artifacts.
First, we utilize a residual dense convolutional block [87] before applying the non-local block, such that the feature Z, recovered to the original size through the decoding process, learns sufficient local information. By concatenating all layers, useful local information is adaptively extracted from the features of the previous layers and the features of the current layer. The resulting feature X is fed into the non-local block, which outputs Y of the same size, H × W × C. Figure 5 illustrates the structure of a residual dense convolutional block composed of three convolutional blocks. For the non-local means-based fusion, we adopt the asymmetric pyramid non-local block (APNB) [88] to construct the non-local blocks. In APNB, spatial pyramid pooling is applied to a standard non-local block to reduce computation by sampling part of the feature map based on meaningful global information. Figure 6 shows the structure of the asymmetric pyramid non-local block, including spatial pyramid pooling. First, the input feature map X is converted into three different embeddings, Φ, θ, and γ, through three 1 × 1 convolutions, W_Φ, W_θ, and W_γ, respectively.
The spatial pyramid pooling module is composed of four pooling layers that produce outputs of different sizes in parallel. Such a pyramid pooling mechanism improves the expressive power of global features, as previous studies have shown the effectiveness of global and multi-scale representations for capturing scene semantics. The sampling processes through the pyramid pooling operators P_θ and P_γ are represented as follows:

θ_P = P_θ(θ),   γ_P = P_γ(γ).

The number of spatially sampled anchor points S can be expressed as

S = Σ_n n²,   n ∈ {1, 3, 6, 8},

where n denotes the width of the output features produced by each pooling layer, giving S = 110 anchor points in total. Thereafter, based on the standard non-local block method [89], the pseudo-similarity matrix V_P between Φ and θ_P is calculated as follows:

V_P = Φ^T θ_P,

where Φ is flattened to size C' × HW and θ_P to size C' × S, so that V_P has size HW × S. Instance normalization is then applied to V_P to generate a normalized pseudo-matrix V̄_P, and the Softmax function of the self-attention mechanism is used to derive the attention map as follows:

A = Softmax(V̄_P).

The final output of the asymmetric pyramid non-local block is obtained as follows:

Y = W_o(A γ_P^T) + X,

where W_o is composed of a 1 × 1 convolution and learns to weight the important parts of the non-local operation. Subsequently, the network generates the final 3-channel HDR image through a convolutional layer.
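A simplified PyTorch sketch of this APNB-style fusion is given below; the pooling output sizes (1, 3, 6, 8) follow the original APNB design [88], and the instance-normalization step on the similarity matrix is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricNonLocalBlock(nn.Module):
    """APNB-style non-local fusion sketch [88]: the key/value embeddings
    (theta, gamma) are sub-sampled with spatial pyramid pooling so the
    attention matrix costs O(HW x S) instead of O(HW x HW); pooling sizes
    (1, 3, 6, 8) give S = 1 + 9 + 36 + 64 = 110 anchor points."""
    def __init__(self, ch, inner_ch=None, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        inner_ch = inner_ch or ch // 2
        self.phi = nn.Conv2d(ch, inner_ch, 1)    # query (full resolution)
        self.theta = nn.Conv2d(ch, inner_ch, 1)  # key   (pyramid-sampled)
        self.gamma = nn.Conv2d(ch, inner_ch, 1)  # value (pyramid-sampled)
        self.w_o = nn.Conv2d(inner_ch, ch, 1)    # output projection
        self.pools = [nn.AdaptiveAvgPool2d(n) for n in pool_sizes]

    def _pyramid_sample(self, x):
        # Flatten each pooled map and concatenate: B x C' x S, S = sum(n^2)
        return torch.cat([p(x).flatten(2) for p in self.pools], dim=2)

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.phi(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self._pyramid_sample(self.theta(x))     # B x C' x S
        v = self._pyramid_sample(self.gamma(x))     # B x C' x S
        attn = F.softmax(torch.bmm(q, k), dim=-1)   # B x HW x S
        out = torch.bmm(attn, v.transpose(1, 2))    # B x HW x C'
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return self.w_o(out) + x                    # residual connection
```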

Dataset
We used an open dataset [28] for HDR performance comparison to train and evaluate the proposed method. Specifically, out of a total of 89 image sets, 74 image sets were used for training, 5 for validation, and the remaining 10 for testing. Each image set comprised three LDR images and one HDR ground-truth image. The three LDR images were captured using exposure biases of {−2, 0, +2} or {−3, 0, +3}. Following an existing study on HDR [26], in the training stage, the 1000 × 1500 LDR images were cropped into 256 × 256 patches with a stride of 128, yielding a total of 1775 image patches, and data augmentation was applied to mitigate over-fitting.
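A stride-128 cropping policy such as the one described can be sketched as below; note that the exact boundary handling and the augmentation that together yield the reported 1775 patches are not fully specified, so this is only one plausible implementation:

```python
import numpy as np

def extract_patches(img, patch=256, stride=128):
    """Slide a patch x patch window with the given stride over an
    H x W x C image and return all fully contained crops."""
    h, w = img.shape[:2]
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

# On a 1000 x 1500 image this policy yields 6 x 10 = 60 crops.
patches = extract_patches(np.zeros((1000, 1500, 3), dtype=np.float32))
```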

Objective Function
As noted in a previous work [28], HDR images are mainly viewed after tone mapping, and thus it is more effective to optimize the network in the tone-mapped domain than in the HDR domain. A tone-mapping process using the µ-law was applied to the reconstructed HDR image H and the ground-truth image GT as follows:

T(X) = log(1 + µX) / log(1 + µ),

where µ is a parameter indicating the degree of compression, and T(X) denotes the tone-mapped image. We set µ = 5000 [28], and the range of the resulting T(X) is [0, 1]. Subsequently, to minimize the error between T(H) and T(GT), we apply a per-pixel L1 loss and train the network with the loss function L defined as follows:

L = ||T(H) − T(GT)||_1.
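The µ-law tone mapping and the tone-mapped L1 loss can be sketched as:

```python
import numpy as np

MU = 5000.0  # compression parameter, following [28]

def mu_law_tonemap(x, mu=MU):
    """Compress an HDR image in [0, 1] with the mu-law:
    T(x) = log(1 + mu * x) / log(1 + mu), so T(0) = 0 and T(1) = 1."""
    return np.log1p(mu * x) / np.log1p(mu)

def tonemapped_l1_loss(hdr_pred, hdr_gt):
    """Per-pixel L1 loss computed in the tone-mapped domain."""
    return np.abs(mu_law_tonemap(hdr_pred) - mu_law_tonemap(hdr_gt)).mean()
```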

Implementation Details
In the training stage of the proposed method, the number of epochs and the batch size were set to 200 and 4, respectively. The learning rate was initially set to 0.0001 and then multiplied by 0.1 at epoch 100. We used the Adam optimizer [90], and both training and evaluation were implemented using the PyTorch framework [91]. Our experiments were performed on a single NVIDIA 2080Ti GPU. The number of model parameters is 2.8 million (2,772,707), and the average inference time per image is 0.0181 s.
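The optimizer and learning-rate schedule described above correspond to the following PyTorch setup (the model here is a placeholder, not the actual MSANLnet):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(6, 3, 3, padding=1)  # placeholder module
optimizer = Adam(model.parameters(), lr=1e-4)
# Multiply the learning rate by 0.1 at epoch 100 (of 200 total)
scheduler = MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(200):
    # ... forward/backward passes over batches of size 4 ...
    scheduler.step()
```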

Evaluation Design
To evaluate the efficiency of the proposed method, we compared it with five state-of-the-art methods in both a qualitative and a quantitative manner. Specifically, we compared our method with AHDR [26], Wu [28], AD [27], NHDRR [30], and DAHDR [39]. AHDR [26] introduced spatial attention modules into HDR imaging to guide the merging according to the reference image. AD [27] uses a spatial attention module for feature-level attention and a multi-scale alignment module (i.e., PCD alignment [92]) to align the images at the feature level. DAHDR [39] exploits both spatial attention and feature channel attention to achieve ghost-free merging. NHDRR [30] is based on a U-Net architecture with non-local means to produce HDR images. All of the selected models are deep learning-based methods that generate an HDR image from multiple LDR images. In addition, all of these models were trained on datasets of LDR images with large motions, which makes them appropriate for an objective performance comparison with our method. All models were trained in the same way and in the same environment as the proposed approach for a fair comparison.

Quantitative Evaluation
To quantitatively evaluate the performance of the model, we measured the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) on the testing set. PSNR is used to evaluate the impact of lost information on the quality of a generated or compressed image and represents the power of the noise with respect to the maximum power of the signal. SSIM is designed to evaluate human visual quality differences rather than numerical loss: it compares the luminance, contrast, and structure of images rather than performing only pixel-wise comparisons. We used the PSNR-l and PSNR-µ indicators, which measure PSNR in the linear and tone-mapped domains, respectively. For the SSIM, we used values measured in the tone-mapped domain. Table 1 shows the average values of the quantitative indicators, where larger values correspond to higher quality. The proposed model achieved higher values than the other models in all three quantitative performance indicators, i.e., better performance in terms of PSNR-l, PSNR-µ, and SSIM compared to AHDR [26], Wu [28], AD [27], NHDRR [30], and DAHDR [39]. This is because the proposed method can restore details and effectively eliminate ghosting artifacts through the non-local means-based fusion modules. NHDRR [30] also merges images using non-local means modules, but MSANLnet, which extracts features using scale-specific spatial attention, exhibited numerically better performance. Figures 7 and 8 visually compare the results of HDR image generation by the proposed method and existing state-of-the-art models [26-28,30,39]. In particular, Figure 7 qualitatively compares the experimental results on images with saturated backgrounds and large foreground movements. The input images referenced by the network for each exposure value are shown in Figure 7a, and the tone-mapped HDR result images produced by the proposed method are shown in Figure 7b.
The corresponding testing set image had a saturated background area and large foreground movements. Areas with large movements and saturation were cropped into patches, as shown in Figure 7b,c, respectively. The details of the background were obscured by the movement of an object in the image with a short exposure length, and the information was lost even in the images with long or medium exposure lengths. Because the image was saturated, learning models were likely to use artifacts and distorted information from the first image, whose background was occluded, for HDR imaging, as seen in Figure 7c. Figure 7. Visual comparison of HDR images restored from the dataset provided in [28]. From left, Wu [28], AHDR [26], AD [27], NHDRR [30], DAHDR [39], Proposed MSANLnet, and Ground Truth.

Qualitative Evaluation
As seen in Figure 7, Wu [28] produced an excessively smooth image and failed both to restore the details of the saturated area and to remove ghosting artifacts. Although AHDR [26] selectively incorporated useful areas through spatial attention modules to restore obscured or saturated details, it still exhibited ghosting artifacts. NHDRR [30] removed ghosting artifacts to some extent, but output some blurring artifacts and missed details in many areas. AD [27] uses PCD alignment modules [92], which reduce ghosting artifacts to some extent; however, the over-exposed area was not fully recovered. In DAHDR [39], the afterimage of the finger remains intact, showing that it is not effective in removing ghosting artifacts; furthermore, the occluded area is not restored well. The results of the proposed MSANLnet expressed more details compared to the existing methods, while also eliminating ghosting artifacts, as may be observed from the figures. The resulting images demonstrate that the proposed MSANLnet expresses more details of saturated regions through the scale-specific spatial attention modules and effectively reduces ghosting artifacts through the non-local means-based fusion. Figure 8 also shows a comparison of images with motion and supersaturated regions in the testing set. The proposed MSANLnet restored details more clearly compared to the other models and also removed visual artifacts to produce higher-quality HDR images. Figure 8. Visual comparison of HDR images restored from the dataset provided in [28]. From left, Wu [28], AHDR [26], NHDRR [30], AD [27], DAHDR [39], Proposed MSANLnet, and Ground Truth.

Ablation Study

Ablation on the Network Structure
We performed an ablation study of the proposed network structure and analyzed the results. We compared the proposed MSANLnet with the following variant models to identify the importance of the individual components. Table 2 shows the quantitative comparison among the baselines and the proposed MSANLnet.

• ANLnet: The multi-scale attention module was removed from this version; i.e., this variant used the single-scale attention module of the baseline model AHDR [26] together with a non-local means-based fusion module.
• MSANet: The non-local fusion module was removed from this version. That is, this variant adopted a multi-scale attention module and a dilated residual dense block (DRDB)-based fusion as used in the baseline models.
• MSANLnet: The proposed model, which includes both a multi-scale attention module and a non-local means-based fusion module.

As shown in Table 2, when the multi-scale attention module was used, PSNR improved by 1.41 dB compared to the single-scale module. As seen in Figure 9, an afterimage of arm movement remains in MSANet, and the difference cannot be repaired naturally, creating boundaries. However, the proposed model with multi-scale attention effectively eliminates such artifacts. These results confirm that expanding the receptive field helps reduce ghosting artifacts and restore details.

Comparison of Degradation of HDR Restoration for Global Motion
We performed additional experiments with translation applied to the input LDR images and visually compared the HDR restoration performance of the proposed and existing methods. Specifically, to simulate global camera translation, we shifted the first input LDR image to the right by 50 pixels, kept the reference LDR image (i.e., the second image) unchanged, and shifted the third input LDR image to the left by 50 pixels. The visual results are summarized in Figure 10. Figure 10. Visual comparison of degradation of HDR restoration for global motion (i.e., when translation is applied). From left, Wu [28], AHDR [26], NHDRR [30], AD [27], DAHDR [39], and Proposed MSANLnet.
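The global-translation perturbation can be reproduced with a simple horizontal shift; the zero-fill of the vacated columns is an assumption, as the paper does not state how the exposed border is handled:

```python
import numpy as np

def shift_horizontal(img, dx):
    """Translate an H x W x C image horizontally by dx pixels
    (positive = right), filling the vacated columns with zeros."""
    out = np.zeros_like(img)
    if dx > 0:
        out[:, dx:] = img[:, :-dx]
    elif dx < 0:
        out[:, :dx] = img[:, -dx:]
    else:
        out[:] = img
    return out

# Shift the short exposure right and the long exposure left by 50 px;
# the reference (middle) image stays unchanged.
l1, l2, l3 = (np.random.rand(256, 256, 3) for _ in range(3))
shifted = [shift_horizontal(l1, 50), l2, shift_horizontal(l3, -50)]
```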
As seen in Figure 10, most of the existing methods fail to mitigate ghosting artifacts in the case of such global motion caused by the camera's translation. However, our proposed method can effectively produce HDR image contents without severe quality degradation, even though we do not employ any explicit alignment mechanisms. This is because the multi-scale attention mechanism can effectively capture the local movements of the objects and the non-local means-based fusion reduces ghosting artifacts on a global level by looking at the whole image, which essentially mitigates ghosting artifacts for large object motions and global shifts such as translation.

Generalization Ability on Different Dataset
We performed additional experiments on the generalization ability of the proposed method on a different test dataset. Specifically, we used the test dataset provided in [93] and summarize the visual results in Figure 11. As seen in the figure, our method generalizes to unseen test datasets and performs comparatively better than the existing methods in restoring HDR contents, even though it was not trained on this dataset. Figure 11. Visual comparison of additional HDR images from the dataset provided in [93]. From left, Wu [28], AHDR [26], NHDRR [30], AD [27], DAHDR [39], and Proposed MSANLnet.

Conclusions
In this study, we have proposed MSANLnet, a non-local network based on multi-scale attention for effective HDR image restoration. In the encoder part, implicit alignment of features is performed at various resolutions with multi-scale spatial attention modules. In the decoder part, image restoration is performed by adaptively incorporating useful information, utilizing long-range dependencies with a non-local means-based fusion module. The results show that the proposed method performs better than existing deep learning methods. The results of this study confirm the importance and impact of expanding the receptive field in CNN-based HDR image restoration. In the future, we plan to further improve the quality of HDR images by applying and developing vision transformer-based [94] modules with long-range dependencies.

Conflicts of Interest:
The authors declare no conflict of interest.