Pyramid Inter-Attention for High Dynamic Range Imaging

This paper proposes a novel approach to high-dynamic-range (HDR) imaging of dynamic scenes that eliminates ghosting artifacts in HDR images in the presence of severe misalignment (large object or camera motion) among the input low-dynamic-range (LDR) images. Recent non-flow-based methods suffer from ghosting artifacts in the presence of large object motion. Flow-based methods face the same issue because their optical flow algorithms yield large alignment errors. To eliminate ghosting artifacts, we propose a simple yet effective alignment network that resolves the misalignment. The proposed pyramid inter-attention module (PIAM) aligns LDR features by leveraging inter-attention maps. Additionally, to boost the representation of aligned features in the merging process, we propose a dual excitation block (DEB) that recalibrates each feature both spatially and channel-wise. Extensive experimental results demonstrate the effectiveness of the proposed PIAM and DEB, which achieve state-of-the-art performance in producing ghost-free HDR images.


Introduction
Humans can see in a wide range of lighting conditions because the human eye adjusts constantly to a broad range of natural luminance values in the environment. However, standard digital cameras typically fail to capture images with sufficient dynamic range because of the limited ranges of sensors. To alleviate this issue, high-dynamic-range (HDR) imaging has been developed to improve the range of color and contrast in captured images [1]. Given a series of low-dynamic-range (LDR) images captured at different exposures, an HDR image is produced by merging these LDR images.
Based on the recent development of convolutional neural networks (CNNs), the performance of HDR imaging using CNNs [17][18][19][20][21][22] has been significantly improved. Eilertsen et al. [22] proposed an autoencoder network to produce HDR images from only a single image. Endo et al. [17] proposed to synthesize LDR images captured with different exposures (i.e., bracketed images) and then reconstruct an HDR image by merging the synthesized images. However, methods that rely on a single input LDR image cannot handle high-contrast scenes, since the problem is ill-posed. Kalantari et al. [19] attempted to handle the misalignment problem of dynamic scenes by using the classical optical flow algorithm [23] as an alignment process. However, the classical optical flow algorithm yields large alignment errors, which produce artifacts in misaligned regions. In addition, the classical optical flow algorithm requires significant computational time. Although Wu et al. [20] formulated HDR imaging as an image translation problem without alignment, they failed to reconstruct the details of an HDR image in occluded regions. Yan et al. [21] proposed an attention-guided deep network for suppressing misaligned features during the merging process to avoid ghosting artifacts. However, their method [21] still suffers from ghosting artifacts in the presence of camera motion or foreground motion, because it does not align the LDR images.
In this paper, we propose a novel end-to-end flow-based HDR method, including a pyramid inter-attention module (PIAM) and a dual excitation block (DEB) for the alignment and merging processes, respectively. Our method is the first to jointly estimate the correspondence between LDR images and reconstruct HDR images. Specifically, during the alignment process, we align the non-reference feature to a reference feature by leveraging the PIAM, as shown in Figure 1. Furthermore, we use the DEB to recalibrate the LDR features spatially and channel-wise, boosting the representation of features for generating ghost-free HDR images in the merging process. The main contributions of this paper can be summarized as follows:
• We propose a novel CNN-based framework for ghost-free HDR imaging by leveraging the pyramid inter-attention module (PIAM), which effectively aligns LDR images.
• We propose a dual excitation block (DEB), which recalibrates features both spatially and channel-wise by highlighting the informative features and excluding harmful components.
• Extensive experiments on HDR datasets [11,19,24] demonstrate that the synergy between the two aforementioned modules enables our framework to achieve state-of-the-art performance.

Figure 1. Given low-dynamic-range (LDR) images of a dynamic scene as inputs, the proposed method first generates the features using a shared feature extraction network. Before merging them, the alignment network aligns non-reference features to a reference feature (i.e., EV0) using the pyramid inter-attention module (PIAM). In the merging process, we recalibrate these features to concentrate on more useful elements for producing a ghost-free high-dynamic-range (HDR) image, using both spatial and channel excitations. Finally, the proposed method outputs a tonemapped HDR image.

HDR Imaging without Alignment.
We first review HDR imaging algorithms using the assumption that input LDR images are globally registered. Early work presented by Mann and Picard [2] attempted to combine differently exposed images to obtain a single HDR image. Debevec and Malik [3] recovered camera response function using differently exposed photographs with a static camera. Unger et al. [25] designed an HDR imaging system using a highly programmable camera unit and multi-exposure images. Khan et al. [26] computed the probabilities of pixels for part of an image background by iteratively weighting the contribution of each pixel. Jacobs et al. [5] removed ghosting artifacts by addressing brightness changes. Pece and Kautz [7] proposed a motion map to compute median threshold bitmaps for each image. Heo et al. [8] assigned weights to emphasize well-exposed pixels using Gaussian-weighted distance. Zhang and Cham [4] detected movement using quality measures based on image gradients to generate a weighting map. Lee et al. [27] and Oh et al. [28] explored rank minimization in HDR deghosting to detect motion and reconstruct HDR images. However, these solutions are impractical because they are not able to handle moving objects or camera motion.

HDR Imaging with Alignment.
To solve the misalignment of dynamic scenes for HDR imaging, some approaches align LDR images prior to reconstructing an HDR image by applying dense correspondence algorithms (i.e., optical flow). Bogoni [10] aligned LDR images via warping using local motion vectors, which are estimated using an optical flow algorithm. Kang et al. [9] exploited the optical flow algorithm after performing exposure correction between LDR images. Jinno and Okuda [29] estimated dense correspondences based on a Markov random field model. Gallo et al. [14] proposed a fast non-rigid registration method for input images where small motion exists between them. However, these approaches cannot handle ghosting artifacts in the presence of large foreground motion, because they use a simple merging process for combining aligned LDR images.
There have been many attempts to integrate alignment and HDR reconstruction into a joint optimization process. Sen et al. [11] proposed a patch-based energy-minimization method that integrates alignment and reconstruction into a joint optimization process. Hu et al. [15] decomposed the optimization problem by using image alignment based on brightness and gradient consistency. Hafner et al. [12] proposed an energy-minimization approach that simultaneously calculates HDR irradiance and displacement fields. Despite these improvements in HDR imaging, such methods still have limitations when large motions and saturation exist in LDR images.

Deep-Learning-Based Methods.
Recently, several deep CNN-based methods for HDR imaging [17][19][20][21][22] have been proposed. First, Eilertsen et al. [22] proposed a method for reconstructing HDR images from single LDR images using an autoencoder network. The method proposed by Endo et al. [17] predicts multiple LDR images with different exposures from a single LDR image, then reconstructs a final HDR image by merging the predicted images using a deep learning network. These methods have a limitation in that they use only a single LDR image, which makes it difficult to synthesize the details of an HDR image.
Kalantari et al. [19] attempted to solve the misalignment of LDR images by using an off-the-shelf optical flow algorithm [23]. They then merged the aligned LDR images to obtain an HDR image using CNNs. However, the optical flow algorithm [23] has a large computational time. Wu et al. [20] proposed a non-flow-based translation network that can elucidate plausible details from LDR inputs and generate ghost-free HDR images. Yan et al. [21] proposed an attention network to suppress the undesirable features due to the misalignment or saturation to avoid the ghosting artifacts. Although the methods discussed above represent remarkable advances in HDR imaging, they [20,21] cannot fully exploit the information from all LDR images. In contrast to these recent works [19][20][21], we incorporate a simple yet effective alignment network into the HDR imaging network to reconstruct details of HDR images by aligning LDR features.

Optical Flow.
Alignment between LDR images is a key factor for generating ghost-free HDR images. An optical flow algorithm can be used to perform alignment by finding the correspondence between the images. As a classical optical flow algorithm, the SIFT-flow algorithm [23] is an optimization-based algorithm for finding the optical flow between images. However, optimization-based methods require large computational times. Inspired by the success of CNNs, FlowNet [30] was the first end-to-end learning approach for optical flow. This method estimates the dense optical flow between two images based on a U-Net autoencoder architecture [31]. FlowNet 2.0 [32] stacks several basic FlowNet models for iterative refinement and significantly improves accuracy. Recently, PWC-Net [33] was proposed to warp features in each feature pyramid in a coarse-to-fine approach and achieve state-of-the-art performance with a lightweight framework. However, these deep-learning-based flow estimation methods cannot handle large object motions.

Attention Mechanisms.
Attention mechanisms have provided significant performance improvements for many computer vision tasks, such as image classification [34], semantic segmentation [35], and image generation [36,37]. In the works by Zhang et al. [36] and Wang et al. [34], self-attention mechanisms were proposed for modeling long-range dependencies to solve the problem of limited local receptive fields that many deep generative models have. For stereoscopic super-resolution tasks, Wang et al. [38] proposed a parallax-attention module for finding stereo correspondence. They found reliable correspondences with smaller computational cost than other stereo matching networks [39][40][41] by leveraging a parallax-attention mechanism. Inspired by attention mechanisms, we effectively find correspondences between the LDR images captured in dynamic scenarios for reconstructing HDR images. We then align the LDR features using these correspondences so that the features can be fully exploited. Although both our method and that of Yan et al. [21] use the term "attention", there is a significant difference between them. The attention network proposed by Yan et al. [21] focuses on highlighting meaningful features for HDR imaging. In contrast, our method uses inter-attention maps to align LDR images so that they can be fully exploited for HDR imaging.

Overview
An overview of the proposed method is presented in Figure 2. Using a set of LDR images $\{I_1, I_2, \ldots, I_k\}$ of a dynamic scene sorted by their exposure values, the proposed method aims to reconstruct a ghost-free HDR image $H_r$ that is aligned to the reference LDR image $I_r$. First, as a preprocessing step, we apply gamma correction [19][20][21] to map each LDR image $I_i$ into the HDR domain according to its exposure time $t_i$ (i.e., $J_i = I_i^{\gamma} / t_i$, where we set $\gamma$ to 2.2 in this work). Similar to previous approaches [19][20][21], the input for the proposed method is a concatenation of $I_i$ and $J_i$, where $i = 1, 2, 3$. After preprocessing, we feed each input into the feature extraction network, which is composed of several convolution layers followed by rectified linear unit (ReLU) functions, resulting in $E_i$.
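The preprocessing step above can be illustrated with a minimal NumPy sketch. It assumes LDR images stored as floating-point arrays in [0, 1] with shape (H, W, 3); the function names and the toy exposure time are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

GAMMA = 2.2  # value stated in the text


def ldr_to_hdr_domain(ldr, exposure_time, gamma=GAMMA):
    """Map an LDR image in [0, 1] to the HDR domain: J = I**gamma / t."""
    return np.power(ldr, gamma) / exposure_time


def build_network_input(ldr, exposure_time):
    """Concatenate I_i and J_i along the channel axis: (H, W, 3) -> (H, W, 6)."""
    hdr = ldr_to_hdr_domain(ldr, exposure_time)
    return np.concatenate([ldr, hdr], axis=-1)


rng = np.random.default_rng(0)
ldr_image = rng.random((4, 4, 3))  # toy LDR image (assumption)
network_input = build_network_input(ldr_image, exposure_time=0.25)
print(network_input.shape)  # (4, 4, 6)
```

The first three channels keep the original LDR values and the last three hold the exposure-normalized values, so the network sees both representations of each input.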
To exploit the features $E_o$, $o \in \{1, 3\}$, from the other LDR images (i.e., non-reference images), the alignment network warps the other features $\{E_1, E_3\}$ by leveraging the proposed pyramid inter-attention module (PIAM). The reference-aligned features and the reference feature are then merged to synthesize the details of the target HDR image. Although the alignment network aligns these features, alignment errors still occur in homogeneous regions or regions with repetitive patterns. To handle this problem, we propose a dual excitation block (DEB) that recalibrates features by highlighting the informative features and excluding harmful features. Finally, dilated residual dense blocks (DRDB) are used to learn hierarchical features for HDR imaging effectively.

Alignment Network
Since the features from LDR images are not aligned, we align them prior to merging so that they can be fully exploited. When camera motion or a moving object exists in a scene, the alignment process is a key factor for reconstructing an HDR image. Unlike the method proposed by Kalantari et al. [19], which uses the classical optical flow algorithm [23], we propose a novel alignment network, called PIAM. Before describing the details of the PIAM, we introduce the inter-attention module (IAM).

Figure 2. Overall framework for the proposed method. Our framework consists of three sections: a feature extraction network, an alignment network, and a merging network. First, we extract features from multiple LDR images using a feature extraction network. The alignment network, termed the pyramid inter-attention module (PIAM), is used to align the features from the feature extraction network. In the merging network, the dual excitation block (DEB) recalibrates features both spatially and channel-wise. A dilated residual dense block (DRDB) is used to learn hierarchical features for HDR imaging effectively.

Inter Attention Module.
The IAM is inspired by the self-attention mechanism [34,36], which estimates feature similarities for all pixels in a single image. While the self-attention mechanism finds self-similarity within a single image, the proposed IAM calculates the inter-similarity between LDR images for every pixel, which is used to align non-reference features toward the reference feature. In this section, we discuss the mechanism of the proposed IAM for the training and testing phases. Given a feature pair $\{F_r, F_o\} \in \mathbb{R}^{C \times H \times W}$, each feature is reshaped to $\mathbb{R}^{C \times HW}$. As shown in Figure 3, the features pass through $1 \times 1$ convolutions for the source, $\theta_s$, and the target, $\theta_t$. By multiplying the two resulting feature maps, a correlation map $C_{o \to r} \in \mathbb{R}^{HW \times HW}$ is generated such that $C_{o \to r} = \theta_t(F_r)^{T} \theta_s(F_o)$. This correlation map is softmax-normalized to generate a soft inter-attention map $A_{o \to r} \in \mathbb{R}^{HW \times HW}$.
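The computation of the soft inter-attention map can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the 1 × 1 convolutions are modeled as bias-free channel-mixing matrices, and the softmax is assumed to be applied along each row, i.e., over the positions of the non-reference feature for every reference position.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_inter_attention(f_ref, f_oth, w_t, w_s):
    """f_ref, f_oth: (C, H, W); w_t, w_s: (C_out, C) weights of the 1x1 convs."""
    c, h, w = f_ref.shape
    q = w_t @ f_ref.reshape(c, h * w)  # theta_t(F_r): (C_out, HW)
    k = w_s @ f_oth.reshape(c, h * w)  # theta_s(F_o): (C_out, HW)
    corr = q.T @ k                     # correlation map: (HW, HW)
    return softmax(corr, axis=1)       # row-wise normalization (assumption)


rng = np.random.default_rng(0)
C, H, W = 8, 4, 5
f_ref = rng.standard_normal((C, H, W))
f_oth = rng.standard_normal((C, H, W))
w_t = rng.standard_normal((C, C))
w_s = rng.standard_normal((C, C))
attn = soft_inter_attention(f_ref, f_oth, w_t, w_s)
print(attn.shape)  # (20, 20)
```

Each row of the result is a probability distribution over the HW positions of the non-reference feature, which is why the map grows quadratically with the number of pixels.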
As the soft inter-attention map $A_{o \to r}$ is softmax-normalized, it represents the matching probability over all spatial positions. However, in the optical flow algorithm, there is only one matching point for each pixel. To ensure that the inter-attention map represents only one matching point, a hard inter-attention map $B_{o \to r} \in \mathbb{R}^{HW \times HW}$ is generated as follows:

$$B_{o \to r}(i, j) = \begin{cases} 1, & \text{if } j = \arg\max_{j'} A_{o \to r}(i, j'), \\ 0, & \text{otherwise.} \end{cases}$$

With the hard inter-attention map $B_{o \to r}$, we can warp the other feature $F_o$ toward the reference feature $F_r$ using matrix multiplication, resulting in

$$\tilde{F}_o^{T} = B_{o \to r} F_o^{T}.$$

For training the IAM, we take the following additional steps. First, we generate an additional soft inter-attention map $A_{r \to o}$. We can train the IAM using photometric loss in an unsupervised manner, as described in Section 3.4. Photometric loss requires forward warping results using the soft inter-attention map. However, the occlusion problem, which originates from forward warping using an inter-attention map, is inevitable. An occluded region causes the network to estimate unreliable correspondences when using photometric loss for flow estimation [42] in an unsupervised manner. To ensure that the alignment network estimates reliable correspondences, we generate a validation mask for training the network. As suggested in [38], pixels in occluded regions typically have small weights in the inter-attention map $A_{r \to o}$. We design the validation mask $V_{r \to o} \in \mathbb{R}^{HW}$ for the reference image, and it can be obtained as follows:

$$V_{r \to o}(j) = \begin{cases} 1, & \text{if } \sum_{i=1}^{HW} A_{r \to o}(i, j) > \tau, \\ 0, & \text{otherwise,} \end{cases}$$

where $HW$ is the product of the height and width of feature $F_r$ and $\tau$ is a threshold. Here, we set $\tau$ to 0.1 empirically. The validation mask $V_{o \to r}$ can be generated in the same manner. The validation masks $\{V_{r \to o}, V_{o \to r}\}$ are used in the photometric loss for training the IAM in an unsupervised manner, as described in Section 3.4.
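The hard inter-attention map, the warping step, and the validation mask can be illustrated with a small NumPy sketch. It rests on assumptions that are not confirmed by the text: rows of the attention map index target positions, columns index source positions, the hard map is a one-hot arg-max per row, and a source position is marked valid when the total weight it receives exceeds the threshold. The toy values are made up for illustration.

```python
import numpy as np


def hard_attention(soft_attn):
    """One-hot map with a single 1 per row, placed at the row-wise arg-max."""
    hw = soft_attn.shape[0]
    hard = np.zeros_like(soft_attn)
    hard[np.arange(hw), soft_attn.argmax(axis=1)] = 1.0
    return hard


def warp_features(attn, f_src):
    """Warp f_src (C, H, W): target pixel i gathers source pixels weighted by attn[i]."""
    c, h, w = f_src.shape
    return (f_src.reshape(c, h * w) @ attn.T).reshape(c, h, w)


def validity_mask(soft_attn, tau=0.1):
    """Mark source position j as valid when its received weight exceeds tau."""
    return (soft_attn.sum(axis=0) > tau).astype(np.float32)


# Toy soft attention map for HW = 3 positions (each row sums to 1).
soft = np.array([[0.78, 0.20, 0.02],
                 [0.67, 0.30, 0.03],
                 [0.04, 0.95, 0.01]])
f_src = np.array([[[10.0, 20.0, 30.0]]])  # shape (C=1, H=1, W=3)

hard = hard_attention(soft)
warped = warp_features(hard, f_src)
mask = validity_mask(soft, tau=0.1)
print(warped.ravel())  # [10. 10. 20.]
print(mask)            # [1. 1. 0.]
```

In this toy case the third source position is never selected and receives a total weight of only 0.06, so the mask treats it as occluded.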

Pyramid Inter-Attention Module.
Finding global correspondences using the IAM for a large image requires a large amount of memory, as reported in Table 1. To alleviate this issue, we propose the PIAM, which consists of a global IAM and a local IAM, based on coarse-to-fine approaches for estimating correspondences [23,33]. As illustrated in Figure 4, the feature pair $\{E_r, E_o\} \in \mathbb{R}^{C \times H \times W}$ passes through two feature extraction networks. The first feature extraction network outputs the feature pair $\{F^l_r, F^l_o\} \in \mathbb{R}^{C \times H \times W}$, whose resolution is the same as that of $\{E_r, E_o\}$. The second network, which consists of $n$ convolutions with stride 2, outputs a feature pair at $1/2^n$ of that resolution, from which the global IAM estimates the global correspondence $B^g_{o \to r}$.

The feature-grouping operation first divides the feature $F^l_o \in \mathbb{R}^{C \times H \times W}$ into a grid of patches of shape $\mathbb{R}^{C \times 2^n \times 2^n}$, reshapes each patch to $\mathbb{R}^{C \cdot 2^{2n} \times 1 \times 1}$, and then combines these patches to form $f^l_o \in \mathbb{R}^{C \cdot 2^{2n} \times (H/2^n) \times (W/2^n)}$. The coarse, globally aligned feature $\tilde{F}^l_o$ is generated by performing feature-regrouping, which is the inverse operation of feature-grouping, on the warped first-level feature $B^g_{o \to r} f^l_o$.

Finally, we find the local correspondence between the feature pair $\{F^l_r, \tilde{F}^l_o\}$. To reduce memory consumption, in the local IAM we divide both features into grids of patches of size $k \times k$ and then perform alignment within local patches to find local correspondences. Each feature is divided into $N = (H/k) \cdot (W/k)$ patches, where $F^{l,n}$ denotes the $n$-th patch of $F^l$. The local IAM takes each input pair $\{F^{l,n}_r, \tilde{F}^{l,n}_o\}$ and outputs the local correspondence $B^{l,n}_{o \to r}$. With these local correspondences, we finally generate the warped feature $\tilde{E}_o$.
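The feature-grouping and feature-regrouping operations correspond to a space-to-depth rearrangement and its inverse. The following NumPy sketch illustrates the shapes involved; the function names and the ordering of elements inside each folded patch are assumptions made for illustration.

```python
import numpy as np


def feature_group(feat, n):
    """(C, H, W) -> (C * 2^(2n), H / 2^n, W / 2^n): fold each 2^n x 2^n patch into channels."""
    c, h, w = feat.shape
    p = 2 ** n
    assert h % p == 0 and w % p == 0, "H and W must be divisible by 2^n"
    x = feat.reshape(c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, p, p, H/p, W/p)
    return x.reshape(c * p * p, h // p, w // p)


def feature_regroup(grouped, n):
    """Inverse of feature_group: (C * 2^(2n), H / 2^n, W / 2^n) -> (C, H, W)."""
    p = 2 ** n
    cpp, hp, wp = grouped.shape
    c = cpp // (p * p)
    x = grouped.reshape(c, p, p, hp, wp)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H/p, p, W/p, p)
    return x.reshape(c, hp * p, wp * p)


feat = np.arange(2 * 8 * 8, dtype=np.float32).reshape(2, 8, 8)
grouped = feature_group(feat, n=1)
restored = feature_regroup(grouped, n=1)
print(grouped.shape)                    # (8, 4, 4)
print(np.array_equal(restored, feat))   # True
```

Because the grouped tensor has the same spatial size as the coarse feature pair, a correspondence estimated at the coarse level can move whole patches of the full-resolution feature at once.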

Merging Network
After aligning the other features $\{E_1, E_3\}$ to the reference feature $E_2$ using the alignment network, we obtain the warped features $\{\tilde{E}_1, \tilde{E}_3\}$. Despite the PIAM-based alignment, some alignment errors remain. To eliminate the harmful effect of features in misaligned or saturated regions, we designed a novel network that incorporates the dual excitation block (DEB) (Figure 5) and the dilated residual dense block (DRDB) [21] in the merging process. As a result, ghost-free HDR images are generated by reducing artifacts caused by misalignment while preserving details during the merging process.

Dual Excitation Block (DEB).
In contrast to other non-flow-based deep HDR methods [20,21], which fuse only the misaligned features $E_1, E_2, E_3$, we fuse the features warped by the PIAM. As shown in Figure 5, the input of the DEB is a fusion of the warped features and the reference feature. The fused feature $G_{fuse}$ is obtained by applying a concatenation operation, Concat(), to the warped features and the reference feature. The DEB recalibrates the fused feature $G_{fuse} \in \mathbb{R}^{C \times H \times W}$ both spatially and channel-wise by multiplying it by its excitations. The excitations assign spatial and channel-wise weights to the fused feature to suppress harmful features and emphasize informative features for generating ghost-free HDR images. The configuration of the DEB is illustrated in Figure 5. After $G_{fuse}$ passes through several convolutions followed by ReLU functions and a sigmoid function, the DEB generates the dual excitations. The fused feature is then recalibrated by multiplying it by the excitations. Unlike the attention of Yan et al. [21], which corresponds only to the spatial excitation produced by the DEB, we calculate both spatial and channel-wise excitations to refine the fused features.
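The idea of recalibrating a fused feature with spatial and channel-wise excitations can be sketched as follows. This is only an illustration under strong assumptions: the actual layer configuration is the one in Figure 5, whereas this sketch borrows a squeeze-and-excitation style channel branch (global average pooling and two linear layers) and a single 1 × 1 projection for the spatial branch, and it assumes the two excitations are applied by multiplication. All weight names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dual_excitation(g_fuse, w_spatial, w_c1, w_c2):
    """Recalibrate g_fuse (C, H, W) with a spatial map (1, H, W) and channel weights (C, 1, 1)."""
    c, h, w = g_fuse.shape
    flat = g_fuse.reshape(c, h * w)
    # Spatial excitation: 1x1 projection to one channel, then sigmoid (assumed design).
    spatial = sigmoid(w_spatial @ flat).reshape(1, h, w)
    # Channel excitation: global average pooling, bottleneck with ReLU, then sigmoid (assumed design).
    pooled = flat.mean(axis=1)
    hidden = np.maximum(w_c1 @ pooled, 0.0)
    channel = sigmoid(w_c2 @ hidden).reshape(c, 1, 1)
    # Both excitations lie in (0, 1) and rescale the fused feature by broadcasting.
    return g_fuse * spatial * channel


rng = np.random.default_rng(0)
g_fuse = rng.standard_normal((4, 3, 3))
recalibrated = dual_excitation(
    g_fuse,
    w_spatial=rng.standard_normal((1, 4)),
    w_c1=rng.standard_normal((2, 4)),
    w_c2=rng.standard_normal((4, 2)),
)
print(recalibrated.shape)  # (4, 3, 3)
```

Since every weight lies strictly between 0 and 1, the block can only attenuate a feature, which matches the stated goal of suppressing harmful components while leaving informative ones comparatively strong.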

Dilated Residual Dense Block (DRDB).
The DRDB uses dilated convolutions to provide a large receptive field for acquiring additional contextual information. The residual and dense connections in the DRDB enable us to use all of the hierarchical features contained in the fused features. The details of the DRDB are described in [21].

Training Losses
The proposed method consists of two tasks: alignment and HDR generation. We designed a loss function for training the alignment task, which finds the correspondences between LDR images. Following the procedure described in [19][20][21], we also use an HDR reconstruction loss. The overall loss function combines the HDR reconstruction loss and the alignment loss, where $\lambda$ controls the contribution of the alignment loss to the overall loss. $\lambda$ was empirically set to 0.5.

Alignment Loss.
Since there are no labeled dense correspondences between LDR images in an HDR dataset, we train the PIAM in an unsupervised manner. We introduce photometric loss for training the alignment network, following [38,43]. Photometric loss works for images with the same exposure value. However, in our case, the LDR images have different exposures. Therefore, we first match the brightness of the images, as suggested in [19]. Brightness constancy is maintained by raising the exposure of darker images to that of brighter images. For example, if $I_1$ is darker than $I_2$, then their exposures are matched such that $M_1 = \mathrm{clip}\left(I_1 (t_2/t_1)^{1/\gamma}\right)$ and $M_2 = I_2$, where clip ensures that the output lies in the range [0, 1], and $t_1$ and $t_2$ are the exposure times of $I_1$ and $I_2$, respectively.
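The exposure-matching step can be written as a short NumPy sketch. It assumes LDR pixel values in [0, 1] and uses the gamma value of 2.2 from the preprocessing step; the function name and the toy values are illustrative.

```python
import numpy as np


def match_exposure(dark, t_dark, t_bright, gamma=2.2):
    """Raise a darker LDR image to the exposure of a brighter one: clip(I * (t2/t1)^(1/gamma))."""
    return np.clip(dark * (t_bright / t_dark) ** (1.0 / gamma), 0.0, 1.0)


dark = np.array([0.1, 0.5, 0.9])  # toy pixel values of the darker image
matched = match_exposure(dark, t_dark=1.0, t_bright=4.0)
print(matched)  # the last value saturates at 1.0 after clipping
```

Pixels that saturate after matching carry no reliable information, which is one reason a validation mask is useful when the photometric error is computed.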
With exposure-matched pairs $\{M_s, M_t\}$, the PIAM can be trained using the soft inter-attention maps $A_{s \to t}$ in an unsupervised manner by minimizing the photometric error in the valid region $V_{s \to t}$. To train the global IAM using $\{M_s, M_t\}$, we define a global alignment loss in which the photometric error is masked by element-wise multiplication with the validation mask; here, $s$ denotes a source, $t$ denotes a target, and $m$ is generated by feature-grouping on $M$. The global IAM first warps $M_s$ to $M_t$ globally, generating $\tilde{M}_s$. We then train the local IAM using a local alignment loss with the same notation. In this work, we set the reference $r$ to 2 and the other index $o$ to 1 or 3. The overall alignment loss for training the PIAM combines the global and local alignment losses.
Since HDR images are usually displayed after tonemapping, the proposed HDR imaging network estimates a tonemapped HDR image using the µ-law described in [19] as follows:

$$T(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)},$$

where µ is a parameter that controls the amount of compression. In this work, we set µ to 5000. This tonemapping function is differentiable, which facilitates the training of our model in an end-to-end manner. The HDR reconstruction loss is then computed between the estimated HDR image $H$ and the ground truth $H_{gt}$.
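The µ-law tonemapping of [19] is simple enough to show directly. The sketch below assumes HDR values normalized to [0, 1] and uses µ = 5000 as stated in the text; the function name is illustrative.

```python
import numpy as np

MU = 5000.0  # compression parameter stated in the text


def mu_law_tonemap(hdr, mu=MU):
    """Compress linear HDR values in [0, 1] with the mu-law: log(1 + mu*H) / log(1 + mu)."""
    return np.log1p(mu * hdr) / np.log1p(mu)


hdr = np.array([0.0, 0.001, 0.01, 0.1, 1.0])
tonemapped = mu_law_tonemap(hdr)
print(tonemapped)  # monotonically increasing, maps 0 to 0 and 1 to 1
```

The mapping allocates most of the output range to dark values, so errors in shadows weigh more heavily in a loss computed after tonemapping than in one computed on linear radiance.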

Implementation Details
All convolutional filters in the feature extraction network are 3 × 3 filters, followed by ReLU functions. In the PIAM, the second-level feature extraction network consists of three convolutions for 8× down-sampling. For the local IAM, we set the size of the local patch to 32 × 32 for both training and testing. The growth rate was set to 32 in the DRDB. Our network was implemented using PyTorch on a PC with an Nvidia RTX 2080 GPU. The network was trained using the Adam optimizer [44] with $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The HDR imaging network was trained with a batch size of one and a learning rate of $1 \times 10^{-5}$. Data augmentation was performed by flipping the images or swapping their color channels. During training, the input images were randomly cropped to a size of 256 × 256 pixels. Training was completed after 200,000 iterations, when additional iterations could not provide any further improvements for alignment or HDR imaging. All methods, including our method, were implemented to produce 640 × 960 HDR images in the experiments.
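A rough count of attention-map entries shows why the pyramid design matters for the settings above. The sketch assumes features at the resolution of a 256 × 256 training crop, a single float32 attention map, and no other activations, so the figures are illustrative and are not the measurements reported in Table 1.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def attention_entries(height, width):
    """An inter-attention map over H x W positions has (H*W) x (H*W) entries."""
    positions = height * width
    return positions * positions


crop = 256    # training crop size from the text
down = 8      # 8x down-sampling for the global IAM
patch = 32    # local patch size for the local IAM

full_res = attention_entries(crop, crop)                   # global attention at full resolution
coarse = attention_entries(crop // down, crop // down)     # global IAM on the coarse feature
num_patches = (crop // patch) ** 2
local = num_patches * attention_entries(patch, patch)      # local IAM over all patches

for name, entries in [("full-resolution", full_res), ("coarse global", coarse), ("local", local)]:
    print(f"{name}: {entries * BYTES_PER_FLOAT32 / MIB:.0f} MiB")
# full-resolution: 16384 MiB
# coarse global: 4 MiB
# local: 256 MiB
```

Under these assumptions, the coarse global map and the local maps together need a small fraction of the memory that a single full-resolution map would require.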

Datasets.
The proposed HDR imaging network was trained using Kalantari's HDR dataset [19] according to the process presented in previous works [19][20][21]. Kalantari's HDR dataset provides ground truth HDR images, which facilitates training an HDR imaging network in a supervised manner. It consists of 74 sets for training and 15 sets for testing. Each set consists of three LDR images captured with different exposure values ({−2, 0, +2} or {−3, 0, +3}), and the ground truth HDR image is aligned to the reference image (middle exposure). The details of constructing the ground truth HDR image are discussed in [19]. After training our network on Kalantari's HDR dataset [19], we compared the performance of our HDR imaging method with that of other state-of-the-art methods by testing on this dataset both qualitatively and quantitatively. We also used Sen's dataset [11] and Tursun's dataset [24], but only for visual comparisons, since they do not contain ground truth HDR images.

Figure 6. Visual comparisons on (a) testing data 007 and (b) testing data 008 from Kalantari's dataset. In the top section, we present the input LDR images, the tonemapped HDR image produced by the proposed method, and LDR image patches. In the bottom section, we compare magnified local patches of the HDR images generated by our method and the state-of-the-art methods. Our network produces high-quality HDR images in the presence of saturation and object motions.

Evaluation Metrics.
We compared our method with various state-of-the-art methods quantitatively on Kalantari's dataset [19] because ground truth HDR images are available for this dataset. The evaluation metrics selected for measuring the quality of HDR images were PSNR-µ, PSNR-M, PSNR-L, and HDR-VDP-2.
We computed the PSNR-µ values between the generated HDR images and the ground truth HDR images after tonemapping using the µ-law. Additionally, evaluation metrics based on Matlab's tonemap function (PSNR-M) and the linear domain (PSNR-L) were adopted. To assess the visual quality of HDR images, we also measured HDR-VDP-2 values [45].
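PSNR-µ and PSNR-L differ only in the domain in which the error is measured, as the following NumPy sketch shows. It assumes HDR images normalized to [0, 1] with a peak value of 1.0 and µ = 5000; the function names are illustrative and the sketch does not cover PSNR-M or HDR-VDP-2.

```python
import numpy as np


def mu_law_tonemap(hdr, mu=5000.0):
    """mu-law compression of linear HDR values in [0, 1]."""
    return np.log1p(mu * hdr) / np.log1p(mu)


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def psnr_l(pred_hdr, gt_hdr):
    """PSNR in the linear HDR domain."""
    return psnr(pred_hdr, gt_hdr)


def psnr_mu(pred_hdr, gt_hdr):
    """PSNR after mu-law tonemapping of both images."""
    return psnr(mu_law_tonemap(pred_hdr), mu_law_tonemap(gt_hdr))


rng = np.random.default_rng(0)
gt = rng.random((8, 8, 3))
pred = np.clip(gt + 0.01 * rng.standard_normal(gt.shape), 0.0, 1.0)
print(f"PSNR-L: {psnr_l(pred, gt):.2f} dB, PSNR-mu: {psnr_mu(pred, gt):.2f} dB")
```

The two scores generally differ for the same prediction because the µ-law amplifies errors in dark regions relative to bright ones.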

Comparison With the State-of-the-Art Methods
We compare our method with recent state-of-the-art methods, including hand-crafted [11,15,28] and CNN-based methods [17][19][20][21][22], on Kalantari et al.'s dataset [19] in Section 4.4 and on datasets without ground truth images [11,24] in Section 4.5. For a fair comparison, we used the same environment, such as the training dataset and implementation details, for the CNN-based methods [17][19][20][21][22]. All results were obtained using the code provided by the original authors. Figure 6 presents visual comparisons of HDR images for the proposed method and the state-of-the-art methods on the testing set of the Kalantari HDR dataset [19]. The method proposed by Oh et al. [28] cannot detect object motion, resulting in large ghosting artifacts due to the misalignment. In particular, the results of Oh et al. [28] are strongly influenced by LDR images with low exposure values. HDR imaging methods using single images, such as TMO [17] and HDRCNN [22], cannot recover the details of ground truth HDR images, since they only use a single reference image. Among the CNN-based methods for fusing LDR images, Wu et al. [20] and Yan et al. [21] do not conduct alignment prior to merging. Therefore, they suffer from ghosting artifacts caused by misalignment. The method proposed by Yan et al. [21] generates more plausible results than that proposed by Wu et al. [20] because it uses attention maps, a mechanism similar to our spatial excitations. Although the method proposed by Kalantari et al. [19] conducts alignment prior to merging, it produces saturated results because it cannot suppress harmful features during the merging process. In contrast, our method is free from these artifacts and produces more plausible results than the other methods, since we conduct alignment and recalibrate features by leveraging the PIAM and DEB.

Quantitative Comparison.
We quantitatively evaluated recent state-of-the-art methods and our method on Kalantari's HDR dataset [19]. We tested the 15 images in the testing set, measured all of the evaluation metrics described above, and calculated the average values. The results are presented in Table 2. In terms of all of the evaluation metrics, our method yields the best HDR imaging results. This is mainly because our method can fully exploit all LDR features through alignment and recalibrate the features to highlight informative features and exclude harmful components.

Table 2. Quantitative comparisons of the proposed method with state-of-the-art methods on [19], where bold indicates the best performance.

Figure 7 presents visual comparisons of HDR images for the proposed method and the state-of-the-art methods on datasets without ground truth [24]. Oh et al.'s method [28] cannot detect large object motion, resulting in large ghosting artifacts. The methods relying on single images [17,22] and Kalantari et al.'s method [19] exhibit similar color distortions. Wu et al.'s method [20] yields color distortions and ghosting artifacts. The method proposed by Yan et al. [21] fails to preserve color consistency and generates ghosting artifacts due to misalignment. In contrast, our method generates visually plausible results that preserve details and color consistency without ghosting artifacts.

Figure 7. In the top section, we present the input LDR images, tonemapped HDR images produced by the proposed method, and LDR image patches. In the bottom section, we compare magnified patches of the HDR images generated by our method and the state-of-the-art methods. Ground truths are not included because these datasets do not provide them. The proposed method yields plausible results without ghosting artifacts or color distortions.

Ablation Studies
To verify the effectiveness of our network architecture, we conducted ablation studies to quantify the effects of the proposed pyramid inter-attention module (PIAM) and dual excitation block (DEB). Table 3 compares the performances of HDR imaging networks with different components in terms of the target evaluation metrics. It can be observed that all of the evaluation metrics decrease when the PIAM or DEB is not applied in our network (i.e., the baseline network). As shown in Figure 8, the PIAM finds reliable correspondences between LDR features. Conducting alignment using the PIAM increases performance because it provides well-aligned LDR features, and thus more precise information, to the merging network. Furthermore, the DEB also increases the performance of HDR imaging because it recalibrates features both spatially and channel-wise to boost the representation power of fused features for reconstructing an HDR image. In other words, it refines the fused features to make them more informative. With both the PIAM and DEB added to the baseline network, our method achieves the best performance.

To demonstrate the superiority of our alignment process using the PIAM for HDR imaging, we compared our method with the conventional optical flow algorithm [23] and the deep-learning-based flow estimation method [33] by measuring the accuracy of these correspondence methods. To measure matching accuracy, we compare the structural difference between warped images and reference LDR images on the testing set of Kalantari et al.'s dataset. Since the intensity of a reference-warped LDR image is different from that of the LDR reference image, we compared SSIM values. Figure 8 presents a qualitative comparison of alignment results for our method, SIFT-flow [23], and PWC-Net [33].
As shown in Figure 8, PWC-Net fails to find correspondences across large displacements between LDR images because it is designed to cover small displacements. Although SIFT-flow finds such correspondences, it cannot preserve the details around the boundary of the moving object in the warped image. In contrast to these methods, our method yields more reliable correspondences. In Table 4, it can be seen that the proposed PIAM yields more accurate alignment than the conventional optical flow algorithm [23] used in Kalantari et al.'s method [19], resulting in enhanced performance for HDR imaging.

Table 5 presents the run time comparisons between various methods. All algorithms were executed on a PC with an i7-4790K (4.0 GHz) CPU, 28 GB of RAM, and an Nvidia RTX 2080 GPU. It should be noted that the optimization-based HDR method [28] and the HDR method [19] that uses the classical optical flow algorithm [23] were executed on the CPU. Our method is slower than the other deep-learning-based methods except for Kalantari et al.'s method, which uses the conventional optical flow algorithm. Although the PIAM increases the run time of our method, our method is still approximately 60 times faster than Kalantari et al.'s method. It should be noted that the methods that are faster than ours do not contain alignment processes, which results in ghosting artifacts. Even though we conduct an alignment process, as Kalantari et al.'s method does, our method finds correspondences between LDR images more efficiently and effectively.

We also tested our method on cellphone images of both static and dynamic scenes to verify its practicality. For dynamic scenes, we tested with different types of motion, such as camera motion and object motion. The HDR results are presented in Figure 9. One can see that our network produces plausible results in various types of settings. The LDR images were captured using a Samsung Galaxy S20 device with different exposure values. The exposure values for the cellphone were {−4, −2, 0}, which differ from the training settings of the proposed method. Even with these different settings, the plausible results demonstrate the robustness of our network.
Figure 9. HDR results on cellphone images of static and dynamic scenes: (a) static scene, (b) dynamic scene (camera motion), and (c) dynamic scene (object motion). HDR results are aligned to the middle exposure. All LDR images were captured using a Samsung Galaxy S20 device.

Conclusions
We developed a novel end-to-end approach to reconstructing ghost-free HDR images of dynamic scenes. The proposed PIAM effectively aligns LDR features so that all of them can be exploited for HDR reconstruction, even when large motion exists. Additionally, the DEB recalibrates the aligned features by multiplying them by spatial and channel-wise excitations to boost their representation power. Ablation studies clearly demonstrated the effectiveness of the PIAM and DEB in our model. Finally, we have demonstrated that the proposed method is robust to dynamic scenes with large foreground motion and outperforms state-of-the-art methods on standard benchmarks by a significant margin.