Cyclic Learning-Based Lightweight Network for Inverse Tone Mapping

Recent studies on inverse tone mapping (iTM) have moved toward indirect mapping, which generates a stack of low dynamic range (LDR) images with multiple exposure values (multi-EV stack) and then merges them. To generate multi-EV stacks, several large-scale networks with more than 20 M parameters have been proposed, but their high dynamic range (HDR) reconstruction and multi-EV stack generation performance were not acceptable. Also, some previous methods using cycle consistency must even train additional networks that are not used for multi-EV stack generation, which requires large memory during training. Thus, this paper proposes novel cyclic learning based on cycle consistency to reduce the memory burden in training. In detail, we eliminate the networks used only for training, so the proposed method enables efficient learning in terms of training-purpose memory. In addition, this paper presents a lightweight iTM network that dramatically reduces the sizes of the existing networks. In fact, the proposed lightweight network requires only about 1/100 of the parameters of the state-of-the-art (SOTA) method, which contributes to the practical use of iTM. As a result, the proposed method reliably generates a multi-EV stack with a lightweight network. Experimental results show that the proposed method achieves quantitatively SOTA performance and is qualitatively comparable to conventional indirect iTM methods.


Introduction
With the rapid development of deep learning, a lot of methods for reconstructing a high dynamic range (HDR) image from low dynamic range (LDR) image(s) have been proposed [1][2][3][4]. They can be largely divided into two categories. The first one is the multiexposure fusion (MEF) approach [5][6][7] in which LDR images of different exposure values (EVs) are acquired and merged to generate a single HDR image. Conventional MEF methods often suffer from ghost artifacts due to moving object(s) while acquiring multiple LDR images with parallax. The second one is so-called inverse tone mapping (iTM) [8][9][10][11][12][13][14], which reconstructs an HDR image using only a single LDR image.
Meanwhile, the iTM approach is again classified into direct iTM and indirect iTM. Direct iTM is literally a one-to-one tone mapping between LDR and HDR [8,10,14]. Whereas direct iTM uses only one LDR, indirect iTM synthesizes LDRs of multiple EVs (multi-EV stack) from a single LDR and merges them to generate an HDR [9, [11][12][13]. For example, deep chain HDRI [12] employs a strategy to allocate a subnetwork for each target EV. As such, the subnetworks tend to increase as much as the number of target EVs, which can be computationally burdensome. On the other hand, DrTMO [9], deep recursive HDRI [11], and deep cycle HDRI [13] generate a multi-EV stack by using EV up/down networks in consideration of the increasing/decreasing direction toward a target EV. In addition, [15] proposed generating HDR images using LDR video sequence information. As iTM methods develop, the network parameter size has also increased. Unfortunately, this growth in parameters hinders the practical use of iTM. The contributions of this paper are summarized as follows:

• We propose a new learning method, i.e., cyclic learning, to train EV up/down networks with less training memory than existing multi-EV stack generation networks based on cycle consistency.
• This paper demonstrates the practical applicability of deep learning-based iTM by presenting a lightweight network structure compared to existing iTM methods while maintaining reliable performance.
This paper is organized as follows. Section 2 describes the related work to understand the proposed method. Section 3 describes in detail the core elements of the proposed method. Section 4 shows experimental results through comparison with existing methods. Finally, Section 5 concludes this paper, and in the Appendix A, experimental results not covered in the main body are additionally explained.

Direct Inverse Tone Mapping
Direct iTM methods focus on extracting as much information as possible from a single LDR. For instance, HDRCNN [8], the first deep learning-based iTM, generated an HDR image by linearly combining the reconstructed saturated region and the image to which the inverse camera response function (CRF) was applied. ExpandNet [10] fused various features extracted through several branches. Still, such an LDR-to-HDR mapping in a one-to-one fashion inevitably faces restrictions on the domain information available in the saturated regions.
Recently, SingleHDR [14] dissected the LDR-generation process into several steps by reversing the image formation pipeline and assigning a neural network to each step of the inverse process. SingleHDR outperformed previous direct iTM methods, but it still suffers from insufficient luminance reconstruction as well as halo artifacts.

Indirect Inverse Tone Mapping
To solve the aforementioned intrinsic problems of direct iTM, indirect iTM methods tried to generate LDRs of multi-EVs, i.e., a multi-EV stack. DrTMO [9] generated a multi-EV stack through EV up/down networks and merged the generated LDRs using Debevec's algorithm [17]. Deep chain HDRI [12] set a total of six target EVs from EV −3 to +3 based on an input image of EV 0, and then configured a subnetwork per EV. Then, an HDR image was generated by applying Debevec's algorithm, in the same way as DrTMO, to the multi-EV stack obtained from the subnetworks. Deep recursive HDRI [11] produced a multi-EV stack by recursively operating the EV up/down networks considering only the increasing/decreasing direction of a target EV. The subsequent deep cycle HDRI [13] introduced cycle consistency to promote stability of multi-EV stack generation in GAN-based training of deep recursive HDRI. However, previous indirect iTM approaches have limitations, such as halo artifacts and color distortion, even with considerable parameter size. In addition, the existing learning methods using cycle consistency are inefficient in terms of training-purpose memory, because they require an auxiliary network that is used only for learning.

Inverse Tone Mapping with Cycle Consistency
The concept of cycle consistency came from the CycleGAN architecture [16]. Cycle consistency means that an image output from the first generator can be fed to the second generator, and the output of the second generator should match the original image; the reverse also holds. The cycle consistency loss is defined by

L_cyc(G_1, G_2) = ||G_2(G_1(x)) − x||_1 + ||G_1(G_2(y)) − y||_1,    (1)

where x and y indicate an input and the ground truth (GT), respectively. In this paper, x and y correspond to the input LDR and the LDR of the target EV, respectively. G_1 : x → y and G_2 : y → x stand for two different generators with opposite objectives; in this paper, they correspond to the EV up and EV down networks, respectively. x → G_1(x) → G_2(G_1(x)) ≈ x is called forward-cycle consistency, and y → G_2(y) → G_1(G_2(y)) ≈ y is called backward-cycle consistency. Both networks are updated and trained at the same time, but G_2 is used only as an auxiliary tool to help G_1 learn. For convenience, let the learning mechanism using cycle consistency [16], as in deep cycle HDRI [13], be called cycle learning. If cycle learning is applied to the EV up/down networks, G_1 can be regarded as the EV up network, and G_2 can be regarded as the EV down network. When training the EV up network, cycle learning trains an EV down network simultaneously, but this EV down network is not used in the inference phase. The opposite is also true, i.e., cycle learning requires four trained networks to generate a multi-EV stack.
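As a concrete illustration, the cycle consistency loss above can be sketched in a few lines of NumPy. The toy "generators" below (a pure gain of 2 per EV stop in linear light) are stand-ins of our own, not the paper's networks; with perfectly inverse generators the loss is exactly zero.

```python
import numpy as np

def cycle_consistency_loss(g_up, g_down, x, y):
    """L1 cycle-consistency loss in the spirit of CycleGAN [16]:
    forward cycle  x -> g_up(x) -> g_down(g_up(x)) ~ x,
    backward cycle y -> g_down(y) -> g_up(g_down(y)) ~ y."""
    forward = np.mean(np.abs(g_down(g_up(x)) - x))
    backward = np.mean(np.abs(g_up(g_down(y)) - y))
    return forward + backward

# Toy "generators": a one-stop EV change modeled as a gain of 2 in linear light.
ev_up = lambda img: img * 2.0
ev_down = lambda img: img / 2.0

x = np.linspace(0.0, 0.5, 8)  # stand-in for the LDR of EV i
y = 2.0 * x                   # stand-in for the LDR of EV i+1
loss = cycle_consistency_loss(ev_up, ev_down, x, y)  # exactly 0 for perfect inverses
```

In practice the two generators are imperfect networks, so this loss is nonzero and penalizes any drift accumulated over the up/down round trip.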
In practice, using four networks to train only two networks is a waste of memory. Therefore, as a learning strategy to solve this problem, we propose a novel learning method that does not require any auxiliary network, even though it uses cycle consistency. In other words, the proposed method not only trains two networks simultaneously with cycle consistency but also generates a multi-EV stack using both the EV up network and the EV down network in the inference phase (see Figure 1). Therefore, the proposed learning, named cyclic learning, is more efficient than cycle learning in terms of training-purpose memory. We cover the detailed explanation of cyclic learning in Section 3.1.


Figure 1.
Training and inference phases of cycle learning and cyclic learning. Blue and yellow blocks mean EV up and down networks, respectively. Here, the dotted block indicates an auxiliary network that is necessary only for learning. Blocks of the same color represent the same network. Also, both cycle learning and cyclic learning create a multi-EV stack through recursive use of the inference phase of (c). i indicates relative EV.

Methods
The operation of conventional indirect iTM methods is illustrated in Figure 2. By recursively applying the trained EV up/down networks, a multi-EV stack with EV i is generated, where i = −3, −2, · · ·, 2, 3, and the LDR images are merged by Debevec's algorithm [17]. Section 3.1 depicts the concept of cyclic learning for EV up/down networks and the training procedure. Section 3.2 describes the lightweight architecture of the EV up/down networks and the details of the loss functions for learning.

Figure 2. Overview of recent indirect iTM methods. Curved arrows indicate the recursive usage of EV up/down networks. Debevec's algorithm [17] was used to merge a multi-EV stack for fair comparison with the previous works.
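The recursive stack generation described above can be sketched as follows. The one-stop gain-of-2 stand-ins for the trained EV up/down networks are illustrative assumptions of ours, not the paper's models; only the recursion pattern (EV i from EV i−1, EV −i from EV −(i−1)) mirrors Figure 2.

```python
import numpy as np

def generate_multi_ev_stack(ldr, ev_up, ev_down, max_ev=3):
    """Recursively apply EV up/down networks to a single LDR of EV 0,
    producing a stack for EV i, i = -max_ev, ..., max_ev."""
    stack = {0: ldr}
    for i in range(1, max_ev + 1):
        stack[i] = ev_up(stack[i - 1])        # EV i   from EV i-1
        stack[-i] = ev_down(stack[-(i - 1)])  # EV -i  from EV -(i-1)
    return stack

# Toy stand-ins for the trained networks: one EV step = gain of 2, clipped to [0, 1].
ev_up = lambda img: np.clip(img * 2.0, 0.0, 1.0)
ev_down = lambda img: np.clip(img / 2.0, 0.0, 1.0)

stack = generate_multi_ev_stack(np.full((4, 4), 0.1), ev_up, ev_down)
```

The resulting seven LDRs (EV −3 to +3) would then be merged into an HDR image, e.g., with Debevec's algorithm [17].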


Cyclic Learning
If cycle consistency is applied to the multi-EV stack generation step, the noise amplification problem that may occur in the underexposed/overexposed regions can be mitigated [5]. Although the previous cycle learning uses only two networks to generate multi-EV stacks, it should train a total of four networks in the learning phase. In practice, the conventional learning method using cycle consistency used an auxiliary EV down network for training the EV up network and an auxiliary EV up network for training the EV down network. However, the auxiliary networks were not used in the inference phase. The proposed cyclic learning trains only two networks, and these are both used in the inference phase. In other words, the proposed cyclic learning can mitigate the memory-waste problem during the learning phase. This section describes the proposed cyclic learning in detail.
To eliminate unnecessary memory from the learning process of cycle learning, we propose a new learning method for EV up/down networks. When applying the existing concept of cycle learning to iTM, the contradictory objectives of increasing and decreasing EVs are both required for the inference phase. In addition, since the existing cycle learning method requires an auxiliary network for each of the two networks, they are inefficient in terms of memory. Therefore, we propose a way to simultaneously train two networks with cycle consistency and use both networks in the inference phase.
Before explaining the proposed method in detail, let us consider a naïve way first. The EV up network for increasing EV is trained to generate an LDR of EV i + 1 from an LDR of EV i, and the EV down network is trained in the opposite direction. Then, assume that EV up/down networks are simultaneously updated and trained according to Equation (1). If an input image goes through the EV down network after the EV up network, the EV down network will apply the re-estimation process to the image estimated by the EV up network. Here, the update of the EV down network may cause undesirable training, resulting in such problems as over-sharpness in the output image. For example, in the somewhat dark image of EV−3 in Figure 3, we observe pixel-wise saturation due to over-sharpness. If such images are included in the merging process, the same phenomenon may even occur in the HDR image.
To solve this problem, we propose updating the two networks alternately in a single iteration while training them simultaneously (see Figure 4). Figure 4a shows the updating process of the EV up network. Here, the EV down network uses the weights of the previous iteration as they are. Training for updating the EV up network is based on two losses: (1) the EV up network loss, defined by the distance between the inferred LDR of EV i + 1 and the GT of EV i + 1, and (2) the forward-cycle consistency loss, defined by the distance between the re-inferred LDR of EV i (derived from the EV down network) and the GT of EV i. On the other hand, the training of the EV down network in Figure 4b is based on (1) the EV down network loss and (2) the backward-cycle consistency loss. Section 3.2 describes the mathematical definitions of the aforementioned losses.
Also, Section 4.4 experimentally proves that the proposed cyclic learning realizes cycle consistency more effectively than the conventional cycle learning.
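To make the alternating schedule concrete, here is a toy sketch in which each "network" is a single scalar gain and gradients are taken numerically. The data, learning rate, and squared-error loss shapes are our own assumptions, chosen only to show the update pattern: step (a) updates the EV up parameter while the EV down parameter is held at its previous value, and step (b) does the reverse.

```python
import numpy as np

def num_grad(f, p, eps=1e-6):
    """Central-difference gradient of a scalar loss f at parameter p."""
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)

# Toy scalar "networks": the EV up net multiplies by u, the EV down net by d.
# Data: x is an LDR of EV i and y = 2x its EV i+1 counterpart.
x, y = 1.0, 2.0
u, d = 1.0, 1.0  # initial gains
lr = 0.01

for _ in range(2000):
    # (a) update the EV up network; the EV down network keeps its previous weights
    up_loss = lambda u_: (u_ * x - y) ** 2 + (d * (u_ * x) - x) ** 2  # net + forward cycle
    u -= lr * num_grad(up_loss, u)
    # (b) update the EV down network; the EV up network is now held fixed
    down_loss = lambda d_: (d_ * y - x) ** 2 + (u * (d_ * y) - y) ** 2  # net + backward cycle
    d -= lr * num_grad(down_loss, d)
```

Under this toy setup the alternation drives u toward 2 (one stop up) and d toward 0.5 (one stop down), i.e., both networks are trained and both are usable at inference, with no auxiliary network.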

Lightweight Network
Conventional indirect inverse tone mapping (iTM) methods have more than 20 M parameters for exposure value (EV) up/down networks [8,9,[11][12][13][14]. As such, in order to build a baseline toward the lightweight network, we adopt WDSR-A residual block (WARB) [18] which is known to have good cost-effectiveness in the super-resolution field. The original WARB was used with weight normalization (WN) [19]. It is known that WN is effective at high learning rates, but does not work at low learning rates. This results in an increase in learning time. To solve this problem, we adopt WARB without WN. The proposed method does not suffer from the so-called convergence problem, because it is  To solve this problem, we propose updating two networks alternately in a single iteration and training them simultaneously (see Figure 4). Figure 4a shows the updating process of the EV up network. Here, the EV down network uses the weights of the previous iteration as they are. Training for updating the EV up network is based on two losses: (1) The EV up network loss is defined by the distance between the inferred LDR of EV i + 1 and the GT of EV i + 1, and (2) the forward-cycle consistency loss is defined by the distance between the re-inferred LDR of EV i (derived from the EV down network) and the GT of EV i. On the other hand, the training of the EV down network in Figure 4b is based on (1) the EV down network loss and (2) the backward-cycle consistency loss.  Section 3.2 describes the mathematical definitions of the aforementioned losses. Also, Section 4.4 experimentally proves that the proposed cyclic learning realizes more effectively cycle consistency than the conventional cycle learning.

Lightweight Network
Conventional indirect inverse tone mapping (iTM) methods have more than 20 M parameters for exposure value (EV) up/down networks [8,9,[11][12][13][14]. As such, in order to build a baseline toward the lightweight network, we adopt WDSR-A residual block (WARB) [18] which is known to have good cost-effectiveness in the super-resolution field. The original WARB was used with weight normalization (WN) [19]. It is known that WN is effective at high learning rates, but does not work at low learning rates. This results in an increase in learning time. To solve this problem, we adopt WARB without WN. The proposed method does not suffer from the so-called convergence problem, because it is

Lightweight Network
Conventional indirect inverse tone mapping (iTM) methods have more than 20 M parameters for their exposure value (EV) up/down networks [8,9,[11][12][13][14]. As such, in order to build a baseline toward a lightweight network, we adopt the WDSR-A residual block (WARB) [18], which is known to be cost-effective in the super-resolution field. The original WARB was used with weight normalization (WN) [19]. It is known that WN is effective at high learning rates but does not work well at low learning rates, which increases the learning time. To solve this problem, we adopt WARB without WN. The proposed method does not suffer from the so-called convergence problem, because it is based on residual learning and the network depth is not so deep. In addition, the proposed method solves the noise amplification problem by introducing cycle consistency into learning, which consequently alleviates overfitting. Therefore, we can safely remove the WN layer while pursuing both a lightweight design and cyclic learning.
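A minimal sketch of a WARB-noWN block is given below. For brevity we assume 1 × 1 (pure channel-mixing) convolutions rather than the paper's spatial kernels, so each convolution reduces to a matrix product over the channel axis; the wide-activation structure (expand, ReLU, reduce, residual add) is what the sketch is meant to show.

```python
import numpy as np

def warb_no_wn(x, w_expand, w_reduce):
    """Wide-activation residual block without weight normalization (WARB-noWN).
    x is an (H, W, C) feature map; the two "convolutions" are 1x1 here."""
    h = np.maximum(x @ w_expand, 0.0)  # widen the channels, then ReLU
    return x + h @ w_reduce            # narrow back and add the identity skip

rng = np.random.default_rng(0)
c, expansion = 8, 4  # "wide activation": 4x channel expansion inside the block
w1 = rng.normal(scale=0.1, size=(c, c * expansion))
w2 = rng.normal(scale=0.1, size=(c * expansion, c))
x = rng.normal(size=(16, 16, c))
y = warb_no_wn(x, w1, w2)  # same shape as x thanks to the residual connection
```

Because the block is residual, stacking a few of them (five in the proposed network) keeps optimization stable even without WN.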
To further increase the HDR reconstruction performance, we place a luminance compensation module in front of the network (see Figure 5). This structure is based on the fact that the task of increasing/decreasing EV can be divided into compensation of global luminance and restoration of details. If the global luminance of an input image is properly compensated in advance, the subsequent network can concentrate on reconstructing the lost information or details during training. In this paper, the luminance compensation module is implemented in a simple and intuitive way: the average of the luminance differences of images with an EV gap of 1 is defined as a single learnable parameter. The trained parameter, i.e., a sort of luminance offset, is added to each input image. Note that the parameter is positive in the EV up network and negative in the EV down network.

Thus, as shown in Figure 5, the EV up/down network is composed of one luminance compensation module, five WARBs without WN (WARB-noWN), and two 3 × 3 convolution layers to adjust the number of input/output channels. The total losses for training the EV up/down networks are defined by

L_up = L^u + L^f,   L_down = L^d + L^b,   where L^* = λ_1 L^*_pix + λ_2 L^*_gd for * ∈ {u, d, f, b}.

As mentioned in Section 3.1, each total loss is composed of an EV up/down network loss (L^u, L^d) and a forward/backward consistency loss (L^f, L^b) for cycle consistency. Each loss is again composed of a pixel-wise L1 loss (L^*_pix) and a gradient difference loss (L^*_gd) [20] to prevent blur in output images. The two coefficients λ_1 and λ_2 were experimentally set to 0.2 and 0.8, respectively. Note that L^u, L^d and L^f, L^b are composed in the same way. L^u and L^d are calculated based on the distance between the network output and the GT. The pixel-wise losses of forward/backward consistency are defined by

L^f_pix = ||D(U(x)) − x||_1,   L^b_pix = ||U(D(y)) − y||_1,

and the gradient difference losses of forward/backward consistency are defined by

L^f_gd = ||∇_u D(U(x)) − ∇_u x||_1 + ||∇_v D(U(x)) − ∇_v x||_1,
L^b_gd = ||∇_u U(D(y)) − ∇_u y||_1 + ||∇_v U(D(y)) − ∇_v y||_1,

where U(·) and D(·) represent the EV up/down networks, respectively, and x and y denote LDR images of EV i and EV i + 1. Here, the input of the EV down network is y. Also, u and v refer to the vertical and horizontal directions, and ∇_u and ∇_v stand for the vertical and horizontal gradients.
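The gradient difference terms can be sketched directly from the definitions above. `np.diff` plays the role of ∇_u and ∇_v here; this is an implementation choice of ours, since the paper does not specify the discrete gradient operator.

```python
import numpy as np

def gradient_difference_loss(pred, gt):
    """L1 gradient difference loss [20]: compare the vertical (u) and
    horizontal (v) gradients of the prediction and the ground truth."""
    grad_u = lambda im: np.diff(im, axis=0)  # vertical finite difference
    grad_v = lambda im: np.diff(im, axis=1)  # horizontal finite difference
    return (np.mean(np.abs(grad_u(pred) - grad_u(gt)))
            + np.mean(np.abs(grad_v(pred) - grad_v(gt))))

flat = np.zeros((8, 8))
ramp = np.tile(np.arange(8.0), (8, 1))  # horizontal ramp: unit horizontal gradient
```

Note that the loss is invariant to a global brightness offset (its gradients are zero), which is exactly why it is paired with the pixel-wise L1 term rather than used alone.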

Experimental Setup
For learning EV up/down networks, we adopted Fairchild [21], an open dataset composed of multi-EV stack GTs with an EV gap of 1. The training-purpose Fairchild dataset has a total of 105 multi-EV stacks photographed through a Nikon DX2. That is, it consists of 105 × 7 images. Each image in the dataset was cropped to 512 × 512 during training and was used in patch units. The learning rate was set to 5 × 10 −5 , and an Adam optimizer [22] with β 1 = 0.5, β 2 = 0.999 was used. The batch size was set to 2, and the number of training epochs was set to 10.
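The per-iteration patch extraction can be sketched as follows. We assume the same crop window is applied to every exposure of a stack so that the up/down targets stay aligned; the paper states only that images were cropped to 512 × 512 and used in patch units.

```python
import numpy as np

def random_patch(stack, patch=512, rng=None):
    """Crop one random patch x patch window, applied identically to every
    exposure of a multi-EV stack, for patch-unit training."""
    rng = rng or np.random.default_rng()
    h, w = stack[0].shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    return [img[top:top + patch, left:left + patch] for img in stack]

# A toy 7-exposure stack (EV -3..+3) of 600 x 700 RGB images.
stack = [np.zeros((600, 700, 3)) for _ in range(7)]
patches = random_patch(stack, patch=512, rng=np.random.default_rng(1))
```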

Quantitative Results
First, let us quantitatively evaluate the HDR reconstruction performance of the proposed method. Among the direct iTM, the indirect iTM, and the cycle learning-based tone mapping methods, we chose a few that can be directly compared with the proposed method. However, an ideal quantitative evaluation of multi-EV stack generation performance is impossible due to the limitations of the dataset. As such, we employed the HDR-VDP Q score [14,24], a representative metric for evaluating HDR reconstruction performance. In addition, PSNRs and SSIMs of the images tone mapped (γ = 2.2) by [25,26] were compared. In general, the PSNR values tend to be low overall because tone mapping causes significant contrast changes. However, since PSNR improvement is still a measure of image quality improvement to some extent, PSNR is also adopted as a metric in this paper. HDR images are greatly affected by the TMO, which may also introduce distortion, so we adopted two TMOs for comparison with conventional methods. Results using additional TMOs are provided in Appendix A. Table 1 shows the quantitative evaluation results. The results of deep chain HDRI [12] and deep cycle HDRI [13] were quoted from the respective papers as reported, and the rest from openly released results. It is noteworthy that the proposed method shows the best HDR reconstruction performance in terms of the HDR-VDP Q score: its Q score is 0.7 or more higher than that of deep cycle HDRI [13], the best among the existing methods. Also, the proposed method is the best even in terms of PSNR and SSIM for both TMOs. Furthermore, the proposed method has a smaller parameter size than the SOTA iTM methods [13,14]; for example, it is only 1/100 of deep cycle HDRI [13]. Therefore, the proposed method provides SOTA performance in terms of HDR reconstruction while keeping almost the minimum parameter size.
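For reference, the PSNR comparison on tone mapped images can be sketched as follows. The simple power-law (γ = 2.2) mapping below is a stand-in of ours for the more elaborate tone mapping operators of [25,26]; it is meant only to show the pipeline of mapping HDR values to display range before computing PSNR.

```python
import numpy as np

def gamma_tonemap(hdr, gamma=2.2):
    """Simple power-law display mapping applied before PSNR/SSIM comparison."""
    return np.clip(hdr, 0.0, None) ** (1.0 / gamma)

def psnr(a, b, peak=1.0):
    """PSNR in dB between two images with values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 0.5)
b = np.full((8, 8), 0.6)  # constant error of 0.1 -> MSE 0.01 -> 20 dB
score = psnr(gamma_tonemap(a ** 2.2), gamma_tonemap(b ** 2.2))
```

Because the tone mapping compresses contrast, the same reconstruction error can yield different PSNR values under different TMOs, which is why results under two TMOs are reported.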

Qualitative Results
This section qualitatively evaluates the HDR reconstruction and the multi-EV stack generation of the proposed method. We compared the tone mapped images generated by Reinhard's method [25] of HDR-Toolbox [27]. Let us take a look at the first example at the top of Figure 6, where magnified images are given together for clear comparison. DrTMO [9] and deep recursive HDRI [11] failed to reconstruct the saturated region(s), causing artifacts as a whole. Other methods also showed halo artifacts at the boundary between trees and sky. On the other hand, we can observe that the proposed method is robust to artifacts and also reconstructs the saturated region successfully. While the first example showed the reconstruction performance in the saturated region, the second example at the bottom of Figure 6 is presented to evaluate the reconstruction of dark regions and quantization artifacts. For example, DrTMO [9] and HDRCNN [8] caused remarkable quantization artifacts. ExpandNet [10], deep recursive HDRI [11], and SingleHDR [14] showed relatively few quantization artifacts but were not satisfactory in terms of reconstruction of dark areas. In contrast, the proposed method reconstructed dark areas close to the GTs and hardly suffered from quantization artifacts. As mentioned earlier, the images in Figure 6 were tone mapped by Reinhard's TMO [25] for demonstration purposes.
Next, let us evaluate the multi-EV stack generation performance of the proposed method. For this experiment, we compared the proposed method with deep recursive HDRI [11], whose implementation is available among the conventional indirect iTM methods [11][12][13]. For an input image (top of Figure 7), the EV up/down networks created images of EV −3, −2, −1, +1, +2, +3 from left to right. In the images of EV −3, −2, −1, [11] rendered the stairs as if they were the sky and gradually shifted their color tone to sky blue. On the other hand, the proposed method reconstructed the stairs while properly maintaining their texture. Also, the proposed method reconstructed the sculpture area more successfully than [11]. Even for the images of EV +1, +2, +3, the proposed method generated stable brightness without losing color in the walls and flowers. Therefore, the proposed method shows comparable qualitative performance even with a significantly smaller network size. In particular, the proposed method provides outstanding performance in terms of artifact robustness in both HDR reconstruction and multi-EV stack generation. More qualitative results are given in Appendix A.

Effect of Each Component of the Proposed Method
This section analyzes the effect of each technique applied to the proposed method on the overall performance. Table 2 quantitatively shows the effect of applying cyclic learning, removing weight normalization (WN) from the WARB, and adding the luminance compensation module, one by one. The baseline in this experiment is a model consisting of EV up/down networks with only 5 WARBs (see the first row of Table 2). Compared to the conventional methods in Table 1, it is worth noting that the baseline was already competitive, which experimentally shows that the WARB is well suited to multi-EV stack generation. First, applying cyclic learning to the baseline improved the HDR reconstruction performance by about 0.09 (second row), because cycle consistency was well maintained during training. Next, removing WN from the WARB improved the overall performance by an additional 0.21 (third row), because the network could then learn diverse luminance distributions. Finally, applying the luminance compensation module brought a further improvement of 0.12 (last row). As a result, the proposed method achieved a Q-score improvement of 0.42 over the baseline.
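For clarity on the WN ablation: weight normalization reparameterizes each weight vector as a learned magnitude times a fixed-norm direction, which constrains the scale of the layer's response. Removing it frees that scale, consistent with the observation above that the network can then learn diverse luminance distributions. A minimal sketch of the reparameterization itself (illustrative only; the WARB internals are not specified in this excerpt):

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||.
    The direction comes from v; the magnitude is the learned scalar g."""
    return g * v / np.linalg.norm(v)
```

With WN, the effective weight norm is pinned to `g` regardless of `v`; without it, the norm is unconstrained during training.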

Realization of Cycle Consistency
This section evaluates whether the proposed cyclic learning realizes cycle consistency more successfully than conventional cycle learning. In this experiment, the proposed cyclic learning was compared with cycle learning using cycle consistency, as well as with the baseline that did not use cycle consistency. Figure 8 shows images of EV 0, +1, +2 for each method. The EV 0, +1, +2 images in Figure 8 were regenerated by the EV down network from the EV +1, +2, +3 images generated by the EV up network. By comparing the images that sequentially pass through the EV up/down networks with the GTs, we can evaluate how well each method maintains cycle consistency. Consider EV 0 in Figure 8. The baseline differs significantly from the GT, especially in the sky area. Cycle learning reconstructed the sky region more similarly to the GT than the baseline did, but caused many artifacts. In contrast, the proposed cyclic learning was the most similar to the GT, and no artifacts were observed.
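The evaluation above measures how close an image gets to its origin after passing through the EV up and EV down networks in sequence. A cycle-consistency penalty of this form can be sketched as follows; the L1 distance is an assumption here, since the exact loss is not specified in this excerpt.

```python
import numpy as np

def cycle_consistency_l1(x, ev_up, ev_down):
    """L1 cycle loss: x -> ev_up -> ev_down should return to x.
    ev_up / ev_down stand in for the EV up/down networks."""
    x_cyc = ev_down(ev_up(x))
    return np.mean(np.abs(x_cyc - x))
```

A perfectly consistent pair of networks drives this loss to zero, which is the property inspected visually in Figure 8.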

Conclusions
This paper proposed cycle consistency-based learning for inverse tone mapping. With the proposed cyclic learning, the EV up/down networks can be trained simultaneously without wasting memory during training. In addition, we presented a lightweight network requiring only 1/100 of the parameters of the existing SOTA network. Experimental results show that the proposed method provides quantitatively SOTA performance in terms of HDR reconstruction and is also competitive in terms of subjective visual quality.

Future Work
This paper proposed a method to generate a stable multi-EV stack with a small network size for lightweight deployment. Expanding the information contained in the limited LDR input is very important in the inverse tone mapping process; as a tool for generating virtual LDR images, the approach presented in Yang et al. [28] could be employed. Meanwhile, the merging process still has the limitation that it uses a rule-based method, as the existing methods do. If the merging module could also be replaced with a deep learning-based network, such as in [29], further technological advances could be achieved.
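The rule-based merging mentioned above is commonly realized as a weighted average of radiance estimates across the stack, in the style of Debevec and Malik. The sketch below is illustrative only; the exact merge rule and weighting used in this paper may differ, and a pure gamma of 2.2 is assumed for linearization.

```python
import numpy as np

def merge_stack(ldr_stack, exposures, gamma=2.2, eps=1e-8):
    """Rule-based HDR merge: hat-weighted average of per-image radiance.
    ldr_stack: (N, H, W[, C]) images in [0, 1]; exposures: relative exposures."""
    ldr = np.asarray(ldr_stack, dtype=np.float64)
    num = np.zeros(ldr.shape[1:])
    den = np.zeros(ldr.shape[1:])
    for img, t in zip(ldr, exposures):
        w = 1.0 - np.abs(2.0 * img - 1.0)    # hat weight: trust mid-tones
        radiance = np.power(img, gamma) / t  # linearize, normalize by exposure
        num += w * radiance
        den += w
    return num / (den + eps)
```

Replacing this fixed weighting with a learned merge network, as suggested above, is one direction for removing the rule-based limitation.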
Author Contributions: The work described in this article is the collaborative effort of all authors. All authors contributed to data processing and designed the algorithm. All authors made contributions to data measurement and analysis. All authors participated in the writing of the paper. All authors have read and agreed to the published version of the manuscript.

Appendix A
Figure A2. HDR reconstruction performance comparison (a-g). Here, all images are tone mapped by three TMOs and gamma corrected for demonstration.
In Figures A1 and A2, tone mapped HDR images differ slightly according to TMO. Note that even though the tone mapped image is similar to the tone mapped GT, there may be many blue regions in the P map. From these three examples, we can find that the proposed method provides comparable performance from the perspective of tone mapped HDR.

Appendix B
This section evaluates multi-EV stack generation performance in terms of the perception index (PI) [31], a no-reference metric defined for subjective quality evaluation. When evaluating multi-EV stack generation, no full-reference metric can be used because GTs of the HDR-eye multi-EV stacks are not available; a no-reference metric such as PI can therefore provide additional information. In Table A1, the tendency of the proposed method differs from that of deep recursive HDRI [11] for bright and dark images. Also, we present more comparison results for multi-EV stack generation. As in the paper, Figures A3 and A4 show the EV−3, −2, −1, +1, +2, +3 images generated by each method and provide magnified ROIs. In all three examples, the proposed method generates a multi-EV stack more stably than deep recursive HDRI [11]. Here, stability has different meanings for the EV up/down networks. In the EV−3, −2, −1 images generated by the EV down network, stability means that no artifacts appear in the saturated regions. In the EV+1, +2, +3 images generated by the EV up network, stability means that colors do not change and no stains appear. From this point of view, the proposed method outperforms deep recursive HDRI in terms of multi-EV stack generation performance.
Figure A3. Multi-EV stack generation performance comparison. The center image is an input, the left three images are the results from the EV down network, and the right three images are the results from the EV up network. EV increases from the far left (EV−3) to the right. Please refer to the magnified regions below each example for detail.
Figure A4. Multi-EV stack generation performance comparison. The center image is an input, the left three images are the results from the EV down network, and the right three images are the results from the EV up network. EV increases from the far left (EV−3) to the right. Please refer to the magnified regions below each example for detail.