Multi-Exposure Image Fusion Techniques: A Comprehensive Review

: Multi-exposure image fusion (MEF) is emerging as a research hotspot in the ﬁelds of image processing and computer vision, which can integrate images with multiple exposure levels into a full exposure image of high quality. It is an economical and effective way to improve the dynamic range of the imaging system and has broad application prospects. In recent years, with the further development of image representation theories such as multi-scale analysis and deep learning, signiﬁcant progress has been achieved in this ﬁeld. This paper comprehensively investigates the current research status of MEF methods. The relevant theories and key technologies for constructing MEF models are analyzed and categorized. The representative MEF methods in each category are introduced and summarized. Then, based on the multi-exposure image sequences in static and dynamic scenes, we present a comparative study for 18 representative MEF approaches using nine commonly used objective fusion metrics. Finally, the key issues of current MEF research are discussed, and a development trend for future research is put forward.


Introduction
Brightness in a natural scene usually varies greatly. For example, sunlight is about 10 5 cd/m 2 , room light is about 10 2 cd/m 2 , and starlight is about 10 −3 cd/m 2 [1]. Owing to the limitations of imaging devices, the dynamic range of a single image is much lower than that of a natural scene [2]. The shooting scene may be affected by light, weather, solar altitude, and other factors. Overexposure and underexposure often occur. A single image cannot fully reflect the light and dark levels of the scene, and some information may be lost, resulting in unsatisfactory imaging. Solving the problem of incomplete dynamic range matching in existing imaging equipment, display monitors, and the human eye's dynamic response to real natural scenes is still challenging.
There are generally two ways to broaden the dynamic range of imaging detectors: hardware design and software technology. For the former, the CCD or CMOS detector needs to be redesigned, and a new optical modulation device may need to be introduced. Aggarwal [3] realized a camera design by dividing the aperture into multiple parts and using a set of mirrors to direct the light emitted by each piece to different directions. Tumblin [4] described a camera to measure the static gradient rather than static intensity and appropriately quantify the difference to capture HDR images. This kind of method can directly improve the efficiency of exposure quantity and imaging quality, but they are expensive and their practicability is limited. Through software technology, some researchers reconstruct the high dynamic range (HDR) image using the camera response function (CRF). Then, the HDR image can be displayed on the ordinary display device through tone mapping (TM). Others adopt MEF technology directly, fusing the input images with different exposure levels into an image with rich information and vivid colors, which do not need to consider camera curve calibration, HDR reconstruction, and tone mapping, as shown in Figure 1. Compared with the first way, MEF technology provides a simple, economical, and efficient manner to overcome the contradiction between HDR imaging and a low dynamic range (LDR) display. It avoids the complexity of imaging hardware circuit design and reduces the weight and power consumption of the whole device. It improves image quality and has essential application significance. MEF is a branch of image fusion, similar to other image fusion tasks [5]; for example, multi-focus image fusion, visible and infrared image fusion, PET and MRI medical image fusion, multispectral and panchromatic remote sensing image fusion, hyperspectral and multispectral remote sensing image fusion, and optical and SAR remote sensing image fusion. They combine multidimensional content from multiple-source images to generate high-quality images containing more important information. The main difference between these image fusion tasks is that the source images are different, and the source images of MEF are a series of images with different exposure levels. In addition, it can also be used for image enhancement under low illumination [6,7], defogging [8], and saliency detection [9] by fusing or generating pseudo exposure sequences.
Research on MEF has been ongoing for more than 30 years, during which hundreds of relevant scientific articles have been published. In particular, with the continuous increase in the number and quality of newly proposed methods in recent years, significant progress has been achieved in this field [10]. A suitable MEF method should work stably in both static and dynamic scenes, appropriate exposure, good visual quality, and low consumption of computing cost, especially when processing high-resolution images. Therefore, the design of the MEF algorithm is a very challenging research task. This paper analyzes and discusses the research status and development trends of MEF technology. The main contributions of this review are summarized as follows.
1. The existing MEF methods are comprehensively reviewed. Following the latest developments in this field, the current MEF methods are divided into three categories: spatial domain methods, transform domain methods, and deep learning methods. The deghosting MEF methods in a dynamic scene are also discussed as a supplement and further analyzed.
2. A detailed performance evaluation is conducted. We compare the 18 representative MEF methods on multiple groups of typical source images using nine commonly used objective fusion metrics. The performance of MEF methods in static and dynamic scenes is analyzed. Relevant resources, including source images, fusion results, and related curves, have provided corresponding download links at "https://github.com/xfupup/MEF_data" (accessed on 9 January 2022). This is convenient for the comparison and analysis of the current MEF algorithms.
3. The challenges in the current study of MEF are discussed, and future research prospects are put forward.
The remainder of this paper is organized as follows. Section 2 provides a comprehensive review of existing MEF methods. Section 3 presents a comparative study of MEF. In Section 4, some prospects and potential research directions are put forward. Section 5 concludes the paper.

A Review on MEF
MEF has attracted extensive attention because it can effectively generate high-quality images with a high dynamic range by combining different information from the image sequence with different exposure levels. In the past 30 years, many scholars have proposed a variety of MEF algorithms. According to the existing research data, Burt et al. [11] were one of the earliest research teams to study MEF. In their work, a pyramid-based method was proposed to perform multiple image fusion tasks, including visible and infrared image fusion, multi-focus image fusion, and multi-exposure image fusion. After that, a large number of traditional MEF algorithms were proposed. In recent years, research based on deep learning has become a very active direction in MEF. The MEF algorithms can be classified in different ways. Zhang [10] divided MEF into three categories based on the number of input source images, whether the imaging scene is static or dynamic, and whether deep learning is used. Our research found that some methods can only process two images [10,[12][13][14]. In [10], they compared the fusion results from the two images and gave a benchmark. However, most of the current MEF methods support multiple input images. Some MEF methods that can deal with the images in a static scene may also perform in a dynamic scene. Therefore, this paper presents a taxonomy of MEF methods that proposes to divide the existing MEF approaches into three categories: spatial domain methods, transform domain methods, and deep learning methods. In addition, MEF from a dynamic scene when camera jitters or moving objects are present has always been a challenge in this field. Ghost detection and elimination technology in a dynamic scene has already attracted much interest. This paper also further studies the MEF in a dynamic scene. The taxonomy is shown in Figure 2. It should be noted that although the presented taxonomy is valid in most cases, some hybrid algorithms are not easy to be classified into a single class. Such methods are classified according to their most dominant ideas. The MEF methods based on the spatial domain use certain spatial features to fuse the input source images directly in the spatial domain according to specific rules. The general processing flow of this class of method is to generate a weight-mapping map for each input image and calculate the fused image as a weighted average of all input images. According to the level of information extraction, the MEF methods based on the spatial domain can be roughly divided into three types: pixel-based methods, patch-based methods, and optimization-based methods.
The MEF methods based on the transform domain generally consist of three stages: image transformation, coefficient fusion, and inverse transformation [15], as shown in Figure 3. First, the input images are transformed into another domain by applying image decomposition or image representation. Then, the transformed coefficients are fused through some pre-designed fusion rules. Finally, the fused image is reconstructed by the corresponding inverse transformation on the fused coefficients. Compared with the MEF methods based on spatial domain, the most prominent feature includes the inverse transformation stage of reconstructing the fused image. According to the transformation used, the MEF methods based on the transformation domain can be further divided into the multi-scale decomposition-based approaches, gradient-domain-based methods, sparse representation-based methods, and other transform-based methods. In recent years, deep learning has become a very active direction in the field of MEF. Neural networks with a deep structure have been widely proved to have strong feature representation ability and are very useful for various image and vision tasks, including image fusion. Currently, deep learning models, including convolutional neural networks (CNNs) [16] and generative adversarial networks (GANs) [17], have been successfully applied to MEF. Depending on the model employed, deep learning-based methods can be further classified into supervised-based and unsupervised-based methods.
Due to the time difference in the image acquisition, camera jitter, and inconsistent object motion, it is challenging to avoid ghosts in fusion results [18,19]. The introduction of movement correction and relevant measures in the dynamic scene can effectively eliminate ghosts and improve the visual quality of the fused image. The deghosting algorithms in MEF can be broadly classified into three categories: global exposure registration, moving object removal, and moving object selection or registration.
Each category of the MEF methods is reviewed in detail as follows.

Pixel-Based Methods
This kind of method directly fuses the pixel-based features of the source images according to certain fusion rules. Due to its advantages in obtaining accurate pixel weight maps for fusing, it has become a popular direction for MEF. These methods directly act on pixels. Most pixel-based methods are designed in the framework of a linear weighted sum, that is, the fused image is calculated as a weighted sum of all input images. The core problem is to obtain the weight map for each input image, and various pixel-based MEF methods have been proposed on different strategies to compute the weight maps. Bruce [20] normalized each pixel from the input image sequence and converted it into a logarithmic domain. Taking each pixel as the center and R as the radius, they calculated the entropy in the circle and assigned a weight to each pixel in line with the information entropy. Finally, the input images were merged based on the weight after exiting the logarithmic domain. Although the information entropy of the fused image is high, the color is unnatural in some cases. Lee [21] proposed an MEF method based on adaptive weight. Specifically, they defined two weight functions that reflected the pixel quality related to the overall brightness and global gradient. The final weight was the combination of these two weights. To adjust the brightness of the input images, Kinoshita [22] presented a scene segmentation method based on brightness distribution and tried to obtain the appropriate exposure values to decrease the saturation area of the fused image. Xu [23] designed a multi-scale MEF method based on physical features. In their work, a new Retinex model was used to obtain the illumination maps of the original input images, and weight maps were built, combined with the extracted features. Ulucan [24] introduced an MEF method based on linear embedding and watershed masking using a static scene. Linear embedding weights were extracted from differently exposed images. The corresponding watershed templates were used to adjust these mappings according to the information of the input images for the final fusion. However, the visual quality and statistical score will be reduced when the input image sequence contains extremely overexposed or underexposed images. There are many other pixel-level image fusion algorithms that use filtering methods to process the weighted maps. Raman and Chaudhuri [25] designed a bilateral filter-based MEF method that preserved the texture details of the input images at different exposure levels. Later, Li [26] used a median filter and recursive filter to reduce the noise of weight maps. However, only the gradient information of individual pixels was considered, regardless of the local regions.
The main drawback of pixel-based MEF methods is that they are sensitive to noise, ignore neighborhood information, and are prone to various artifacts in the final fused image. Therefore, most methods require some pre-processing, such as histogram equalization, or post-processing of the weight map, such as edge-preserving filtering, to produce a higherquality fusion result. Even though the boundary filtering algorithms were added in some methods [25][26][27] and the halo artifacts can be reduced to some extent, the problem has not been solved at the root. Meanwhile, improvement strategies may bring new issues, such as breaking illumination relationships, over-relying on the bootstrap image, or significantly increasing computational complexity.

Patch-Based Methods
Unlike the pixel-based MEF method, the patch-based method divides the source images into multiple patch regions at a certain step size. Then, the patches at the same position corresponding to each image in the sequence are compared, and the patch containing the significant information is selected to form the final fused image. In [28], a patch-based method was first introduced to solve the MEF problem in the static scenes. The image was divided into uniform patches and the information entropy of each patch was used to measure the richness of the patch. They selected the most information patches and integrated them together using a patch-centered monotonically decreasing blending function to obtain the fused image. The disadvantage of this method is that it is easy to cause a halo at the boundary of different objects within the fused image. After that, many patch-based MEF methods were presented [29]. Ma [30] conducted a commonly used MEF method, which first extracted image patches from the input images, and decomposed them into three conceptually independent components: signal strength, signal structure, and average strength. These components were processed according to the patch intensity and exposure measurement to generate color image patches. These image patches were then put back into the fused image. Following this work, Ma [31] proposed a structural patch decomposition MEF approach (SPD-MEF). Compared with [30], the main improvement is that SPD-MEF can use the orientation of the signal structure components in the image patch to guide the verification of structural consistency for generating vivid images and overcoming ghost effects. This method does not need subsequent processing to improve visual quality or reduce spatial artifacts. However, the image patch size is fixed and has poor adaptability to the scene. The smaller size will cause the fused image to have serious spatial inconsistency, and the larger size will lead to the loss of detail in the fused image. On this basis, Li [32] proposed an improved multi-scale fast SPD-MEF method, which effectively reduced halo artifacts by recursively downsampling the patch size. In addition, the implicit implementation of structural patch decomposition also greatly improved the calculation efficiency. Later, Li [33] continued to add an edge detail retention factor and further designed a flexible bell curve for accurately estimating the weight function of the average intensity component. This function can retain the details in bright and dark regions and improve the fusion quality while maintaining a high computational speed. Wang [34] proposed an adaptive image patch segmentation method that used superpixel segmentation to divide the input images into non-overlapping patches composed of pixels with similar visual properties. Compared with the existing MEF methods that used fixed-size image patches, it avoided the patch effect and preserved the color properties of the source images.
In contrast to the pixel-based MEF methods, the main advantage of the approach based on patch is that the weight map has less noise because it combines the neighborhood information of the pixels and is robust to noise. However, since the patch in the image may span different objects, there are problems in edge detail retention, leading to edge blurring and halo, especially in edges with sharp changes in brightness.

Optimization-Based Methods
Several other MEF approaches are integrated into an optimization framework, and the weight maps are estimated by calculating the energy function. Shen [35] proposed a general random walk framework considering neighborhood information from the probability model and global optimization. The fusion was converted into a probabilistic estimation of the global optimal solution, and the computational complexity was reduced. However, since the method ultimately used a weighted average to fuse the pixels, it may degrade image details. In [2], by estimating the maximum a posteriori probability in the hierarchical multivariate Gaussian conditional random field, the optimal fusion weights can be obtained based on color saturation and local contrast. Li [36] performed MEF using fine detail enhancement for extracting the details from the input images based on quadratic optimization to improve the overall quality of the fused image. Song [37] approximated the ideal luminance image to a maximum contrast image using gradient constraints under the framework of maximum a posteriori probability. This fusion scheme integrated the gradient information of the input images and increased the fused image's detail information. In [38], an underexposed image enhancement method was proposed, where the optimal weights were obtained by the energy function to retain the details and boost edges. Ma [39] obtained a fusion result by globally optimizing the structural similarity index that directly operated in all input images. They used the gradient rise optimization method to search the image to be optimized, which was the color MEF structural similarity (MEF-SSIMc) index, iteratively moving toward improving the MEF-SSIMc until convergence. The proposed optimization framework was easily extended when MEF models with better objective quality were available. In fused images, using only global optimization may lead to local overexposure or underexposure. Similarly, using only local optimization may degrade the overall performance of the fusion result. Therefore, Qi [40] combined a priori exposure quality and a structural consistency test to improve the robustness of MEF. At the same time, through the evaluation of exposure quality and the decomposition of image patch structure, the global and local quality of the fused image were optimized.
The main advantage of the optimization-based methods is that they are general. Specifically, they can flexibly change optimization indicators if a better one is available. However, this is also the major disadvantage of these methods, since a single indicator may not be sufficient to obtain a high-quality fused image. Therefore, the performance of these methods is highly indicator dependent. Unfortunately, there is no indicator that can completely express the fused image quality. All these methods suffer from severe artifacts such as ringing effects, loss of detail, and color distortion, which lead to poor fusion results. In addition, these methods are computationally intensive and cannot meet real-time requirements.

Transform Domain Methods
Transform domain-based MEF methods can be mainly classified into multi-scale decomposition-based methods, gradient domain-based methods, sparse representation based-methods, and other transform-based methods.

Multi-Scale Decomposition-Based Methods
Burt [11] was one of the first to research the MEF algorithm, and proposed a gradient pyramid model based on directional filtering. Mertens [41] proposed a multi-scale fusion framework, as shown in Figure 4, which decomposed all the input images using the Laplace pyramid. The framework took the contrast, color saturation, and exposure to calculate and normalize the weight maps, which were smoothed by the Gaussian pyramid. Then, the Gaussian pyramid of the weight maps was multiplied by the Laplace pyramid of the multi exposure images to obtain the fusion result, which can better recover the image's brightness, but cannot restore the details of the severely overexposed region. Based on this framework, many studies have been proposed to further improve fusion performance. Li [42] presented a two-scale MEF method, which first decomposed the input images into base and detail layers and then calculated the weight maps by utilizing the significance measure. They fined weight maps using the guidance filter. The texture information of the input images could be retained, but halo artifacts still existed. An MEF method based on mixed weight and an improved Laplace pyramid was introduced to enhance the detail and color of the fused image in [43]. Based on the multi-scale guided filter, Singh [44] proposed an image fusion method to obtain detail-enhanced fusion images. This method had the advantages of both the multi-scale and guided filter methods, which can also be expanded in multi-focus image fusion task. Nejati [45] designed a fast MEF approach in which a guided filter was applied to decompose the input images for obtaining base and detail layers. To obtain the fused image, the brightness components of the input images were used to combine the base layers and the detail layers based on the blending weights of the exposure function. LZG [46] merged LDR images with different exposure levels by using the weighted guide filter to smooth the weight maps' Gaussian pyramid. They designed a detail-extraction component to manipulate the details in the fusion image according to the users' preference. Yan [47] proposed a simulated exposure model for white balance and image gradient processing. It integrated the input images under different exposure conditions into a fused image by using the linear fusion framework based on the Laplace pyramid. Wang [48] presented a multi-scale MEF algorithm based on YUV color space instead of RGB color space and designed a new weight smoothing pyramid used in YUV color space. A vector field construction algorithm was introduced to maintain the details of the brightest and the darkest areas in HDR scenes and avoid color distortion in the fused image. In some approaches, they often used edge-preserving smoothing technology to improve multi-scale MEF algorithms. Kou [49] proposed a multi-scale MEF method that introduced an edge-preserving smoothing pyramid to smooth the weight map. Owing to the edge-preserving characteristics of the filter, the details of the brightest/darkest regions in the fused image were well kept. Following [49], Yang [50] introduced a multi-scale MEF algorithm that first generated a virtual image with medium exposure based on the input images. Then, the method presented in [49] was applied to fuse this virtual image and achieve the fused result. Qu [51] proposed an improved Laplace pyramid fusion framework to achieve a fused image with detail enhancement. In addition, it is not easy to determine the appropriate fusion weight. To overcome this difficulty, Lin [52] presented an adaptive search strategy from coarse to fine, which used fuzzy logic and a multivariable normal conditional random field to search the optimal weight for multi-scale fusion.

Gradient-Based Methods
This kind of method is inspired by the physiological characteristics of the human visual system and is very sensitive to illumination. These methods aim to obtain the gradient information of the source images and then compose the fusion image in the gradient domain. Gu [53] presented an MEF method in a gradient field based on Riemannian geometric measurement and the gradient value of each pixel, which was generated by maximizing the structure tensor. The final fused image was obtained by a Poisson solver. The average gradient of the fused image was high, and this method was suitable in the details. However, the color was rarely processed, such that the fused image was dark, and the color was unnatural. Zhang [54] proposed an MEF method to apply in static and dynamic scenes based on gradient information. Under the guidance of the gradient-based quality evaluation, it generated a tone map similar to a high dynamic range image through seamless synthesis. Similarly, research using the gradient-based method to maintain image saliency was presented in [55], where the significance gradient of each color channel was computed. Moreover, the acquisition of the gradient was also a critical issue for calculating the image contrast. In general, the corresponding eigenvector of the matrix decided the gradient amplitude of the fusion. Several improved approaches were appropriated to optimize the weighted sum of the gradient amplitude in [56]. In this method, they used a wavelet filter, decomposing the image luminance, to obtain the corresponding decomposition coefficients. Paul [57] designed an MEF approach based on the gradient domain, which first converted the input images into YCbCr color space and then performed the fusion of the Y channel in the gradient domain. At the same time, the chrominance channels (Cb and Cr) were fused by applying a weighted sum of the chrominance channels. Specifically, the gradient in each orientation was estimated based on the maximum amplitude gradient selection. Using the gradient, the luminance was reconstructed based on the Harr wavelet. In [58], according to local contrast, brightness, and spatial structure, the author first calculated three weights of the input images and combined them using the multi-scale Laplacian pyramid. The dense scale-invariant feature transformation was used to compute the local contrast around each pixel position and measure the weight maps. The luminance was calculated in the gradient domain to obtain more visual information.

Sparse Representation-Based Methods
The approaches based on sparse representation (SR) take the linear combination of elements in the over-complete dictionary to describe the input signal. The error between the reconstructed signal and the input signal is minimized with as few non-zero coefficients as possible, allowing for a more concise representation of the signal and easier access to signal details [59]. In the past decade, SR-based MEF methods have rapidly become an essential branch in the area of image fusion. A dictionary obtained by K-SVD was used to represent the overlapping patches of the image brightness by the "sliding window" technique in [60]. The fusion image was reconstructed based on the sparse coefficients and the dictionary. Shao [61] proposed a local gradient sparse descriptor to generate the local details of the input image. It extracted the image features to remove the halo artifacts when the brightness of the source images changed sharply. Yang [62] designed a sparse exposure dictionary for exposure estimation based on sparse decomposition, which was used to construct the exposure estimation maps according to the atomic number of the image patches obtained by sparse decomposition.

Other Transform-Based Methods
In addition to the above methods, discrete cosine transform (DCT) and wavelet transform have also been successfully applied to MEF. Lee [63] proposed an HDR enhancement method based on DCT that fused two overexposure and underexposure images. This algorithm used the quantization process in JPEG coding as a metric for improving image quality so that the fusion process can be included in the DCT-based compression baseline.
They proposed a Gaussian error function based on camera characteristics to improve the global image brightness. Martorell [64] constructed an MEF method based on the sliding window DCT transform, which used YCbCr transform to calculate the luminance and chrominance components of the image, respectively. Specifically, this technique decomposed the input images into multiple patches and computed the DCT of these patches. The patch coefficients from the same position of the input images with different exposure levels were combined according to their sizes. The chromaticity values were fused separately as a weighted average at the pixel level. In [65], the input images were converted into YUV space, and the color difference components U and V were fused in line with the saturation weight. The luminance component Y was converted into the wavelet domain, and the corresponding approximate sub-band and detail sub-band were fused by the well-exposedness weight and adjustable contrast weight, respectively. The final fused result was obtained by transforming the fusion image into RGB space.

Deep Learning Methods
In recent years, significant success has been achieved based on deep learning in computer vision and image processing applications [66,67]. More and more MEF methods based on deep learning have been proposed to improve fusion performance [68][69][70][71]. To provide some useful references for researchers, the achievements in recent years based on deep learning are reviewed, including supervised and unsupervised MEF methods.

Supervised Methods
In supervised MEF algorithms, a large number of multi-exposure images with ground truth are required for training. However, this requirement is difficult to meet because there is generally no ground truth available in the MEF. Researchers have to find effective ways to create ground truth to develop this kind of method. CNN is known to be effective to learn local patterns and capture promising semantic information. Furthermore, it is also known to be efficient compared with other networks [72,73]. In 2017, Kalantari [74] first introduced a supervised CNN framework for MEF research. The ground truth image dataset was generated by combining three static images with different exposure levels in their work. The three images were converted into an approximate static scene by optical flow. Then, a convolutional neural network (CNN) was used to obtain fusion weights and fuse the aligned images. The contributions of this paper were: (1) presenting the first study on deep learning MEF; (2) the fusion effects of the three CNN architectures were discussed and compared; and (3) a dataset suitable for MEF was created. Since then, many MEF algorithms based on deep learning have been proposed. In 2018, Wang [75] proposed a supervised CNN-based framework for MEF. The main innovation of the approach was that it used the CNN model to gain multiple sub-images of the input images to use more neighborhood information for convolution operation. This work changed the pixel intensity of the ILSVRC 2012 verification dataset [76] to generate the ground truth images. However, it may not be real for the ground truth images generated in this way.
The second way to solve the lack of the ground truth is to use the pre-trained model generated by different methods. Li [77] extracted the features of the input images by utilizing a pre-trained model in other networks and calculated the local consistency using these features to determine the weights. In addition, due to motion detection implementation, this method can be used in both static and dynamic scenes. Similar work was also presented in [78].
The third way to solve this problem is to select fusion results from some methods as the ground truth. Cai [79] used 13 representative MEF techniques to generate 13 corresponding fused images from each sequence and then selected the image with the best visual quality as the ground truth by conducting subjective experiments. They provided a dataset containing 589 groups of multi-exposure image sequences, with 4413 images. The whole process required much manual intervention, so the number of image sequences trained was very limited, which may hinder the generalization ability of the fusion network. Liu [80] proposed a network for decolorization and MEF based on CNN. To obtain satisfactory qualitative and quantitative fusion results, the local gradient information from the input images with different exposure levels was calculated as the network's input. It worked on a source image sequence consisting of three exposure levels and each exposure level can be viewed as a signal channel. In [81], a dual-network cascade model was constructed consisting of an exposure prediction network and an exposure fusion network. The former was used to recover the lost details in underexposed or overexposed regions, and the latter could perform fusion enhancement. This cascade model used a three-stage training strategy to reduce the training complexity. However, the down-sampling operation in this model may cause checkerboard defects in the fused image, and the author alleviated this problem by applying a loss function constructed with the structural anisotropy index.
The above supervised methods are explicitly designed for the MEF issue. There are also several methods based on supervised deep learning that are constructed for some image fusion tasks, including MEF. Zhang [82] proposed an end-to-end fully convolutional approach (IFCNN) that used Siam architecture, as shown in Figure 5. Two branches extracted the convolutional features from the input images and fused them using element average fusion rules (note that different fusion tasks used different rules). In IFCNN, the model was optimized utilizing perceptual loss, plus a fundamental loss that calculated the intensity difference between the input images and the ground truth. IFCNN can be suitable for fusing images at arbitrary resolution. However, its performance in the MEF task may be limited because it was trained only with a multi-focus image dataset. In [83], a general cross-modal image fusion network was presented, exploring the commonalities and characteristics of different fusion tasks. Different network structures were analyzed in terms of their impact on the quality and efficiency of image fusion. The dataset constructed by Cai [79] was used for the MEF task. However, these models were not explicitly designed for MEF issues and were not fine-tuned on multi-exposure images, so their performance may not be satisfactory in some cases. The methods above either create the ground truth images by adjusting the brightness value of the normal images, use other pre-trained models in other works to obtain the ground truth images, or use the images with subjective effects in the existing fusion results as the ground truth images. However, these methods may not deal well with the lack of real ground truth images. In particular, the ground truth images in some MEF algorithms are selected from the fusion results of other methods and not taken by optical cameras. They may not be accurate or appropriate. In order to solve these problems, some studies try to construct unsupervised MEF architectures.

Unsupervised Methods
Since there are generally no real ground truth images available, some studies have turned to developing the MEF methods based on unsupervised deep learning to avoid the need for ground truth in training. This section describes the relevant unsupervised MEF methods.
In 2017, Prabhakar [84] built the first unsupervised MEF architecture for fusing image pairs, named DeepFuse, as shown in Figure 6. This method first converted the input images into YCbCr color space. Then, a CNN composed of the feature layers, a fusion layer, and the reconstruction layers was used for feature extraction of the Y channel, while the fusion of Chrominance channels (Cb and Cr) was still executed manually. Thirdly, the image data in YCbCr space were converted back into RGB space to obtain the final fusion image. This unsupervised method used a fusion quality metric MEF-SSIM [85] as the loss function to realize unsupervised learning. DeepFuse can extract effective features and be more robust to different inputs because it uses CNN to fuse the brightness, which is its main advantage. Furthermore, as an unsupervised method, it does not need ground truth to train. However, a different color space conversion is required, which is not easy compared with fusing RGB images directly. In addition, simply using MEF-SSIM as the loss function is not enough to learn other critical information not covered by MEF-SSIM. Ma [86] presented a flexible and fast MEFNet for the MEF task, and it also worked in the YCbCr color space. First, the input images were downsampled and sent to a context aggregation network for generating the learned weight maps, which were jointly upsampled to high resolution using a guided filter. Then, the upsampling weight maps were used for the weighted summation of the input images. Specifically, the context aggregation network was trained for fusing the Y channel, while the fusion of Cb and Cr was executed with a simple weighted summation. The final fused image in RGB space was obtained by converting color space. The flexibility and speed were the significant advantages of MEFNet, i.e., the input images with arbitrary spatial resolution can be fused using this fully convolutional network, and the fusion process was efficient since the main calculation was carried out with a fixed low resolution. However, because only MEF-SSIM was used as the loss function, there was the same problem in MEFNet as in DeepFuse.
Qi [87] presented the UMEF network for MEF in static scenes. They used CNN to extract features and fused them to create the final fusion image. Compared with DeepFuse, there were three main differences between them, as follows. First, UMEF can fuse multiple input images. By contrast, DeepFuse was designed to fuse two input images. Second, the loss function was made up of two parts: MEF-SSIMc and an unreferenced gradient loss, while the loss function of DeepFuse was only MEF-SSIM. As a result, more details of the fused images were reserved in UMEF. Third, the color images can be directly fused with MEF-SSIMc in UMEF, and the color space conversion is avoided. In [88], an end-to-end unsupervised fusion network was designed to generate a fusion image, named U2Fusion. It was applied to solve different fusion tasks, such as multi-modal, multi-exposure, and multi-focus issues. U2Fusion extracted the features with pre-trained VGGNet-16 and fused the input images with the DenseNet network. The importance of the input images can be automatically estimated through feature extraction and information measurement, and an adaptive information preservation degree was put forward. However, this method was required for the quality of the input images. When acquiring the image, these problems will be amplified if there is noise or distortion. Gao [89] made some improvements based on U2Fusion and applied the MEF model to the transportation field. The quality of the fused images from the fusion model was improved using adaptive optimization.
Besides the unsupervised method based on CNN, some unsupervised MEF methods based on Generative Adversarial Networks (GAN) were also proposed. Chen [90] presented an MEF network and fused two input images. This network integrated homography estimation, attention mechanism, and adversarial learning, which were, respectively, applied to camera motion compensation, the correction of the remaining moving pixels, and artifact reduction. Xu [17] designed an end-to-end architecture for MEF based on GAN, named MEF-GAN, and used the dataset from Cai [79]. Following [17] and [90], a GAN-based MEF network, named GANFuse, was proposed in [91]. There were two main differences between GANFuse and the GAN-based MEF approaches above. First, as an unsupervised network, GANFuse used an unsupervised loss function, which was applied to measure the similarity between the fusion image and the input images, rather than the similarity with the ground truth. Second, GANFuse was composed of one generator and two discriminators. Each discriminator was used to distinguish the difference between the fusion image and the input images.
It should be noted that all the above unsupervised MEF networks, except UMEF, required color space conversion. The input images needed to be converted into YCbCr color space, and the Y channel was fused with the deep learning model, while the Cb and Cr channels were fused by weighted summation.

HDR Deghosting Methods
Most MEF approaches assume that the source images are perfectly aligned, which is usually violated in practice because of the time differences in image acquisition. Once there are moving objects in the scene, ghosts or blur artifacts often occur that degrade the quality of the fusion image [92][93][94], as shown in Figure 7. MEF in dynamic scenes has always been a challenge. To obtain HDR images without artifacts, a large number of deghosting MEF methods have been proposed from different angles, which are mainly in two aspects: how to detect the ghost area and how to eliminate ghosts. Based on the above parts, this part analyzes the MEF deghosting methods in-depth. The current MEF methods in dynamic scenes are investigated, classified, and compared. The MEF processing in a dynamic scene is divided into the following categories: global exposure registration, moving object removal, and moving object selection or registration.

Global Exposure Registration
The main aim of the methods in this class is to compensate for and eliminate the impacts of the camera motion based on parameter estimation of the transformations that are applied to each input image. These methods do not pay attention to the existence of moving objects.
Cerman [95] presented an MEF approach to register the source images and eliminate the camera motion in handheld image acquisition equipment. The correlation in the Fourier domain was used to evaluate the image offset from the translational camera motion in initial estimation. Both the translational and rotational movement of the subpixels between input images were locally optimized. They used the registration on continuous image pairs without selecting a reference image. Gevrekci [96] proposed a new contrast-invariant feature transform method. This method assumed that the Fourier components were in phase at the corner position, used a local contrast stretching step on each pixel of the source images, and applied the phase congruency for detecting the corners. Then, they registered the source images by matching features using RANSAC. Another approach using the phase congruency images was provided in [97], which used cross-correlation technology to register the phase congruency images in the frequency domain instead of using them to discriminate the key points in the spatial domain. Besides translation registration, rotation registration was also performed with log polar coordinates, where the rotation motion was represented by the translation transformation in the coordinates. Evolutionary programming was applied to detect subpixel shifts to search the optimal transformation values. In [98] used a target frame localization method to register the input images and compensated for the undesired camera motion in the registration process.
Furthermore, using a camera with a fixed position could reduce this problem. In addition to camera motion, the more challenging problem is that moving objects may appear as ghost artifacts in the fused image. Therefore, in recent years more research has focused on removing the ghosts of moving objects in fusion images.

Moving Object Removal
This kind of method removes all moving targets in the scene by static background estimation. Most of the image scenes are static in practical applications, and only a small part of the image contains moving objects. Without selecting a reference image, most of these algorithms perform a consistency check for each pixel of the input images. The moving object is modeled as the outlier and eliminated to obtain an HDR image without artifacts.
Khan [99] proposed an HDR deghosting approach for adjusting the weight by estimating the probability that each pixel belongs to the background iteratively. Pedone [100] designed a similar iterative process, which increased the chances of the pixels belonging to the static set through the energy minimization technology. The final probabilities were applied as the weights of MEF. Zhang [101] utilized the gradient direction consistency to determine whether there was a moving object in the input images. This method calculated the pixel weights using quality measures in the gradient domain rather than absolute pixel intensities. The weight of each image was computed as a product of consistency and visibility scores. If the pixel gradient direction was consistent with the collocated pixels from other input images, the pixel was assigned a larger weight by the consistency score. On the other hand, the pixel with a larger gradient was assigned a larger weight by the visibility score. However, this method may not be robust in frequently changing image scenes. Wang [102] introduced visual saliency to measure the difference between the input images. They applied bilateral motion detection to improve the accuracy of the marked moving area and avoid the artifacts in a fused image through fusion masks. The ghosts of moving objects and handheld camera motion can be removed. However, they need more than three input images for effective fusing. Li [103] applied a light intensity mapping function and bidirectional algorithm to correct non-conforming pixels without reference images. This method used two rounds of hybrid correction steps to remove ghosts in the fused image. In [51], the weight maps were calculated based on luminance and chromaticity information in the YIQ color space. For dynamic scenes, this method used image difference and superpixel segmentation to refine the weight maps, and the weights of moving objects were decreased to eliminate the undesirable artifacts. Finally, a fusion framework based on the improved Laplacian pyramid was proposed to fuse the input images and enhance the details. However, the algorithm was time-consuming and did not work well when the camera jittered.
These methods assume a main pattern in the input image sequence, referred to as the "majority hypothesis", which means that moving objects only occupy a small part of the image. A common problem in these methods is that the performance may not be satisfied when the image scene contains moving objects with large motion amplitude or when some parts of the images in the sequence change frequently.

Moving Object Selection or Registration
The main difference between the algorithms in this class and the moving object removal methods is that the fusion result of the former includes the moving objects appearing in the selected reference image. The moving object selection or registration methods focus on reconstructing the pixels affected by the movement through finding the local correspondence between the regions affected by object motion. Some methods selected one or more source images for each dynamic portion as guidance to eliminate the ghost. Jacobs [104] developed an object motion detector based on an entropy map, which did not need camera curve knowledge to detect the ghost in the region with low contrast. However, this method mostly failed when there was a large image region with moving objects or fewer texture features. Pece [105] introduced an MEF method to remove ghosting artifacts based on the bitmap motion detection technique. First, they extracted each input image's exposure, contrast, and saturation and then applied the median bitmap to detect the moving objects. However, neglecting image structure information may have some adverse effects on the fusion image. Silk [106] employed different methods for deghosting according to the type of movement. The method started with implementing the change detection and did not consider the object boundaries. Then, to refine the object boundaries, using simple linear iterative clustering (SLIC) super-pixels, the images were over-segmented. These super-pixels were divided into motion and nonmotion areas in line with the number of inconsistent pixels signed with the change detection above. In the fusion stage, the super-pixels with the movement were allocated smaller weights. Some super-pixels containing the moving object in each of the source images were allocated larger weights when there were moving objects. Based on previous work in [101], Zhang [107] improved upon and proposed a reference-guided deghosting method. They assumed that most pixels in the source images were static compared with the pixels in the motion regions. They introduced a consistency check for the pixels in the reference image to deal with frequently changing scenes. Granados [108] proposed a Markov random field for ghost-free fused images in a dynamic scene, and selected a reference image to reconstruct the dynamic content. Because the moving object was obtained from a single reference image, the dynamic range of the moving object cannot be recovered fully. In addition, object overlap or half-included objects may appear without any semantic constraint in the reconstruction. In [26], the local contrast and brightness of static images and the color dissimilarity weight of dynamic images were extracted using fusion in static and dynamic scenes. The weight maps were smoothed by the recursive filter. To overcome the ghosts, they applied a new histogram equalization method and median filter to detect the motion regions from the dynamic scenes. Lee [109] proposed a rank minimization method to detect the ghost region. The constraints on moving objects were incorporated into the framework, which consisted of sparsity, connectivity, and the a priori information from underexposed and overexposed areas. The study in [110] presented a ghost-free MEF method based on an improved difference approach. Before becoming ghost-free, each input image was normalized to the brightness consistent with the reference image's exposure level. When the pixel was underexposed or overexposed, a special operation was performed by matching other available exposures. Two reference images were selected in this method.
Some methods use optical flow estimation and the feature matching strategy to remove ghosts in dynamic scenes. Zimmer [111] performed an alignment step based on the optical flow method, which exploited the multiple exposures and created a super-resolution image.
This method can generate dense displacement fields with sub-pixel accuracy and solve the problems caused by moving objects and severe camera jitter. This method relied on the warping strategy, from coarse to fine, to deal with large-scale displacement. Because small objects may disappear on the rough horizontal plane, it was impossible to estimate the large-scale displacement of small objects. Jinno [112] designed a weighting function for fusing the input images, which assumed that the input images had been globally registered. The maximum a posteriori probability estimation was used to estimate the displacement, occlusion, and saturation regions simultaneously. Ferradans [113] proposed an MEF method in the dynamic scene based on gradient fusion, which first selected a reference image and then improved the details of its radiation map by increasing information interpolated from the input image sequence. Liu [114] introduced an approach based on dense scale-invariant feature transform (SIFT) for MEF in static and dynamic scenes. They first applied the dense SIFT descriptor as the activity level measure to extract the local details from the input images and then used the descriptor to eliminate the ghost artifacts in the dynamic scenes. Following this study, Hayat [115] proposed an MEF algorithm based on a dense SIFT descriptor and guided filter. There were two main differences compared with the method in [114]. First, they used the histogram equalization and median filter to compute the color dissimilarity feature instead of the spatial consistency module in [114]. Second, they used the guided filter to remove the noise and discontinuity from the initial weights. Zhang [116] introduced two types of consistency for matching the reference image with the input images before ghost detection, consisting of mutual consistency based on histogram matching and intra consistency based on super-pixel segmentation. This method assumed that the input image was aligned and performed the motion detection at a super-pixel level to maintain the weights of the outliers.
Other algorithms based on a patch-matching strategy reconstruct the moving object region by transferring the information of a subset of the source images. Sen [117] developed an image patch-matching approach for HDR deghosting based on energy minimization. This method can jointly solve the problems of image alignment and reconstruction. Hu [118] established a dense correspondence between the reference image and other images in the sequence. The information of the images in the sequence was modified to match the information of the reference image, and the wrong correspondences were corrected using local homography. In their later work [119], Hu et al. proposed a PatchMatch-based method for removing the ghosts in saturated regions of the source images. This method selected an image with good exposure as the reference image, and the latent image of the fused image was similar to the reference image. The PatchMatch algorithm was used to find the matching patch in other input images in the underexposed or overexposed regions. Compared with the method in [117], this method did not require the conditional random fields of the input images to be linear [120]. Ma [31] detected the structural consistency of the image patches and generated a pixel consistency mapping relationship to realize image registration in the dynamic scene and eliminate the ghosts in the fused image. This method introduced ρ k , representing the consistency between the input image and the reference image, as follows: where · indicates the l 2 norm of a vector. x k is a set of color image patches that are extracted at the same spatial location of the input image sequence containing K multiexposure images, and k lies in [1,K]. l k denotes the mean intensity component of x k . x r is the image patch at the corresponding location of the reference image and l r represents the mean intensity component of x r . s r is the reference signal structure, and s k is the signal structure of another exposure. ρ k lies in [−1,1]. The larger ρ k is, the higher the consistency between s k and s r . Since s k is obtained by mean removal and intensity normalization, it is robust to exposure and contrast variations. The introduction of the constant ε can ensure the robustness of the structural consistency to the sensor noise.
Some methods have been proposed based on this theory [32,33,40]. Such methods have good ghost removal performance. However, the computing cost is also high due to intensive search and repair operations. Li [32,33] improved upon this method and reduced the computational complexity.
In addition, some ghost removal methods are also applied in video [121,122]. Summarize the above HDR deghosting methods, some approaches remove moving objects and generate HDR images with only static areas. Other algorithms select the image with optimal exposure for restructuring dynamic regions or register moving objects from different input images to maximize the dynamic range. However, when eliminating the ghosts, these algorithms may also introduce different artifacts, such as noise, broken objects, dark regions, or partial residual ghosting, etc. Therefore, it is expected that more effective deghosting algorithms will be proposed to adapt to camera jitter or object motion.

Image Dataset
To verify the universality of the algorithm, it is necessary to test and analyze under a variety of representative scenes, including indoor, outdoor, different times, and different weather. Some research teams have disclosed their datasets, summarized in Table 1 as follows. "Dynamic" and "Static" indicate whether the dataset can be used for dynamic or static scenes.

Performance Comparison
Generally, there are two ways to evaluate the performance of MEF methods: subjective qualitative evaluation and objective quantitative comparison.

Subjective Qualitative Evaluation
The observer performs the quality evaluation of the fused image. High-quality fusion images should not only retain as much important information from the input images as possible, but also should be as naturalistic and comparable as the scene. Most of the current literature gives the subjective evaluation results of the algorithm and even gives the enlarged local map in the details. However, it is time consuming and laborious to observe each fused image in practical applications. Moreover, each observer has different standards when observing the fused image, from which it is easy to produce deviation estimations, so objective quantitative evaluation is necessary.

Objective Quantitative Comparison
Quantitatively evaluating the quality of the fusion image is a challenging task, because the ground truth image does not exist. Liu [125] divided 12 popular image fusion metrics into four categories: metrics based on information theory, metrics based on image feature, metrics based on image structure similarity, and metrics based on human perception. The corresponding Matlab code is available at "https://github.com/zhengliu6699/ imageFusionMetrics" (9 January 2022).
Since the performance of the MEF method may vary on the different metrics, it is necessary to use a set of objective metrics to evaluate the MEF algorithm simultaneously. Table 2 summarizes the objective metrics used for MEF methods in a static scene. To ensure an unbiased assessment of fusion performance, this paper applies eight commonly used metrics to evaluate the fused results of different MEF methods in the static scenes, as follows.
1. Structural similarity index measure (MEF-SSIM) [85]. This metric is based on the patch consistency measure and is widely used for MEF performance evaluation. 2. Q AB/F [126]. This is also a commonly used metric in the fused result evaluation. Its primary application is to analyze the edge information of the fused image. 3. Mutual information (MI) [127]. This reflects the amount of information in the fusion image obtained from the input image sequence. 4. Peak signal-to-noise ratio (PSNR) [128]. This is used to measure the ratio between effective information and noise in images and reflect whether the image is distorted. 5. Natural image quality evaluator (NIQE) [115]. This is based on perceived quality. 6. Standard deviation (SD) [89]. This reflects the dispersion of an image. 7. Entropy (EN) [129]. This indicates the richness of information contained in an image. 8. Average gradient (AG) [130]. This expresses the ability of an image to retain small details. MEF-SSIM, Q AB/F , MI, and PSNR are objective metrics with the reference images. NIQE, SD, EN, and AG are metrics without reference images. For the above metrics, except for NIQE, a larger value indicates a better quality of the fusion image.
Since the processing methods in the dynamic scene are different from those in the static scene, the evaluation metrics used are also different. There are few quantitative metrics in the dynamic scene, and most algorithms are only subjectively eval uated [51,103,104,107,109,116,131]. Some other fusion studies only quantify the fusion results in a static scene, while the results in the dynamic scene are not quantified [64,115]. This paper produces some statistics on the quantitative metrics from the literature in the dynamic scene. In a previous work [132], the HDR-VDP-2 metric was used to intuitively judge the distortion between the fused image and the reference image; the larger the value, the higher the quality of the fused image. Fang [133] developed an MEF perception evaluation metric MEF-SSIMd for a dynamic scene. This paper investigates the evaluation metrics used in MEF research, listed in Tables 2 and 3.

Comparisons of Different MEF Methods
In our comparative study, the MEF methods are selected according to the following three principles: the methods have been proposed in recent years, or they have great influence in this field (according to the number of paper indexes); the source codes of these methods are publicly available online; and the selected methods should cover the sub-categories mentioned in Section 2 as much as possible. We selected 18 representative MEF methods, including seven spatial domain methods, nine transform domain methods, and two deep learning methods (one supervised method and one unsupervised method). Table 4 lists the details of the selected MEF methods, including the category, the sources of the methods, the download link of the code, and whether it can be applied in a dynamic scene or not. "Dynamic" indicates that the algorithm can be used not only in a static scene, but also in a dynamic scene. "Static" indicates that the algorithm can only be used in a static scene. "Supervised" and "Unsupervised" mean supervised and unsupervised methods, respectively. For all selected methods, the default parameter settings are the same as those in the original literature.
As can be seen in Figure 11, the fusion results will change when different reference images are selected, and all of the methods can remove ghosts in the "Arch" image sequence. Liu [114], Ma [31], Qi [40], and LH21 [33] retain more details in overexposed and underexposed regions, such as sky and ground shadows, while the fused results of Hayat [115] and LH20 [32] are relatively poor. On the roof of buildings, Hayat [115] and LH21 [33] have a good performance in brightness recovery and detail retention. The brightness values of the fusion results from the other four algorithms are low. Overall, the fusion performance of LH21 [33] is the best in vision impression. Figure 12 illustrates the fused results on the "Wroclav" image sequence. As can be seen, the fused images obtained by Ma [31], LH20 [32], and LH21 [33] have no ghost artifact. While Liu [114] and Hayat [115] eliminate ghosts, they also produce residual ghosts and broken objects. The ghosts are not entirely removed in Qi [40] and the artifacts still exist. Although Ma [31] and LH20 [32] remove the ghosts, the brightness is relatively dark in the underexposed region. In addition, the brightness of the ground in LH20 [32] is still high, and the color of the sky in Ma [31] is distorted. The fusion performance of LH21 [33] is also the best in this image sequence. The details of the regions that are too bright and too dark in the source images are well preserved in LH21 [33], for example, the hair of the man sitting on the bench. Table 6 lists the HDR-VDP-2 scores of the different deghosting methods. Black bold font represents the highest score in this image sequence. The underlined font indicates that the subjective evaluation of the fusion result is relatively good, but the HDR-VDP-2 score is excessively low. The underlined bold font means that the subjective assessment is rather bad, but the HDR-VDP-2 score is excessively high. It was found that the selection of reference images has an impact on the score of HDR-VDP-2. When selecting an input image including a moving object consistent with that in the fusion image as the reference image, or selecting an image corresponding to the brightness of the fusion image but inconsistent with the moving object as the reference image, the HDR-VDP-2 scores are quite different. Although Ma [31] scores the highest in multiple image sequences, the details of the underexposed regions of these image sequences were not recovered well. To sum up, the performance of the fusion results from LH21 [33] is best in both subjective evaluation and objective quantitative scores.

Computational Efficiency
This part gives the calculation time of the MEF methods involved in comparison, except the two deep learning-based MEF approaches (these approaches are performed in a parallel calculating way through GPU acceleration and the computational efficiency is very high). All the other 16 methods are implemented in MATLAB 2018a with a 2.50 GHz CPU (Intel(R) Core(TM) i5-7200U) and 16 GB RAM. When fusing two color images with 1000 × 664 pixels in a static scene, Table 7 lists the average running time of these methods.  [115] 3.8754 Paul [57] 1.8119 Wang [48] 1.5095 Qi [40] 6.9607 Nejati [45] 0.3878 LH20 [32] 1.2923 LZG [46] 8.2330

Future Prospects
Although remarkable progress has been made in MEF, there are some issues for future work. This section gives a detailed discussion on development trends based on the review of the existing methods.
(1) Deep learning-based MEF Although the performance of multi-exposure fusion based on deep learning has been greatly improved, there are still many aspects worthy of further research. i) Establishing a large-scale multi-exposure image dataset is crucial for supervised MEF methods. Some expert photographers may be hired to capture "ground truth", but it is not an easy task due to the general camera's limited capture range. In addition, a method similar to Cai [79] may also be adopted. However, the representative methods should be selected from the latest state-of-the-art algorithms. In addition, data augmentation techniques will provide a way to generate a large amount of data without high cost and time requirements [89]. ii) Constructing a practical loss function consisting of several types of metrics and associated with the specific fusion task. Subsequent research should pay more attention to the characteristics of the fusion task itself rather than blindly increasing the scale of the neural networks. The existing methods based on deep learning rarely consider the correlation between MEF and subsequent tasks when constructing the loss function, which often makes the fusion results subjective. Future research may introduce the accuracy of following tasks to guide the fusion process. iii) At present, most of the existing MEF methods based on deep learning are only suitable for static scenes. In addition, there are also few multi-exposure images in dynamic scenes, so it is necessary to capture and collect multi-exposure images in dynamic scenes.
(2) MEF in dynamic scene Most off-the-shelf MEF methods focus on solving the fusion of the images with different exposure levels in a static scene. However, the source images are often misaligned due to camera jitter and moving objects in the application. The fusion images will suffer from ghosts, image blurring, halo artifacts, and other problems. Although some deghosting MEF methods have been proposed in recent years, they cannot solve these issues robustly. A way to address camera jitter is to use registration, but the preprocessing dependent on the registration algorithm may lead to some limitations, such as low efficiency and dependence on the registration accuracy. Therefore, it is necessary to develop a non-registered method to implicitly realize MEF. Regarding moving objects, selecting the appropriate reference image is also worthy of attention.
(3) The higher-quality MEF evaluation metrics From the discussions from Section 3, it can be found that there may be inconsistency between qualitative and quantitative evaluations of the fused results. Some methods that show good qualitative performance may not perform well in quantitative comparison and vice versa. For example, one of the most commonly used evaluation metrics, MEF-SSIM, still cannot precisely describe the fused images' subjective quality. Several methods also show different performances on different types of metrics. This brings difficulties to the comprehensive evaluation of the MEF methods. Therefore, it is necessary to explore more accurate evaluation metrics from the perspective of the human visual system in future research. In addition, there are few studies on developing appropriate evaluation metrics for MEF in dynamic scenes. These are issues that need attention in the future.
(4) Task-oriented MEF There are few MEF works developed for specific tasks in industry, remote sensing, and other fields. Most research aims at natural images to verify the effectiveness of the proposed methods, and no algorithms can be universal to all scenes. Therefore, it is important to develop and fine-tune suitable MEF algorithms for more specific tasks.

(5) Real-time MEF
From the perspective of application requirements, the MEF technique is the preprocessing of many visual tasks, such as video surveillance, target recognition, target tracking, and other applications. The performance of the MEF methods directly affects the accuracy of the whole research. Some applications have high requirements for the computational efficiency of the algorithm. However, the operating efficiency of the current MEF methods is low owing to the complex transformation, function decomposition, and iterative optimization, which limits the applications of the MEF algorithms in some real-time tasks. Therefore, the development of real-time MEF algorithms is essential, which will cause MEF to have more expensive application areas.
Therefore, based on the above review and prospects, we have not reached the upper limit of MEF, and it remains a long-term task to study higher-quality MEF algorithms and evaluation metrics.

Conclusions
MEF is an essential technique for integrating image information with different exposure levels, which can more comprehensively understand the scene. To follow the latest development in this field, this paper summarizes the existing MEF methods and presents a literature review. These MEF methods can generally be divided into spatial domain, transform domain, and deep learning. In addition, ghost removal MEF methods are also discussed as a supplement. According to the core idea, each class is investigated and further divided into several sub-categories. We analyze the representative methods in each sub-category and provide some systematic summaries. A detailed comparative study for the current MEF algorithms in static and dynamic scenes is carried out. The experiment performs subjective and objective evaluations of the 18 MEF approaches using nine commonly used objective fusion metrics. We have released the relevant resources online for comparison, including input image sequences, fusion results, and curves. There are still many challenges in MEF techniques and objective fusion performance evaluation due to the limitations of sensor noise, camera jitter, moving objects, and computational complexity. Finally, some prospects and potential research directions are proposed based on our observations. We expect that the organized overview of MEF methods can help relevant researchers to better understand the current development of MEF methods.
Author Contributions: F.X. provided the idea. F.X. and J.L. designed the experiments. Y.S. and X.W. analyzed the experiments. F.X. and H.S. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data presented in this study are available in the article.