A Survey of Multi-Focus Image Fusion Methods

Abstract: As an important branch of image fusion, multi-focus image fusion effectively addresses the limited depth of field of optical lenses by fusing two or more partially focused images into a single fully focused image. In this paper, methods based on boundary segmentation are put forward as a distinct group of image fusion methods. On this basis, a novel classification of image fusion algorithms is proposed: transform domain methods, boundary segmentation methods, deep learning methods, and combination fusion methods. In addition, subjective and objective evaluation standards are listed, and eight common objective evaluation indicators are described in detail. Drawing on an extensive body of literature, this paper compares and summarizes various representative methods. Finally, the main limitations of current research are discussed, and the future development of multi-focus image fusion is outlined.


Introduction
Image fusion is the process of generating, for a specific application, an image superior to the original images by exploiting the redundant and complementary information among multiple images of the same scene [1]. By using specific algorithms to extract useful feature information from two or more images, image fusion can generate a new image with more comprehensive and accurate details. According to the type of input source images, image fusion can be divided into remote sensing image fusion, medical image fusion, multi-focus image fusion, multi-exposure image fusion, infrared and visible image fusion, etc. [2]. Image fusion technology has been under development for more than 40 years, with a growing number of research methods and applications. Among them, multi-focus image fusion has broad application prospects in digital photography, computer vision, target tracking and monitoring, microscopic imaging, and other fields [3,4].
The so-called multi-focus problem can be explained as follows: due to the limited focus range of visible-light imaging systems, it is difficult to capture all objects in the same scene clearly [3,5]. As shown in Figure 1, a point in the scene is projected onto a single point in the focal plane to form a focused image. However, if the sensor plane does not coincide with the focal plane, the image formed on the sensor plane is a blurred disk with a diameter of 2R, which is called a defocused image. Accordingly, a multi-focus image can be divided into two parts, namely the focused region and the defocused region. Objects appear sharp in the focused region and relatively blurred in the defocused region. Figure 2 shows 14 groups of grayscale and color images. The images in each group were captured of the same scene at different focus positions, forming a multi-focus set.
Traditional classifications distinguish spatial domain methods from transform domain methods [6,7]. With the rapid growth in new multi-focus image fusion methods, it has become difficult for the existing classifications to place every image fusion algorithm accurately. In particular, they cannot reasonably classify and summarize pixel-level fusion methods: a pixel-level image fusion method can fall into either the spatial domain or the transform domain, depending on the domain chosen [8]. Therefore, this paper proposes a fusion method type based on boundary segmentation and classifies pixel-level fusion methods under it. In this paper, the existing multi-focus image fusion methods are reviewed, sorted, and classified, and eight commonly used objective evaluation methods are summarized.
Drawing on a large body of literature, the typical algorithms, fusion processes, and key technologies of multi-focus image fusion are discussed, and the fusion results and fusion efficiency are compared and summarized. Image fusion methods are divided into four categories: transform domain methods, boundary segmentation methods, deep learning methods, and combination fusion methods. The applicability of each category is summarized. Finally, we analyze and discuss the challenges faced by this field, propose solutions, and outline the future development of multi-focus image fusion technology.
The first part of this paper is the introduction, which introduces the concepts of multi-focus image fusion and summarizes the content of this paper; the second part covers fusion methods and analysis, which analyzes and classifies a variety of multi-focus fusion methods; the third part covers the evaluation indicators, which introduces the commonly used subjective and objective evaluations; the fourth part covers the limitations and gives corresponding solutions to common fusion problems; the fifth part is the conclusion, which analyzes the application and development of multi-focus fusion.

Fusion Methods and Analysis
As shown in Figure 3, image fusion can be divided into pixel-level image fusion, feature-level image fusion, and decision-level image fusion according to the information representation layer [6]. According to the image source, image fusion can be divided into remote sensing image fusion, medical image fusion, multi-focus image fusion, multi-exposure image fusion, and infrared and visible image fusion [7]. According to the fusion method, image fusion can be divided into spatial domain image fusion and transform domain image fusion [8]. This paper proposes a new category, the boundary segmentation method, and divides the current mainstream multi-focus fusion methods into four categories: transform domain methods, boundary segmentation methods, deep learning methods, and combinatorial fusion methods. Table 1 lists the classification and the current mainstream algorithms.
Table 1. Classification and mainstream algorithms of multi-focus image fusion.

Method | Mainstream Algorithm

Multi-Focus Fusion Methods Based on Transform Domain
Most traditional fusion methods in multi-focus image fusion are based on the transform domain [34]. As shown in Figure 4, the transform domain method mainly operates on the decomposition coefficients obtained after image transformation, and includes three fusion stages: image transformation, coefficient fusion, and inverse transformation reconstruction. First, the source image is transformed into the transform domain by an image decomposition algorithm; then various fusion strategies are used to fuse the different coefficients; finally, the corresponding inverse transformation is applied to the fused coefficients to obtain the final fused image. The more decomposition layers are used, the more detailed the information will be; however, the efficiency will decrease. The number of decomposition layers therefore has to be balanced against execution efficiency to obtain a good fusion effect. In terms of the different transformations, the transform domain method is further divided in this paper into methods based on multi-scale decomposition (MSD), methods based on sparse representation (SR), and methods based on the gradient domain (GD).

Multi-Scale Decomposition (MSD)-Based Methods
(1) The pyramid transform
The Laplacian pyramid is the earliest multi-scale decomposition method [9]. In this method, the activity level of a decomposition coefficient is measured by its absolute value, and the fused coefficient is obtained by the choose-max rule: the greater the absolute value of a coefficient, the more information it is assumed to contain. In 2018, Sun et al. [12] proposed a region mosaic method based on the Laplacian pyramid (RMLP) to fuse microscopically captured multi-focus images. The method first used the Laplacian operator to measure the focus level of the multi-focus images. Then, a density-based region growing algorithm was used to segment the focused region mask of each image. Finally, the mask was decomposed into a mask pyramid to supervise the regional stitching of the Laplacian pyramid.
Depending on the form of the tower structure, pyramid transforms can be divided into the gradient pyramid [10], the contrast pyramid [11], the morphological pyramid [35], etc. Fusion methods based on pyramid transforms offer high fusion efficiency while retaining sufficient original information. However, the decomposition method and the number of decomposition layers have a great influence on the final result. The greater the number of decomposition layers, the more blurred the boundaries in the fused image become.
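The pyramid scheme described above (decompose, fuse coefficients by the choose-max rule, reconstruct) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the implementation of any cited method: it works on 1-D signals rather than images, replaces the usual Gaussian kernel with simple pair averaging, and averages the coarsest level. All function names are hypothetical, and the signal length should be at least 2 to the power of `levels`.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (crude low-pass + decimate)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def upsample(signal, length):
    """Nearest-neighbour expansion back to `length` samples."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]


def laplacian_pyramid(signal, levels):
    """Return [detail_0, ..., detail_{levels-1}, base]."""
    pyramid, current = [], [float(v) for v in signal]
    for _ in range(levels):
        smaller = downsample(current)
        expanded = upsample(smaller, len(current))
        # Detail layer: what the coarser level cannot represent.
        pyramid.append([c - e for c, e in zip(current, expanded)])
        current = smaller
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Invert laplacian_pyramid by adding the details back, coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        expanded = upsample(current, len(detail))
        current = [d + e for d, e in zip(detail, expanded)]
    return current


def fuse_pyramids(sig_a, sig_b, levels=2):
    """Choose-max on detail coefficients, plain average on the base (an assumption)."""
    pyr_a = laplacian_pyramid(sig_a, levels)
    pyr_b = laplacian_pyramid(sig_b, levels)
    fused = [
        [x if abs(x) >= abs(y) else y for x, y in zip(da, db)]
        for da, db in zip(pyr_a[:-1], pyr_b[:-1])
    ]
    fused.append([(x + y) / 2.0 for x, y in zip(pyr_a[-1], pyr_b[-1])])
    return reconstruct(fused)
```

Increasing `levels` adds more detail layers to compare, which mirrors the trade-off noted above between the number of decomposition layers and the computation required.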
(2) The wavelet transform
The wavelet transform decomposes the original image into high-frequency and low-frequency coefficients. The high-frequency coefficients contain vertical, horizontal, and diagonal information. The fusion effect of the wavelet transform is better than that of the pyramid transform. However, the wavelet transform is not shift-invariant in its feature representation; thus, the fusion effect is unsatisfactory for poorly registered images. To solve this problem, many improved wavelet transform methods have been proposed. Yang et al. [5] introduced a multi-focus image fusion method based on the fast discrete curvelet transform (FDCT), which solved the problem of the block effect in texture selection and spatial fusion. Yu et al. [13] extracted six-dimensional feature vectors of the source image from the coefficient sub-bands of the dual-tree complex wavelet transform (DT-CWT) and then projected them onto class labels by training a two-class (focused and unfocused) support vector machine (SVM).
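To make the sub-band structure concrete, the following sketch applies a one-level 2-D Haar transform, the simplest wavelet, and fuses two images by averaging the low-frequency band and keeping the larger-magnitude coefficient in each of the three high-frequency bands. It is an illustrative sketch only: the Haar wavelet, the unnormalized scaling, the averaging rule for the low-frequency band, and all function names are assumptions, and the image is assumed to have even dimensions.

```python
def haar_decompose(image):
    """One-level 2-D Haar transform on non-overlapping 2x2 blocks.

    Returns (low, row_diff, col_diff, diag), each half the size of `image`.
    """
    low, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            bands[0].append((a + b + d + e) / 4.0)  # low frequency (block mean)
            bands[1].append((a + b - d - e) / 4.0)  # top minus bottom rows
            bands[2].append((a - b + d - e) / 4.0)  # left minus right columns
            bands[3].append((a - b - d + e) / 4.0)  # diagonal difference
        low.append(bands[0])
        row_diff.append(bands[1])
        col_diff.append(bands[2])
        diag.append(bands[3])
    return low, row_diff, col_diff, diag


def haar_reconstruct(low, row_diff, col_diff, diag):
    """Exact inverse of haar_decompose."""
    rows, cols = len(low), len(low[0])
    image = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            s, v, h, d = low[r][c], row_diff[r][c], col_diff[r][c], diag[r][c]
            image[2 * r][2 * c] = s + v + h + d
            image[2 * r][2 * c + 1] = s + v - h - d
            image[2 * r + 1][2 * c] = s - v + h - d
            image[2 * r + 1][2 * c + 1] = s - v - h + d
    return image


def choose_max_abs(band_a, band_b):
    """Keep, per position, the coefficient with the larger absolute value."""
    return [[x if abs(x) >= abs(y) else y for x, y in zip(ra, rb)]
            for ra, rb in zip(band_a, band_b)]


def fuse_haar(img_a, img_b):
    low_a, *high_a = haar_decompose(img_a)
    low_b, *high_b = haar_decompose(img_b)
    low = [[(x + y) / 2.0 for x, y in zip(ra, rb)] for ra, rb in zip(low_a, low_b)]
    high = [choose_max_abs(ha, hb) for ha, hb in zip(high_a, high_b)]
    return haar_reconstruct(low, *high)
```

Because the blocks do not overlap, shifting an input by one pixel changes every coefficient, which is a small-scale view of the lack of shift invariance mentioned above.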
In general, the advantage of MSD methods lies in extracting more accurate feature information and achieving a better fusion effect. However, the decomposition produces a large amount of information, which leads to a heavy computational load.

Sparse Representation (SR)-Based Methods
Sparse representation is a relatively new approach to image fusion. By exploiting the natural sparsity of image signals, a signal is approximately represented as a linear combination of a few atoms from a redundant dictionary. The atoms are provided by an over-complete dictionary. By selecting some atoms from the over-complete dictionary and using a linear combination of them to reconstruct the image, the dependence between data dimensions and feature vectors can be reduced. Figure 6 shows a multi-focus image fusion framework based on sparse representation. In 2010, Yang et al. [39] introduced SR into multi-focus image fusion for the first time. In this method, a sliding window was used to divide each source image into multiple overlapping small blocks, and the orthogonal matching pursuit (OMP) algorithm was used to perform sparse decomposition of each block. The fused sparse coefficient vectors were obtained by using the maximum selection fusion rule. Subsequently, a variety of improved algorithms based on sparse representation followed. Ma et al. [40] obtained an adaptive dictionary based on rough k-means singular value decomposition. The fixed dictionary was then combined with the adaptive dictionary to obtain a joint dictionary, which was used to sparsely encode the source image and separate the complementary and redundant components. In addition, there are also cross sparse representation [16], group sparse representation [41], K-SVD [42], and other methods.
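In symbols, the patch-wise procedure can be summarized as follows. This is a hedged sketch in generic notation rather than the exact formulation of [39]: the symbols (x_k for a vectorized patch from source image k, D for the over-complete dictionary, epsilon for the error tolerance) are introduced here for illustration, and the use of the l1-norm of the sparse vector as the activity level in the maximum selection rule is an assumption.

```latex
% Sparse coding of each patch (solved approximately, e.g. by OMP)
\hat{\alpha}_k = \arg\min_{\alpha} \|\alpha\|_0
  \quad \text{s.t.} \quad \|x_k - D\alpha\|_2 \le \varepsilon

% Maximum selection rule (activity level assumed to be the l1-norm)
k^{*} = \arg\max_{k} \|\hat{\alpha}_k\|_1 ,
\qquad
x_F = D\,\hat{\alpha}_{k^{*}}
```

The fused patches x_F are then placed back at their sliding-window positions and averaged where they overlap to form the fused image.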
Sparse representation handles noise in the fused image well. However, its treatment of image details (edges, textures, etc.) is not ideal, and such details are easily blurred. In addition, the method has high complexity, low computational efficiency, and poor real-time performance.

Gradient Domain (GD)-Based Methods
The gradient domain (GD)-based method fuses the gradient representations of the source images, constraining the gradient of the fused image to within a particular threshold. Obtaining the gradient information of the images is therefore crucial for this method. Paul et al. [43] took as input the gradient of each image component at every pixel and solved the Poisson equation at each resolution to achieve boundary continuity in the gradient domain. Wang et al. [44] proposed a gradient domain image fusion method based on the structure tensor, in which the source images were stacked into a multi-valued image and the structure tensor of each source image was calculated from its gradient map.
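The general idea (fuse gradients, then recover an image whose gradient matches the fused field) can be sketched in one dimension, where solving the Poisson equation reduces to a cumulative sum. The sketch below is an illustration under assumptions, not the method of [43] or [44]: it keeps the gradient with the larger magnitude at each position and anchors the result at the average of the first samples. The function names are hypothetical.

```python
def gradient(signal):
    """Forward differences between neighbouring samples."""
    return [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]


def fuse_gradient_domain(sig_a, sig_b):
    """Fuse two 1-D signals by selecting gradients, then integrating."""
    grad_a, grad_b = gradient(sig_a), gradient(sig_b)
    # Keep the stronger gradient at each position (assumed selection rule).
    fused_grad = [ga if abs(ga) >= abs(gb) else gb
                  for ga, gb in zip(grad_a, grad_b)]
    # In 1-D, integrating the gradient field is a running sum; the starting
    # value is a free constant, chosen here as the mean of the first samples.
    fused = [(sig_a[0] + sig_b[0]) / 2.0]
    for g in fused_grad:
        fused.append(fused[-1] + g)
    return fused
```

In two dimensions the fused gradient field is generally not integrable, which is why a Poisson solver is needed there instead of a simple running sum.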
Gradient domain image fusion can improve the visual quality of the image while retaining the details and structural information of the source images. It can be applied not only to multi-focus image fusion but also to multi-exposure image fusion.

Multi-Focus Fusion Methods Based on Boundary Segmentation
This paper proposes a new category named the boundary segmentation method. According to the spatial characteristics of the source images, methods in this category generate a weight map for each source image by processing regions of pixels, and then compute the fused image by the weighted average method or the maximum method. These methods are computationally efficient, and the fused image retains the image information of each local section. However, because of improper boundary segmentation, many algorithms based on boundary segmentation lose edges, contours, and other image details. Therefore, strengthening the extraction of image boundaries can effectively improve the quality of fusion. In this paper, the boundary segmentation method is further divided into block-based, region-based, and pixel-based fusion methods.
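The weighted average and maximum rules mentioned above can be written compactly as follows. The notation is introduced here for illustration only: I_k is the k-th source image, W_k its weight map, A_k an activity (focus) measure, and K the number of source images.

```latex
% Weighted average rule
F(x,y) = \sum_{k=1}^{K} W_k(x,y)\, I_k(x,y),
\qquad \sum_{k=1}^{K} W_k(x,y) = 1

% Maximum rule: binary weights selected by the activity measure
W_k(x,y) =
\begin{cases}
1, & k = \arg\max_{j} A_j(x,y) \\
0, & \text{otherwise}
\end{cases}
```

Under this view, block-based, region-based, and pixel-based methods differ mainly in the spatial support over which A_k is computed and W_k is held constant.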

Block-Based Methods
The earliest block-based segmentation scheme divides the source image into several fixed-size blocks, obtains the fused blocks by using a threshold-based adaptive fusion rule, and finally applies consistency verification to obtain the fused image. Because the block size is fixed, the boundaries in the fused multi-focus image are prone to blurring.
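A fixed-size block scheme of this kind can be sketched as follows. This is a simplified illustration, not the original scheme: block variance is assumed as the focus measure, the block with the larger variance is copied directly (no threshold and no consistency verification step), and the function names are hypothetical.

```python
def block_variance(image, top, left, size):
    """Variance of the pixel values inside one block (assumed focus measure)."""
    values = [image[r][c]
              for r in range(top, min(top + size, len(image)))
              for c in range(left, min(left + size, len(image[0])))]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def fuse_blocks(img_a, img_b, size=2):
    """Copy each fixed-size block from whichever source is sharper there."""
    rows, cols = len(img_a), len(img_a[0])
    fused = [[0.0] * cols for _ in range(rows)]
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            var_a = block_variance(img_a, top, left, size)
            var_b = block_variance(img_b, top, left, size)
            source = img_a if var_a >= var_b else img_b
            for r in range(top, min(top + size, rows)):
                for c in range(left, min(left + size, cols)):
                    fused[r][c] = source[r][c]
    return fused
```

When a block straddles the boundary between focused and defocused content, the whole block is still taken from a single source, which is the origin of the blurred boundaries noted above.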
Zhang et al. [19] proposed a multi-focus image fusion method based on adaptive region segmentation, which decomposed pre-registered source images into approximation coefficients and detail coefficients using the Laplacian pyramid transform. To avoid the drawback of a fixed block size, an adaptive differential evolution algorithm was designed to calculate the optimal block size. Figure 7 shows the fusion framework of this approach. This method can effectively reduce noise with high computational efficiency. De et al. [20] used a quadtree structure to realize adaptive segmentation for fusing multi-focus images. The varying block sizes were determined by the specific image content, which effectively solved the problem of the block effect.
Figure 7. Fusion framework of the method in [19]. Reprinted/adapted with permission from Ref. [19]. 2019, WSPC.
As the above shows, the block size has a crucial influence on the final fusion effect, and block-based methods are prone to blurred boundaries. In addition, the compatibility between adjacent blocks needs to be considered.

Region-Based Methods
To improve the flexibility of source image segmentation, region-based image fusion methods were developed. The region-based approach is similar to the block-based approach. The main difference is that the activity level is measured in each irregularly shaped segmented region rather than in a block. Li et al. [48] first proposed a region-based multi-focus image fusion method. Farid et al. [21] proposed a multi-focus image fusion method based on content adaptive blurring (CAB), in which the absolute difference between the original image and the CAB-blurred image was used to generate the initial segmentation map, and morphological operators and graph cut techniques were used to improve the segmentation accuracy. Xiao et al. [49] proposed an adaptive initialization method for image depth estimation. The image depth was approximated by the iterative solution of a partial differential equation. The target image was adaptively divided into three regions: a clear region, a blurred region, and a transition region. Finally, multi-focus image fusion was realized by extracting the pixels of the clear region and fusing the pixels of the transition region.

Pixel-Based Methods
Since 2012, pixel-based fusion has become a popular direction in multi-focus image fusion, mainly because it can produce accurate pixel-wise weight maps. The core problem of most pixel-based approaches is obtaining a weight map for each source image. In these methods, an activity level measurement is first applied, and then the focus values obtained from the different source images are compared to generate pixel-level weight maps. Weight maps are also known as decision maps because multi-focus image fusion can be viewed as a classification problem in which the focus attribute (focused or defocused) of each pixel is determined. In some methods, the source image is also divided into regions with different properties (such as focused/defocused/border or textured/smooth), and different fusion rules are applied according to their characteristics.
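The pipeline just described (measure activity, compare focus values, build a decision map, fuse) can be sketched as follows. This is a minimal illustration under assumptions: the absolute response of a 4-neighbour Laplacian is used as the activity measure, the decision map is binary with no refinement step, and the function names are hypothetical.

```python
def focus_measure(image):
    """Absolute 4-neighbour Laplacian response at every pixel."""
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Replicate the border so every pixel has four neighbours.
        return image[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    return [[abs(pixel(r - 1, c) + pixel(r + 1, c)
                 + pixel(r, c - 1) + pixel(r, c + 1)
                 - 4 * pixel(r, c))
             for c in range(cols)]
            for r in range(rows)]


def decision_map(img_a, img_b):
    """1 where img_a is judged sharper (or equal), 0 where img_b is."""
    focus_a, focus_b = focus_measure(img_a), focus_measure(img_b)
    return [[1 if fa >= fb else 0 for fa, fb in zip(row_a, row_b)]
            for row_a, row_b in zip(focus_a, focus_b)]


def fuse_pixels(img_a, img_b):
    """Pick each pixel from the source selected by the decision map."""
    decisions = decision_map(img_a, img_b)
    return [[a if d else b for a, b, d in zip(row_a, row_b, row_d)]
            for row_a, row_b, row_d in zip(img_a, img_b, decisions)]
```

Because each decision depends on a single pixel neighbourhood, isolated noisy pixels can flip the decision map, which is why practical methods usually smooth or verify the map before fusing.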
Du et al. [50] proposed a multi-focus image fusion algorithm based on image segmentation. The task of decision map detection was treated as image segmentation between the focused region and the defocused region, and the feature maps on the boundaries between the two regions were obtained by a multi-scale convolutional neural network. Initial segmentation, morphological operations, and watershed processing were then performed to obtain the segmentation map and the decision map. This method showed that the decision map obtained by a multi-scale convolutional neural network is reliable and can produce high-quality fused images. Ma et al. [51] proposed a dual-scale multi-focus image fusion algorithm based on an enhanced random walk. Exploiting the complementary characteristics of the dual-scale measurements allows boundaries to be aligned better and noise problems to be addressed, thus achieving a more robust fusion.
Other pixel-based fusion methods involve robust principal component analysis (RPCA) [22], random fields [23], morphological filtering [24], etc. Pixel-level fusion methods are fast and offer good real-time performance. However, operating on individual pixels is susceptible to noise, which reduces the signal-to-noise ratio and contrast of the image.

Multi-Focus Fusion Method Based on Deep Learning
Since 2014, deep learning methods have developed rapidly, delivering strong results across many applications. In general, deep learning models use the learning capability of the network to extract features from multi-focus images and to separate focused and defocused regions in order to generate all-in-focus fused images. At present, convolutional neural networks (CNNs) are among the most popular models in this field. In addition, pulse-coupled neural networks (PCNNs) and generative adversarial networks (GANs) also have many applications.

Convolutional Neural Network Model
The convolutional neural network is one of the most popular deep learning models. It supports parallel computation and is fast and efficient [52,53]. CNNs are widely used in medical image analysis, remote sensing image analysis, noise signal analysis, and other fields. Javed Awan et al. [54] used a customized 14-layer convolutional neural network, the ResNet-14 architecture, to automatically detect and evaluate ACL injuries in athletes. Zhang et al. [55] proposed a new method better suited to farmland vacancy segmentation, using an improved ResNet network as the backbone for signal transmission. Lopac et al. [56] proposed a method for the classification of noisy non-stationary time-series signals based on Cohen's class of their time-frequency representations (TFRs) and deep learning algorithms. Their approach, which combines deep CNN architectures with Cohen's class TFRs, yields high values of the performance metrics and significantly improves classification performance compared to the base model.
In the field of image fusion, a CNN can learn feature representations of the source images at different levels of abstraction, and it is trainable. A CNN extracts the features of the input images by learning filters, producing different feature maps at each level. Each unit or coefficient in a feature map is called a neuron. Generally, three operations, namely convolution (filtering), an activation function, and pooling, are used to connect the feature maps of adjacent levels [57]. The typical structure diagram of a CNN is shown in Figure 8.
Figure 8. Typical structure diagram of CNN [58]. Reprinted/adapted with permission from Ref. [58]. 2020, MDPI.
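The three operations that connect adjacent feature maps can be illustrated with a small, dependency-free sketch. It is illustrative only: the kernel is hand-picked rather than learned, ReLU is assumed as the activation function, pooling is 2x2 max pooling, and the "convolution" is the unflipped sliding-window product that CNN layers conventionally compute. The function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding ('valid' mode)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_rows) for j in range(k_cols))
             for c in range(out_cols)]
            for r in range(out_rows)]


def relu(feature_map):
    """Activation: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the strongest response in each non-overlapping 2x2 window."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]


# A vertical edge between columns 1 and 2, and a kernel that responds to it.
image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
features = max_pool_2x2(relu(conv2d_valid(image, edge_kernel)))
```

In a trained network the kernel values are learned from data, and many such kernels run in parallel to produce the different feature maps of one level.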
In 2017, Liu et al. [25] successfully applied a CNN to image fusion for the first time. Extending the classification idea of artificial neural networks, a convolutional neural network was used instead of an artificial neural network to classify the pixels of the source images. The score map of the source image is obtained from the mapped values, and the decision map is then built by consistency verification of the score map. Amin-Naji et al. [26] proposed a new CNN-based ensemble learning approach that pursues data diversity to reduce over-fitting. This ensemble-based fusion method performs better than fusion with a single CNN. Zhang et al. [59] proposed a fully end-to-end convolutional network model. This model performs fusion at the feature level, removes all pooling layers, directly adopts a fully connected layer strategy, dispenses with the mapping of the source image pixel resolution, and computes the loss on the fusion result, thereby producing the fused image end to end. This method is more concise and effective and avoids the complicated post-processing required by earlier CNN models.
The advantages of CNN-based multi-focus image fusion lie in the ability to learn features hierarchically, more diverse feature representations, strong discrimination ability, and better generalization performance. The disadvantages are that training takes a long time and that no dedicated training set exists, so purpose-built training data and image preprocessing are usually required.

Pulse Coupled Neural Network
The pulse coupled neural network model [60] was proposed on the basis of an analysis of the synchronous pulse oscillations of neurons in the visual cortex of cats and has been widely used in image fusion. Each neuron in the PCNN model corresponds to a pixel, whose sharpness is determined by the number of times the neuron fires. The more often the neuron fires, the sharper the corresponding pixel is taken to be. As shown in Figure 9, a PCNN consists of three parts: the feeding input field, the modulation field, and the firing subsystem. The stimulus is received by the feeding input field and passed to the firing subsystem through the modulation field.
Figure 9. Typical structure diagram of PCNN [61]. Reprinted/adapted with permission from Ref. [61]. 2022, MDPI.
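The firing behaviour described above can be illustrated with a heavily simplified PCNN. The sketch below is an assumption-laden illustration rather than the model of [60] or [61]: the feeding input is taken to be the stimulus itself, linking counts the 4-neighbours that fired in the previous iteration, and the threshold decays geometrically and jumps after a firing. All parameter values and names are hypothetical.

```python
def pcnn_fire_counts(stimulus, iterations=20, beta=0.2, decay=0.8,
                     v_theta=5.0, theta0=0.5):
    """Count how often each neuron fires; one neuron per pixel."""
    rows, cols = len(stimulus), len(stimulus[0])
    fired = [[0] * cols for _ in range(rows)]       # outputs of the previous step
    theta = [[theta0] * cols for _ in range(rows)]  # dynamic thresholds
    counts = [[0] * cols for _ in range(rows)]
    for _ in range(iterations):
        new_fired = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Linking input: number of 4-neighbours that fired last step.
                link = sum(
                    fired[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols
                )
                # Modulation: the feeding input scaled by the linking input.
                activity = stimulus[r][c] * (1.0 + beta * link)
                if activity > theta[r][c]:
                    new_fired[r][c] = 1
                    counts[r][c] += 1
                # The threshold decays, then jumps if the neuron just fired.
                theta[r][c] = decay * theta[r][c] + v_theta * new_fired[r][c]
        fired = new_fired
    return counts
```

For fusion, a sharpness map of each source image would typically serve as the stimulus, and each pixel would be taken from the source whose neuron fired more often; the number of iterations and the parameters are exactly the configuration burden noted for PCNN methods.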
To make the fused image clearer, Wang et al. [27] proposed a multi-focus image fusion method based on PCNN and random walk. The technique used a PCNN to measure the sharpness of the source images and constructed an initial fused image. The random walk method was then used to improve the accuracy of fusion region detection, and the final fused image was generated according to the probabilities calculated by the random walk. An improved PCNN method [28] was proposed that fuses the source images with a guided filter. The improved PCNN was excited by the intermediate fused image to generate the fused image. This method created fused images several times and fused them with the PCNN to make the fusion results more accurate.
The PCNN model can extract local details effectively and recognize image content well. However, the large number of iterations and configuration parameters makes this method highly coupled and time-consuming.
In addition to the above two models, generative adversarial networks (GANs) have also been used in multi-focus image fusion. Guo et al. [29] proposed a multi-focus image fusion method based on the least squares GAN. In the fusion process, the final fusion decision map was obtained by binary segmentation and refinement of the focus map.

Combinatorial Fusion Method
As seen from the above, different fusion methods have different characteristics. The combination fusion method combines two or more methods to exploit their respective strengths. For example, the transform domain method can be combined with the region segmentation method to extract more details and improve fusion efficiency.
Zhu et al. [30] proposed an image fusion scheme based on cartoon-texture decomposition and sparse representation. For the proposed sparse representation-based fusion method, they trained a dictionary with strong representation ability to fuse the cartoon and texture components. Yang et al. [31] proposed a multi-focus image fusion framework based on the non-subsampled contourlet transform (NSCT) and sparse representation (SR). Li et al. [32] proposed a multi-focus image fusion algorithm based on a spatial frequency-driven parameter-adaptive pulse-coupled neural network (SF-PAPCNN) and an improved sum-modified-Laplacian (ISML) in the non-subsampled shearlet transform (NSST) domain.

Evaluation Indicators

Subjective Evaluation
Subjective evaluation relies on observers to judge the quality of the image, including its edges, whether the content is clear, and whether it contains noise. However, subjective evaluation is often impractical for the following reasons: observers need relevant professional knowledge, and it is difficult to observe image details with the naked eye; the equipment and environment, such as lighting and display brightness, affect the observer's judgment; and, to ensure accuracy, evaluation meetings usually need to be organized, which is time-consuming and labor-intensive.

Objective Evaluation
Objective evaluation computes image quality by means of an algorithm, and the computed results are used as the evaluation criteria. Liu et al. [62] divided 12 popular image fusion evaluation indexes into four categories: indexes based on information theory, indexes based on image features, indexes based on image structural similarity, and indexes based on human perception. This classification has been well received by many researchers.
The commonly used objective evaluation indexes are shown in Table 2. When evaluating the quality of fused images, multiple evaluation indexes often need to be calculated. The better the evaluation results, the better the corresponding fusion effect.

Evaluation indicator | Evaluation of the effect
Information entropy (EN) | Reflects the amount of information the image carries; the larger the value, the richer the information and the better the quality.
Average gradient (AG) | Measures the clarity of the image; the greater the value, the higher the clarity and the better the quality.
Mean (MEAN) | Measures the average brightness of the image; the more moderate the mean, the better the quality.
Standard deviation (STD) | Reflects the richness of the image information; the larger the value, the more dispersed the gray-level distribution and the better the quality.
Root mean square error (RMSE) | Reflects the difference in spatial detail between the fused image and the reference image; the smaller the value, the smaller the difference and the better the quality.
Signal-to-noise ratio (SNR) | Measures the proximity between the fused image and the ideal image; the greater the value, the higher the similarity.
Normalized mutual information (QMI) | Reflects the amount of source-image information retained in the fused image; the larger the value, the better the quality.
Structural similarity (SSIM) | Compares the structural similarity between the source image and the fused image; the closer the value is to 1, the better the quality.
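Several of the simpler indicators above can be computed in a few lines. The following sketch uses common textbook definitions, which may differ in detail (normalization, logarithm base, gradient operator) from those used in the cited works; the function names are hypothetical, and images are plain lists of rows of gray values.

```python
import math
from collections import Counter


def entropy(image):
    """Information entropy (EN) in bits, from the gray-level histogram."""
    pixels = [v for row in image for v in row]
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(pixels).values())


def mean_brightness(image):
    """Mean (MEAN): average gray value."""
    pixels = [v for row in image for v in row]
    return sum(pixels) / len(pixels)


def standard_deviation(image):
    """Standard deviation (STD) of the gray values."""
    pixels = [v for row in image for v in row]
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((v - mu) ** 2 for v in pixels) / len(pixels))


def average_gradient(image):
    """Average gradient (AG) from horizontal and vertical forward differences."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = image[r][c + 1] - image[r][c]
            dy = image[r + 1][c] - image[r][c]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((rows - 1) * (cols - 1))


def rmse(fused, reference):
    """Root mean square error (RMSE) against a reference image."""
    diffs = [(f - g) ** 2
             for row_f, row_g in zip(fused, reference)
             for f, g in zip(row_f, row_g)]
    return math.sqrt(sum(diffs) / len(diffs))
```

EN, MEAN, STD, and AG need only the fused image, whereas RMSE requires a reference image, which is usually unavailable for real multi-focus data; this is one reason several indicators are reported together.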

Limitations
Over the last ten years, multi-focus image fusion technology has developed considerably. However, some urgent problems still need to be addressed.
(1) Image registration
Most current fusion methods focus on feature extraction from the source images and pay little attention to scene consistency, content deformation, and other registration problems. Real source images are not as accurately registered as experimental samples, so the fusion effect can be greatly affected.
In our view, multi-view registration methods can be studied to address this problem. Specifically, capturing images of similar objects or scenes from multiple perspectives can provide a better representation of the scanned object. Multi-view registration can be realized by various algorithms such as image mosaicking and 3D model reconstruction from 2D images.
(2) Fusion efficiency
Many scholars pursue the applicability and quality of fusion methods but ignore fusion efficiency. However, we believe fusion efficiency is of great value in practical applications. The difficulty may be alleviated by merging several fusion stages into a single rapid stage, thus simplifying the complex fusion process.
(3) Application scenarios
Although there are many multi-focus image fusion methods, most of them are studied and tested on public image libraries. We think it would be helpful to collect and build image libraries for specific industrial fields. Based on such libraries, and with the help of state-of-the-art mathematical theories or models, researchers can develop multi-focus image fusion methods suited to actual applications.

Conclusions
This paper describes four kinds of multi-focus image fusion methods: transform domain methods, boundary segmentation methods, deep learning methods, and combinatorial fusion methods. Each category is further subdivided, and the advantages and disadvantages of each method are compared. For different scenarios, it is necessary to choose the appropriate method. In addition, the commonly used evaluation indicators are listed; objective evaluation is more accurate than subjective evaluation and takes less effort and time. Finally, solutions are discussed based on an analysis of the shortcomings of current applications and methods.
Multi-focus image fusion effectively addresses the limited depth of field of optical lenses and has broad application prospects in many fields, such as medicine, security, and photography. It has been successfully applied to microscopic imaging [63], image deblurring [64], shape from focus [65], and information forensics [66].
To sum up, multi-focus image fusion needs further development. Solving the problem of image registration will improve the generality of the methods and expand the scope of fusion. Alongside fusion quality, time efficiency must also be pursued, with real-time fusion as the ultimate goal.

Conflicts of Interest:
The authors declare no conflict of interest.