A Learning-Based Image Fusion for High-Resolution SAR and Panchromatic Imagery

: Image fusion is an e ﬀ ective complementary method to obtain information from multi-source data. In particular, the fusion of synthetic aperture radar (SAR) and panchromatic images contributes to the better visual perception of objects and compensates for spatial information. However, conventional fusion methods fail to address the di ﬀ erences in imaging mechanism and, therefore, they cannot fully consider all information. Thus, this paper proposes a novel fusion method that both considers the di ﬀ erences in imaging mechanisms and su ﬃ ciently provides spatial information. The proposed method is learning-based; it ﬁrst selects data to be used for learning. Then, to reduce the complexity, classiﬁcation is performed on the stacked image, and the learning is performed independently for each class. Subsequently, to consider su ﬃ cient information, various features are extracted from the SAR image. Learning is performed based on the model’s ability to establish non-linear relationships, minimizing the di ﬀ erences in imaging mechanisms. It uses a representative non-linear regression model, random forest regression. Finally, the performance of the proposed method is evaluated by comparison with conventional methods. The experimental results show that the proposed method is superior in terms of visual and quantitative aspects, thus verifying its applicability. and performance of the proposed method. The results are comprehensively compared with those of conventional


Introduction
Recently, various high-resolution satellite sensors have increasingly been developed, especially the synthetic aperture radar (SAR) imaging sensor, which has an important advantage in Earth observations [1,2]. It is an active sensor that provides its own source of illumination, which is independent of solar illumination and is not affected by daylight or night darkness [3]. It can also penetrate through atmospheric effects, allowing for Earth observation regardless of weather conditions such as rain, fog, smoke, or clouds [4,5]. Information contained in a SAR image depends on the backscattering characteristics of the surface targets and is sensitive to the geometry of the targets [6]. The image provides information on surface roughness, object shape, orientation, or moisture content [7,8]. Furthermore, the SAR image can highlight objects that have a low contrast in optical images. However, interpreting the details in SAR images is a challenging task for several reasons: (1) SAR images inherently contain geometric distortions due to distance-dependence along the range axis and signatures related to radar signal wavelengths [9]; (2) the human eye is familiar with the visible part of the electromagnetic spectrum and is not adapted to the microwave-scattering phenomenon [10]; (3) the reflectance properties of objects in the microwave range depend on the frequency band used and may significantly differ from the usual assumption of diffuse reflection at the Earth's surface [11]; (4) since SAR images are inherently coherent during the process of their generation, speckle noise is inevitable in the resulting images, rendering the images unintuitive [12]; and (5) such images also contain the after-effects caused by foreshortening, slant-range scale distortion, layover, and 2 of 21 shadows [13,14]. Thus, the SAR image can be visually difficult to interpret and, ultimately, this data improvement approach is designed at the end to be implemented in the monitoring and analyzing earth surface issues that offering an advanced solution for many applications including environmental studies [15].
To improve the quality and interpretability of SAR images, image fusion with optical images, which contain information regarding reflective and emissive characteristics, can be a good alternative [16][17][18]. In particular, the panchromatic image can be utilized because it is physically sensitive to ground objects and reflects the objects' contour information with high spatial resolution and abundant textural features [19]. The overall concept of image fusion between the SAR and panchromatic images is to incorporate spatial details extracted from the panchromatic image into the SAR image by using an appropriate algorithm [20]. Therefore, the fusion of the SAR and panchromatic images makes it possible to use complementary information and contributes to a better understanding of the objects in target areas [21]. Furthermore, the fusion of SAR and panchromatic images has additional benefits, such as the sharpening of image quality, enhancement of certain features that are invisible with either data set in the non-combined state, complementation of data sets for improved classification, detection of changes using multi-temporal data, and substitution of missing information in one image with signals from another sensor image [1].
However, because of the significant differences between the imaging mechanisms of the SAR and optical sensors, the generation of surface features of the same object are different in SAR and panchromatic images [5]. Conventional image-fusion methods such as principal component analysis (PCA) and high-pass filtering are not appropriate because they do not consider the differences in imaging mechanisms and the spectrum characteristics between the two image types [22]. An alternative approach is multiscale decomposition, based on which various methods have been proposed for the fusion of SAR and panchromatic images; however, these methods have some limitations [19,20,22,23]. For image fusion based on these methods, the SAR and panchromatic images are represented by the fixed orthogonal basis function, and the image fusion is performed by the fusion of the coefficients of different sub-bands in the transform domain [24]. Because some features cannot be represented sparsely, this fusion cannot represent all useful features accurately due to limited fixed transforms [20]. In particular, the discrete wavelet transforms (DWT) fusion method only uses features of single pixels to make decisions, and it is not shift-invariant [25]. Similarly, the contourlet transform (CT)-based fusion method lacks shift-invariance, which results in pseudo-Gibbs phenomena around singularities, and it has difficulty in preserving edge information. The non-subsampled contourlet transform (NSCT)-based method, which is a fully shift-invariant form of the CT, leads to better frequency selectivity and regularity [26]. However, this method still fails to fuse the features of physically heterogenous images [5]. Another approach is the sparse representation method, in which the generation of dictionary and sparse coding is crucial [24]. This method can extract potential information from input images in addition to representing them sparsely; however, this method does have limitations. Firstly, the advanced sparse coefficients fusion rule may cause spatial inconsistency, and secondly, the trained dictionary cannot accurately reflect the complex structure and detail of the input images [27].
To overcome these limitations, this study proposes a new image-fusion method that utilizes useful features as much as possible and considers the differences in imaging mechanisms. Instead of directly fusing pixels or decomposing them to perform fusion in a limited transform, this algorithm aims to extract sufficient features and establish relationships to fuse the SAR and panchromatic images. This makes it possible to contain the structural and detailed information of panchromatic images and increase the overall interpretability of SAR images [28]. Furthermore, a learning-based approach is used to account for the differences in imaging mechanisms of the SAR and panchromatic images. Random forest (RF) regression, which can model non-linear relationships, is utilized, and learning is performed for each class to reduce the complexity of the algorithm and for better predictions [29,30]. Then, experiments are performed on multiple scenes to demonstrate the capability and performance of the proposed method. The results are comprehensively compared with those of conventional image-fusion methods. The main contributions of this study can be summarized as follows: (1) this is the first learning-based approach for fusing single high-resolution SAR and panchromatic images; (2) to consider the differences in imaging mechanisms, this method uses RF regression, which can model non-linear patterns, avoids overfitting, and is relatively robust to the presence of noise; (3) this method performs classification of the image, where the complexity is reduced by establishing relationships for each class.; and (4) this method extracts various features to consider sufficient information.
The rest of this paper is organized as follows: Section 2 describes materials used in detail and the proposed algorithm in detail. In Section 3, the results of the proposed method are presented, and they are compared with those of the conventional image-fusion methods and discussed. Finally, Section 4 concludes the paper.

Study Site and Dataset
The study areas are Gwangjin-gu and Seongdong-gu, located in Seoul, in central South Korea ( Figure 1). These areas are covered by forests, grass, barren land, water, and developed structures; thus, they represent an extensive range of terrain morphologies. The dataset used in the experiments for the panchromatic image type is WorldView-3, and for the SAR image type, the Korea Multi-Purpose Satellite-5 (KOMPSAT-5) dataset is used. The WorldView-3 dataset used in this study was acquired on 7 August 2015; the images in this dataset have a spatial resolution of 0.31 m and a radiometric resolution of 11 bits [31]. The KOMPSAT-5 dataset was acquired on 10 September 2015; it was obtained in the enhanced high-resolution mode with a spatial resolution of 1 m, an ascending orbit, and horizontal transmit-horizontal receive (HH) polarization. The processing level was L1D, which performs terrain correction and then geolocates onto a digital elevation model (DEM) with cartographic projection [32]. Initially, speckle noise exists in the SAR images; however, it is expected to reduce through filtering, thereby providing better information. In this study, a gamma map filter of 5 × 5 kernels, which is the most efficient filter for reducing speckles while preserving object edges, is selected for speckle filtering [33]. Furthermore, because the filtered KOMPSAT-5 images should be calculated with the same weights as the weights used for the WordlView-3 images, the filtered KOMPSAT-5 images are configured with a matching pixel value range [20]. For the fusion scheme, the KOMPSAT-5 images are resampled at a resolution of 0.31 m to match that of the WorldView-3 images. Next, to remove the misregistration error term, image registration is applied using manual ground control points, followed by geometric transformation. In addition, the coordinate system of each image is projected as the Universal Transverse Mercator Coordinate System (UTM). Then, for a reasonable computation time, experiments are performed with subsets of 2000 × 2000 pixels, and the total area of the three sites is selected to validate the proposed method. Table 1

Methods
The proposed fusion framework can be decomposed into four steps for the preprocessed images: (1) selection of training pixels, (2) classification, (3) feature extraction, and (4) learningbased image fusion; they are shown in Figure 5. In the first step, the pixels to be used for the training are selected, and in the second step, classification is performed on the SAR and panchromatic images. In the third step, feature descriptors are extracted to be used for training as the SAR image, and in the fourth step, fusion is performed by establishing a relationship based on learning. These steps are described below.

Methods
The proposed fusion framework can be decomposed into four steps for the preprocessed images: (1) selection of training pixels, (2) classification, (3) feature extraction, and (4) learningbased image fusion; they are shown in Figure 5. In the first step, the pixels to be used for the training are selected, and in the second step, classification is performed on the SAR and panchromatic images. In the third step, feature descriptors are extracted to be used for training as the SAR image, and in the fourth step, fusion is performed by establishing a relationship based on learning. These steps are described below.

Methods
The proposed fusion framework can be decomposed into four steps for the preprocessed images: (1) selection of training pixels, (2) classification, (3) feature extraction, and (4) learning-based image fusion; they are shown in Figure 5. In the first step, the pixels to be used for the training are selected, and in the second step, classification is performed on the SAR and panchromatic images. In the third step, feature descriptors are extracted to be used for training as the SAR image, and in the fourth step, fusion is performed by establishing a relationship based on learning. These steps are described below. Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 22

Selection of Training Pixels
In the step involving the selection of training pixels, meaningful pixels to be used for establishing the relationship should be selected. In particular, training pixels should be selected to consider the differences in imaging mechanisms. This study selects invariant pixels, that is, pixels with little difference in reflectance between the two images. In other words, the relationships are established based on invariant pixels, and the values of pixels with substantially large differences are predicted [28,34]. The invariant pixels are acquired through image differencing, which is a method that subtracts pixel values between the SAR and panchromatic images, in accordance with Equation (1):

Selection of Training Pixels
In the step involving the selection of training pixels, meaningful pixels to be used for establishing the relationship should be selected. In particular, training pixels should be selected to consider the differences in imaging mechanisms. This study selects invariant pixels, that is, pixels with little difference in reflectance between the two images. In other words, the relationships are established based on invariant pixels, and the values of pixels with substantially large differences are predicted [28,34]. The invariant pixels are acquired through image differencing, which is a method that subtracts pixel values between the SAR and panchromatic images, in accordance with Equation (1): where I S is the pixel value in the SAR image, I P is the pixel value in the panchromatic image, i and j represent rows and columns, respectively, and C is an arbitrary constant. Then, Otsu's method is used to classify change and no-change, where the no-change region is selected as the invariant pixels.

Classification
To reduce the complexity of the algorithm and to enforce a higher prediction, classification is performed in this study [28]. In other words, each class is obtained, and learning is performed independently for each class. Here, classification is performed by stacking two images to consider the characteristics of both SAR and panchromatic images using fuzzy C-means (FCM), which is an unsupervised classification algorithm [35]. FCM is based on the optimization of the objective function based on the similarity measure considering the distance between data and the center of the cluster as shown by Equation (2): where N is the number of data; c is the number of clusters; u kn is the membership function and satisfied the condition 0 ≤ u kn ≤ 1, c k=1 u kn = 1; m is a weighting exponent that control the degree of fuzziness in the resulting membership functions and is set to 2 for simplicity [36]; d 2 (y n , v k ) = y n − v k 2 is squared distance between y n and v k , in which Y = [y 1 , y 2 , · · · y n ] is a dataset to be grouped and v k is the cluster center. To minimize the objective function, the FCM algorithm performs an iterative process, and the membership functions and cluster centers are defined as Equations (3) and (4): U and V are iteratively updated to obtain an optimal solution, and the iterative process ends when U (r) − U (r−1) < ε, where U (r) and U (r−1) are the membership functions in the rth and r − 1th iterations and ε is a predefined small positive threshold [37]. Furthermore, the number of clusters is a key parameter in the proposed method as it determines the number of training models in which land-cover distribution characteristics as well as performance and training time should be considered. If there are not enough clusters, the land-cover distribution characteristics will be neglected, and if there are too many clusters, the training time will increase, complex computations will be necessary, and overtraining may occur. Therefore, in this study, the number of clusters is set to 6 to not only obtain appropriate performance and training times but also to consider the land cover distribution characteristics [28].

Feature Extraction
Conventional image-fusion methods use only the pixel values of SAR and panchromatic images. However, in general, the gray level of single pixels is not informative; therefore, additional information other than the pixel values is necessary [38,39]. To ensure that abundant information is considered, this study uses texture information. Several approaches exist for extracting texture features, for example, the gray-level co-occurrence matrix, local binary patterns, and Gabor filters [40][41][42], among which the Gabor filter is selected; this filter is inspired by a multi-channel filtering theory for processing visual information in the human visual system [43]. It is advantageous in terms of invariance to illumination, rotation, scale, and translation; thus, it has been successfully applied for various image processing and machine vision applications [44]. The 2-D Gabor function comprises a complex sinusoid modulated by a Gaussian envelope, in which the Gabor filter includes a real component and an imaginary one. In this study, because of the substantial magnitude of the images, only real components were considered. Calculation without imaginary components would cause small discrepancies; however, the results are still efficient in terms of feature extraction time [45]. This can be represented as Equations (5)-(7): where a and b are pixel positions; γ is the spatial aspect ratio (the default value is 0.5 in [46]); σ is the standard deviation of the Gaussian envelope; λ is the wavelength of the sinusoidal factor and 1/ λ is equal to the spatial frequency f ; ϕ is the phase offset, where ϕ = 0 and ϕ = π/2 return the real component and imaginary component, respectively [47]; and θ is the orientation.
Gabor features, generally taken as Gabor filters, are constructed by selecting different spatial frequencies and orientations. The frequency corresponds to scale information and is expressed as Equation (8): where f m is the mth frequency; f max is the central frequency of the filter at the highest frequency, for which the most commonly adopted value is √ 2/4, based on the suggestion that f max should be lower than 0.5 [48,49]; and k is the scale factor, which this study selects as 2 [50]. Then, the orientations are expressed as Equation (9): where θ n is the nth orientation and N is the total number of orientations. In the study, a total of 40 features are extracted by selecting five scales and eight orientations. Then, to reduce the dimensionality of features and condense the relevant information, PCA is applied. The dimension of the features is compressed to six, a value which considers both the information of the features and the efficiency of computation [51]. Furthermore, as supplementary features, the mean and standard deviation, considering the information of neighboring pixels, are included. Here, to reflect both the coarse and fine-texture information of neighborhoods sufficiently, window sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 are selected.

Learning-Based Image Fusion
As mentioned above, there are significant differences in imaging mechanisms between SAR and panchromatic images. To consider the differences in the imaging mechanisms, compositive characteristics should be utilized, and non-linear relationships are required. Therefore, this study employed RF regression, which is a representative algorithm that considers compositive characteristics and models non-linear relationships. RF regression is based on the classification and regression tree (CART) model, which is an ensemble-based algorithm that combines several decision trees and obtains results [52]. For the classification, the results are obtained by most votes from the tree results, whereas for regression, the tree results are averaged [53]. In particular, each tree is created independently through a process called bootstrap aggregation, or bagging, to avoid correlations with other trees; in this process, training data subsets are drawn by randomly resampling the subsets with replacement from the original training dataset [54,55]. Thus, this process is robust to the presence of noise or slight variations in the input data, has greater stability, and increases the prediction accuracy [56,57]. Furthermore, in each tree, approximately 30% of the data is excluded from the training process, which is called out-of-bag (OOB) data. The mean squared error (MSE) between the OOB data and the data used for growing the regression trees is obtained; then, a prediction error called the OOB error is calculated for each variable [53]. This error estimates the importance of every variable, such that the influence of each input variable can be further analyzed. To determine the importance of the input variables, each variable is permuted, and regression trees are grown on the modified dataset [58]. The variable importance is calculated based on the difference in the MSE between the original OOB dataset and the modified dataset. In other words, if the exclusion of a variable leads to a significant reduction in prediction accuracy, the variable is considered important.
In addition, the RF algorithm requires the specification of two parameters: the number of variables to be used for the best split at each node (m try ) and the number of trees in the forest (n tree ). In regression problems, the standard value of m try is one-third of the total number of input variables; thus, in this study, m try was selected as 5 [59]. Regarding n tree , previous studies have shown that using a large value for n tree provides better stability. However, recent studies have revealed that n tree has no significant effect on performance; thus, in this study, n tree was selected as 32 considering both the performance and training time [28,34,52].
Using the two parameters, the RF is modeled and generated independently for each class, which leads to a reduction in the complexity of the algorithm and allows more information to be retrieved. For each class, supervised learning is performed by setting the features extracted from the SAR image as independent variables and the pixel values of the panchromatic image as dependent variables for the positions corresponding to previously obtained invariant pixels. Then, the features of the SAR image corresponding to all the pixels of each class are extracted and utilized as input values of the obtained RF regression model. Finally, the fusion result is generated by integrating the predicted values for each class.

Criteria for Fusion Quality Assessment
The quality of the image-fusion results can be evaluated according to two criteria. First, the performance of the fusion results for the proposed method can be intuitively evaluated in terms of visual aspect. Second, quantitative evaluation can be used to obtain the performance of fusion results, which must be statistical and objective [22]. To assess the performance, the fusion quality index (FQI), average gradient (AG), and spatial frequency (SF) are selected. FQI is an index for evaluating the quality of a fused image for given input images; it is based on the combination of luminance distortion, contrast distortion, and loss of correlation of coefficient over local regions into a single measure [60]. Given the SAR image I S , the panchromatic image I P , and the fused image I F , FQI is defined as Equation (10): where λ(w) = σ I S 2 σ I S 2 +σ I P 2 is a weight computed over a window w, in which σ I S 2 and σ I P 2 are the variance of the SAR and panchromatic images, respectively; c(w) = C(w) w ∈W C(w ) is a saliency computed over a window w, where C(w) = max σ I S 2 , σ I P 2 ; QI(I S , I F w) and QI(I P , I F w) are the quality indexes of the fused image with regard to the SAR and panchromatic images, respectively; and w is set to 8 × 8 [60]. Given two images a and b, the QI is defined as Equation (11): where µ a and µ b are the means of the respective images; σ a and σ b are the standard deviations of the respective images; and σ ab is the covariance of the two images. AG represents information on the edge details of an image, which is sensitive to the details of contrast and texture in the image [22]; it is defined as Equation (15): where ∆I Fx and ∆I F y are the differences in the x and y directions in the fused image, respectively. SF reflects the active degree of an image in the spatial domain [61] and is defined as Equations (16)- (18): SF = Row f requency 2 + Column f requency 2 FQI has the range of [0, 1], and an FQI value closer to 1 indicates better performance, whereas for AG and SF, higher values indicate better performance [61].

Comparison of Fusion Results
To evaluate the effectiveness of the proposed fusion approach, its results were compared with those of the conventional image-fusion algorithm. To ensure a fair comparison, fusion algorithms using a single SAR and panchromatic image were considered, where the à-trous wavelet decomposition (ATWD) [20], DWT [23], NSCT [19], and NSCT-pulse couple neural network (NSCT-PCNN) [5] methods were selected. The ATWD method is based on the importance of the wavelet coefficient, which is incorporated into the SAR image at a certain high frequency. For the DWT method, the maximum values of the coefficients at low frequencies and high frequencies are selected as the low and high frequencies, respectively. The NSCT method is based on the contourlet transform without downsamplers and upsamplers, and it also selects the averaging scheme at a low frequency and the maximum scheme at high frequency. The NSCT-PCNN method performs fusion based on PCNN for coefficients at low frequencies, and the coefficients at high frequencies are obtained through NSCT. In accordance with the aforementioned details, the decomposition level of NSCT and NSCT-PCNN was selected as the three showing the best image-fusion results [19]. Furthermore, the experiments were carried out on a desktop PC with an Intel(R) Core (TM) i7-8700 @ 3.20 GHz processor, 24.00 GB of RAM, and a 64-bit Windows 10 operating system. Particularly, all experiments involving the proposed model were programmed in Python 3.7, and the conventional methods were programmed in MATLAB 2019a. The image-fusion results are shown in Figures 6-8.   From an overall visual inspection, the results of the proposed method and those of the conventional fusion methods provided more information than the original single image. They contained spatial information of the panchromatic image, such as the line information and edge information of buildings, as well as the object information of the SAR image. However, in the results of ATWD and DWT, spatial information was insufficient compared with those of other methods. For Site 1, which primarily consisted of vegetation and included developed structures, the surface roughness of the SAR image in both these areas was emphasized, resulting in less spatial information. Sites 2 and 3 mainly consisted of developed structures, and the surface roughness of the SAR image was also emphasized more than the line and edge information of buildings, like the results for Site 1. Furthermore, more spatial information was present in the result of the NSCT than in the result of ATWD or DWT; however, it was also confirmed that the object information of the SAR image was lost compared to the original SAR image. The result of the NSCT-PCNN included more spatial and object information compared to those of the conventional image-fusion methods; however, the spatial information of the vegetation in Site 1 was somewhat insufficient. In contrast, the proposed method included sufficient spatial and object information regardless of vegetation or developed areas. The specific details are indicated on the red rectangle in Figures 6-8, and the enlarged areas are shown in Figure 9. From an overall visual inspection, the results of the proposed method and those of the conventional fusion methods provided more information than the original single image. They contained spatial information of the panchromatic image, such as the line information and edge information of buildings, as well as the object information of the SAR image. However, in the results of ATWD and DWT, spatial information was insufficient compared with those of other methods. For Site 1, which primarily consisted of vegetation and included developed structures, the surface roughness of the SAR image in both these areas was emphasized, resulting in less spatial information. Sites 2 and 3 mainly consisted of developed structures, and the surface roughness of the SAR image was also emphasized more than the line and edge information of buildings, like the results for Site 1. Furthermore, more spatial information was present in the result of the NSCT than in the result of ATWD or DWT; however, it was also confirmed that the object information of the SAR image was lost compared to the original SAR image. The result of the NSCT-PCNN included more spatial and object information compared to those of the conventional image-fusion methods; however, the spatial information of the vegetation in Site 1 was somewhat insufficient. In contrast, the proposed method included sufficient spatial and object information regardless of vegetation or developed areas. The specific details are indicated on the red rectangle in Figures 6-8, and the enlarged areas are shown in Figure 9. Although visual analysis is direct and intuitive, it is also highly subjective and, therefore, may not allow for fully accurate evaluation. Thus, the performance of the fusion results was further evaluated quantitatively based on FQI, AG, and SF, which are summarized in Table 2. Regarding the FQI, the proposed method performed better than the conventional image-fusion methods in all sites. For site 1, the proposed method showed improvements of 8.51%, 6.24%, 27.14%, and 19.90% over ATWD, DWT, NSCT, and NSCT-PCAA, respectively, in addition to respective improvements of 2.78%, 0.53%, 21.92%, and 24.63% for Site 2 and respective improvements of 2.77%, 0.45%, 20.25%, and 14.68% for Site 3. The higher FQI of the proposed method indicates that its fusion results contain more of the information of the SAR and panchromatic images. In contrast, AG yielded different results for each site, as follows. For Site 1, the NSCT-PCNN had the highest value, whereas at Sites 2 and 3 the proposed method had the highest value. AG represents the spatial information in the panchromatic image in addition to the object information and surface roughness in the SAR image. As mentioned above, Site 1 consisted mostly of vegetation, and the result of the NSCT-PCNN contained most of the surface roughness information of the SAR image with a lack of the spatial information of the panchromatic image of the vegetation area. Because of this, the texture features of the vegetation were best highlighted owing to the influence of surface roughness in the calculation of AG. However, for Sites 2 and 3, which consisted mainly of developed structures, the spatial information of the panchromatic image and the object information of the SAR image were the main information, and the result of the proposed method had the highest abundance with regard to both aforementioned sets of information. Regarding the SF, which is primarily a metric for assessing the spatial information derived from the panchromatic image, the proposed method exhibited the best performance in all sites. In other words, it is confirmed that the proposed method would be more useful than the conventional image-fusion methods in visual and quantitative evaluations. Although visual analysis is direct and intuitive, it is also highly subjective and, therefore, may not allow for fully accurate evaluation. Thus, the performance of the fusion results was further evaluated quantitatively based on FQI, AG, and SF, which are summarized in Table 2. Regarding the FQI, the proposed method performed better than the conventional image-fusion methods in all sites. For site 1, the proposed method showed improvements of 8.51%, 6.24%, 27.14%, and 19.90% over ATWD, DWT, NSCT, and NSCT-PCAA, respectively, in addition to respective improvements of 2.78%, 0.53%, 21.92%, and 24.63% for Site 2 and respective improvements of 2.77%, 0.45%, 20.25%, and 14.68% for Site 3. The higher FQI of the proposed method indicates that its fusion results contain more of the information of the SAR and panchromatic images. In contrast, AG yielded different results for each site, as follows. For Site 1, the NSCT-PCNN had the highest value, whereas at Sites 2 and 3 the proposed method had the highest value. AG represents the spatial information in the panchromatic image in addition to the object information and surface roughness in the SAR image. As mentioned above, Site 1 consisted mostly of vegetation, and the result of the NSCT-PCNN contained most of the surface roughness information of the SAR image with a lack of the spatial information of the panchromatic image of the vegetation area. Because of this, the texture features of the vegetation were best highlighted owing to the influence of surface roughness in the calculation of AG. However, for Sites 2 and 3, which consisted mainly of developed structures, the spatial information of the panchromatic image and the object information of the SAR image were the main information, and the result of the proposed method had the highest abundance with regard to both aforementioned sets of information. Regarding the SF, which is primarily a metric for assessing the spatial information derived from the panchromatic image, the proposed method exhibited the best performance in all sites. In other words, it is confirmed that the proposed method would be more useful than the conventional image-fusion methods in visual and quantitative evaluations. Table 2. Evaluations for the image fusion methods (FQI: fusion quality index, AG: average gradient, SF: spatial frequency, ATWD: à-trous wavelet decomposition, DWT: discrete wavelet transforms, NSCT: non-subsampled contourlet transform, NSCT-PCNN: non-subsampled contourlet transform-pulse couple neural network).

Validation of Random Forest Regression
As mentioned above, the RF regression models were constructed independently for each class, thus, the predictive models were verified separately. The classification images of each site are shown in Figure 10, and the characteristics of each class are as follows: Classes 1 and 2 represent areas with high backscattered intensity and double bounce scattering characteristics, because of the artifacts in the SAR image, where the intensity of class 2 is lower than that of class 1. Classes 3 and 4 include the specular reflection characteristics of the SAR image and bare land or those of the high-brightness roofs in the panchromatic image, where class 4 is brighter than class 3 in the panchromatic image. Classes 5 and 6 are composed of vegetation, roads, or low buildings in the panchromatic image; the low backscattered intensity characteristics of class 5 and diffuse scattering characteristics of class 6 are shown in the SAR image.

Validation of Random Forest Regression
As mentioned above, the RF regression models were constructed independently for each class, thus, the predictive models were verified separately. The classification images of each site are shown in Figure 10, and the characteristics of each class are as follows: Classes 1 and 2 represent areas with high backscattered intensity and double bounce scattering characteristics, because of the artifacts in the SAR image, where the intensity of class 2 is lower than that of class 1. Classes 3 and 4 include the specular reflection characteristics of the SAR image and bare land or those of the highbrightness roofs in the panchromatic image, where class 4 is brighter than class 3 in the panchromatic image. Classes 5 and 6 are composed of vegetation, roads, or low buildings in the panchromatic image; the low backscattered intensity characteristics of class 5 and diffuse scattering characteristics of class 6 are shown in the SAR image. The evaluation was performed visually with a scatter plot and quantitatively using the coefficient of determination ( ), as shown in Figure 11 and Table 3. The scatter plot represents the correlation between data, and there are high correlations among all classes regardless of the site. In particular, Site 1 showed a high correlation among classes 1, 2, and 4, whereas Sites 2 and 3 showed a high correlation between classes 1 and 4. The other classes showed a moderate bias but a The evaluation was performed visually with a scatter plot and quantitatively using the coefficient of determination (R 2 ), as shown in Figure 11 and Table 3. The scatter plot represents the correlation between data, and there are high correlations among all classes regardless of the site. In particular, Site 1 showed a high correlation among classes 1, 2, and 4, whereas Sites 2 and 3 showed a high correlation between classes 1 and 4. The other classes showed a moderate bias but a sufficiently high correlation. Considering R 2 , a high value of which indicates the high precision and accuracy of the model, similar tendencies are observed, as follows. For all sites, classes 1 and 4 had the highest R 2 , and both exhibited similar properties. However, for class 2, Sites 2 and 3 were somewhat lower than Site 1, which is thought to be because of the complex structure of many buildings. Furthermore, classes 3, 5, and 6 involved several characteristics, which can lead to relatively low R 2 values. However, the overall results are reasonable; thus, the robustness of the constructed modes is confirmed.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 15 of 22 sufficiently high correlation. Considering , a high value of which indicates the high precision and accuracy of the model, similar tendencies are observed, as follows. For all sites, classes 1 and 4 had the highest , and both exhibited similar properties. However, for class 2, Sites 2 and 3 were somewhat lower than Site 1, which is thought to be because of the complex structure of many buildings. Furthermore, classes 3, 5, and 6 involved several characteristics, which can lead to relatively low values. However, the overall results are reasonable; thus, the robustness of the constructed modes is confirmed.

Variable Importance
To evaluate the influence of variables on RF regression, the variable importance scores were obtained. In particular, the variable importance scores were evaluated for each site, and the importance of each variable was averaged by all classes, as shown in Table 4. In terms of the importance of an individual variable, regardless of the site, the intensity of the SAR image contributed the most, followed by the mean of window sizes 3 × 3, 5 × 5, 7 × 7, and 9 × 9. In the case of the Gabor filter and standard deviation, the contribution was approximately 4-6%, which was relatively insignificant. On the other hand, in terms of the variable type, the contributions of intensity, Gabor filter, mean, and standard deviation were approximately 13%-16%, 25%-30%, 36%-37%, and 17%-20%, respectively; thus, it is confirmed that all variables are properly influenced.
However, it should be noted that the variable importance scores are relative; therefore, they depend on the number of variables included. In other words, the importance scores can be changed by removing or replacing the predictors, as different inter-correlated variables could act as substitutes.

Variable Importance
To evaluate the influence of variables on RF regression, the variable importance scores were obtained. In particular, the variable importance scores were evaluated for each site, and the importance of each variable was averaged by all classes, as shown in Table 4. In terms of the importance of an individual variable, regardless of the site, the intensity of the SAR image contributed the most, followed by the mean of window sizes 3 × 3, 5 × 5, 7 × 7, and 9 × 9. In the case of the Gabor filter and standard deviation, the contribution was approximately 4-6%, which was relatively insignificant. On the other hand, in terms of the variable type, the contributions of intensity, Gabor filter, mean, and standard deviation were approximately 13-16%, 25-30%, 36-37%, and 17-20%, respectively; thus, it is confirmed that all variables are properly influenced.
However, it should be noted that the variable importance scores are relative; therefore, they depend on the number of variables included. In other words, the importance scores can be changed by removing or replacing the predictors, as different inter-correlated variables could act as substitutes.

Additional Dataset
One additional dataset was included to verify the robustness of the proposed method. The area is St. John's, Newfoundland and Labrador, which is located along the Atlantic Ocean and mainly contains the water, grass, barren land, forest, and developed structures, and the panchromatic and SAR images are acquired from the GeoEye-1 and TerraSAR-X sensors. The GeoEye-1 image was acquired on 19 August 2019; it has a 0.46 m spatial resolution and 11-bit radiometric resolution. The TerraSAR-X was acquired on 8 August 2019; it was obtained in Staring SpotLight mode with a 0.8 m × 0.25 m spatial resolution, an ascending orbit, and HH polarization. The preprocessing was performed in the same way as that for the previously used dataset, elucidated in the previous sections, and the additional experiments were performed on two sites with a subset of 1500 × 1500 pixels. The additional experimental images and results are shown in Figures 12 and 13. From a visual inspection, it can be seen that the fusion was properly performed and that both the spatial information of the panchromatic image and the object information of the SAR image are sufficiently present in the resultant image. Furthermore, as shown in Table 5, the performance for the additional sites was like that in the previous results. That is, it is confirmed that the proposed method shows satisfactory results for the additional dataset, and its applicability is verified.

Additional Dataset
One additional dataset was included to verify the robustness of the proposed method. The area is St. John's, Newfoundland and Labrador, which is located along the Atlantic Ocean and mainly contains the water, grass, barren land, forest, and developed structures, and the panchromatic and SAR images are acquired from the GeoEye-1 and TerraSAR-X sensors. The GeoEye-1 image was acquired on 19 August 2019; it has a 0.46 m spatial resolution and 11-bit radiometric resolution. The TerraSAR-X was acquired on 8 August 2019; it was obtained in Staring SpotLight mode with a 0.8 m × 0.25 m spatial resolution, an ascending orbit, and HH polarization. The preprocessing was performed in the same way as that for the previously used dataset, elucidated in the previous sections, and the additional experiments were performed on two sites with a subset of 1500 × 1500 pixels. The additional experimental images and results are shown in Figures 12 and 13. From a visual inspection, it can be seen that the fusion was properly performed and that both the spatial information of the panchromatic image and the object information of the SAR image are sufficiently present in the resultant image. Furthermore, as shown in Table 5, the performance for the additional sites was like that in the previous results. That is, it is confirmed that the proposed method shows satisfactory results for the additional dataset, and its applicability is verified.

Conclusions
This study proposes a method that fuses high-resolution SAR and panchromatic images. A learning-based approach is adopted, and RF regression, which considers the differences in imaging mechanisms, forms the basis of the proposed method. The proposed method first selects the pixels to be used for learning and then performs classification on stacked SAR and panchromatic images to establish independent relationships for each class, thereby reducing the algorithm complexity. In particular, the number of classes is selected as six considering the land cover distributions and training time. Furthermore, to consider as many features as possible, various features are extracted from the SAR image, among which the Gabor filter and the mean and standard deviation of multiple window sizes are selected. Finally, image fusion is performed based on RF regression; then, the results are compared with those of conventional image-fusion methods. The following conclusions are obtained based on the results. First, from the visual aspect, the proposed method includes more of the object information of the SAR image and spatial information of the panchromatic image than conventional image-fusion methods. It is confirmed that sufficient information is included, regardless of vegetation and built-up areas. Second, the quantitative performance of the proposed method shows significant improvements. The performance evaluation verifies that the proposed method exhibits improved preservation of the information of the SAR and panchromatic images and results in less distortion when compared with conventional imagefusion methods. Third, when validating the RF regression model employed in the proposed method, it is confirmed that the predictive model is properly constructed. In addition, in the case of the variables selected, they contribute appropriately to the RF regression model. Finally, the applicability of the proposed model is verified by applying the proposed method to an additional dataset.

Conclusions
This study proposes a method that fuses high-resolution SAR and panchromatic images. A learning-based approach is adopted, and RF regression, which considers the differences in imaging mechanisms, forms the basis of the proposed method. The proposed method first selects the pixels to be used for learning and then performs classification on stacked SAR and panchromatic images to establish independent relationships for each class, thereby reducing the algorithm complexity. In particular, the number of classes is selected as six considering the land cover distributions and training time. Furthermore, to consider as many features as possible, various features are extracted from the SAR image, among which the Gabor filter and the mean and standard deviation of multiple window sizes are selected. Finally, image fusion is performed based on RF regression; then, the results are compared with those of conventional image-fusion methods. The following conclusions are obtained based on the results. First, from the visual aspect, the proposed method includes more of the object information of the SAR image and spatial information of the panchromatic image than conventional image-fusion methods. It is confirmed that sufficient information is included, regardless of vegetation and built-up areas. Second, the quantitative performance of the proposed method shows significant improvements. The performance evaluation verifies that the proposed method exhibits improved preservation of the information of the SAR and panchromatic images and results in less distortion when compared with conventional image-fusion methods. Third, when validating the RF regression model employed in the proposed method, it is confirmed that the predictive model is properly constructed. In addition, in the case of the variables selected, they contribute appropriately to the RF regression model. Finally, the applicability of the proposed model is verified by applying the proposed method to an additional dataset.
In future studies, the following aspects should be considered. First, by obtaining and applying the method to a sufficiently wide range of seasonal and temporal images, it should be further verified. Second, the method's usefulness should be further confirmed through application to SAR and panchromatic images obtained from other sensors. Third, the performance of the RF regression process should be improved by further extracting and combining various features. Finally, its applicability should be investigated by applying the fused images to various applications.