Multi-Resolution Collaborative Fusion of SAR, Multispectral and Hyperspectral Images for Coastal Wetlands Mapping

: The hyperspectral, multispectral, and synthetic aperture radar (SAR) remote sensing images provide complementary advantages in high spectral resolution, high spatial resolution, and geometric and polarimetric properties, generally. How to effectively integrate cross-modal information to obtain a high spatial resolution hyperspectral image with the characteristics of the SAR is promising. However, due to divergent imaging mechanisms of modalities, existing SAR and optical image fusion techniques generally remain limited due to the spectral or spatial distortions, especially for complex surface features such as coastal wetlands. This paper provides, for the ﬁrst time, an efﬁcient multi-resolution collaborative fusion method for multispectral, hyperspectral, and SAR images. We improve generic multi-resolution analysis with spectral-spatial weighted modulation and spectral compensation to achieve minimal spectral loss. The backscattering gradients of SAR are guided to fuse, which is calculated from saliency gradients with edge preserving. The experiments were performed on ZiYuan-1 02D (ZY-1 02D) and GaoFen-5B (AHSI) hyperspectral, Sentinel-2 and GaoFen-5B (VIMI) multispectral, and Sentinel-1 SAR images in the challenging coastal wetlands. Speciﬁcally, the fusion results were comprehensively tested and veriﬁed on the qualitative, quantitative, and classiﬁcation metrics. The experimental results show the competitive performance of the proposed method.


Introduction
Coastal wetlands are located at the intersection of ocean and land, which are of great significance for resource protection, climate regulation, and maintenance of biodiversity, with impacts on carbon stocks. In addition, coastal wetlands are typical complex surfaces, which are significant challenges in achieving fine mapping [1]. Recently, remote sensing (RS) technology has grown up to be a key method for wetlands survey and monitoring. The increasing availability of RS data brings rapid advancement and interest in the field of radar and optical data fusion [2][3][4]. Satellite-based optical sensors passively receive solar electromagnetic waves reflected by ground objects for imaging with rich spatial and spectral information [5]. Among them, multispectral images typically have high spatial resolution and rich details [6,7]. Hyperspectral images are characterized by nearly continuous spectral information, which is a distinguishing and popular feature for classification tasks [8]. Synthetic Aperture Radar (SAR) is an active detector, and its imaging is not affected by meteorological conditions and sunlight levels [9,10]. Radar microwave is more sensitive to dielectric properties and moisture characteristics, and the cross-polarization of SAR has a fair degree on distinguishing the vegetation canopy [11,12]. SAR can also provide backscattering properties that are different from optical, such as the geometric structure and surface roughness [13][14][15][16]. Therefore, how to perform pixel-level fusion of optical and SAR images, integrating the high spatial-spectral resolution with the characteristics of polarized backscattering, is of great significance for coastal wetlands mapping.
Due to the uncertainties inherent in the data sources, common SAR and optical image fusion techniques can be divided into three categories: SAR and multispectral image fusion, SAR-panchromatic-multispectral image fusion, and SAR-hyperspectral image fusion. The fusion of SAR and multispectral images has been applied to earth with good results. Wu et al. used the shear wavelet transform to fuse TerreSAR-X images (3 m) with Landsat-8 multispectral images to enhance the extraction of impervious surfaces [17]. Yin and Jiang made a special calculation on SAR images to minimize distortion in spectral information when fused RADARSAT-2 (PolSAR) with Landsat TM images [18]. Gaetano et al. used the registered optical data as a guide map to Sentinel-1 SAR denoising by a generalized bilateral filter, then fused SAR with Sentinel-2 MS data to improve the accuracy of land use classification [19]. Shao et al. combined the intensity-hue-saturation (IHS) fusion technique with gradient transfer (GTF) algorithms and described the fusion of SAR and multispectral images as an optical image optimization process that better preserved radiation information and spatial details [20]. Amarsaikhan et al. improved the results of land cover classification based on wavelet transform fusion-high-resolution SAR and optical images were used as the research data-and then compared the fusion results with Brovey transform, Ehlers fusion, and PCA algorithms [21]. Kulkarni et al. proposed a method combining PCA with the DWT algorithm [22], and Chen et al. proposed an algorithm using generalized IHS combined with wavelet transform, both reducing the spectral distortion of multispectral images [23]. Yang et al. used the Gram-Schmidt algorithm to fuse GF-1 (WFV) multispectral images with Radarsat-2 PolSAR images, improving the classification results of coastal wetlands in the Yellow River Estuary and increasing the accuracy of extracting mudflats and reeds [24].
The fusion of SAR, panchromatic, and multispectral images has also been explored in depth. Byun et al. provided two fusion ideas for SAR, panchromatic and multispectral images. The first is an area-based hybrid pansharpening fusion scheme. The SAR image is divided into active and inactive regions, and different fusion rules are designed for the two regions. The AWT algorithm is used to fuse the SAR and the panchromatic image, and then the multispectral image has component substitution applied to fusion images [25]. The second is a texture-based fusion scheme. The local statistical parameters are adaptively calculated to perform a weighted fusion of panchromatic images and SAR. The multispectral images are fused with the generalized IHS transform [26]. Garzelli used à-trous wavelet transform to fuse panchromatic and multispectral images, then injected structural information from SAR images into the pansharpened images to obtain the final fusion results [27]. Yonghonga extracted the high-pass details of SAR and panchromatic images by á-trous wavelets, and the panchromatic detail information was modulated by texture HPFM (high-pass filter modulation) of SAR images and then fused with multispectral images, and the fused images have spectral fidelity in vegetation, bare ground, and buildings [28].
In recent years, with the great potential of hyperspectral analysis in wetland monitoring being explored, research on SAR and hyperspectral image fusion has emerged [29][30][31][32]. Chen et al. performed the IHS transformation to fuse hyperspectral data and Topographic Synthetic Aperture radar (TOPSAR) and obtained high spectral and spatial resolution images, solving the fuzzy classification of land cover [33]. Nasrabadi used the nonlinear correlation between SAR and hyperspectral images to fuse and perform the Reed-Xioli (RX) anomaly detector based on kernel learning, improving the accuracy of mine identification [34]. Dabbiru et al. fused UAV synthetic aperture radar (UAVSAR) and hyperspectral images (HSI) from an airborne visible/infrared imaging spectrometer (AVIRIS) to improve the classification of oil-contaminated coastal vegetation [35].
Overall, the fusion of optical and SAR images can integrate the rich spatial and spectral information of optical images, as well as the backscattering polarimetric properties of SAR, which is of great significance for the fine mapping of coastal wetlands. However, available fusion methods for optical and SAR images have the following problems that need to be solved urgently: (1) On the one hand, SAR and optical images have divergent imaging mechanisms, and their image properties are quite different; the fusion is extremely likely to result in information distortion and component destruction. (2) On the other hand, coastal wetlands are typically a complex surface, which further poses a greater challenge to the collaborative fusion of the optical and SAR images. (3) In addition, the existing pixel-level fusion methods mainly focus on the synthesis of SAR and multispectral (or panchromatic) images, and there are few studies on the fusion of SAR and satellite hyperspectral images, especially for multi-sensor optical and SAR image fusion. To the best of our knowledge, there are no integrated cross-modal pixel-level fusion methods for SAR, multispectral (MS), and hyperspectral (HS) images, which are promising by combining their complementary advantages for coastal wetlands mapping.
In view of the above problems, this paper proposes a multi-resolution collaborative fusion method of MS, HS, and SAR images to obtain high-quality images for practical applications, such as coastal wetland mapping. First, the high spatial resolution images are decomposed based on edge-preserving filters, reducing the information distortion by simultaneous positioning in the spatial and frequency domain. Secondly, to make the algorithm more robust, we design optical and SAR cross-modal weight in both spatial and spectral dimensions among the fusion branch, while weighted inverse projection is performed to provide good local fidelity. Finally, the upsampled HS images are modulated by injecting spatial detail information and backscattering polarimetric properties without disturbing spectral components. The main innovations of this paper are as follows:

•
This paper firstly proposes a multi-modal collaborative fusion method for SAR, MS, and HS images to obtain the fused image with high spatial resolution, abundant spectral information, and the geometric and polarimetric properties of the SAR; • In the proposed M2CF, the optimal spectral-spatial weighted modulation and spectral compensation are designed to reconstruct the high-fidelity fusion images. Furthermore, the backscattering gradients of SAR are guided to fuse, which is calculated from saliency gradients with edge-preserving, making it more suitable for cross-modal fusion with complex surface features; • Fusion yields steady visible benefits, achieving the minimum spectral loss with high PSNR while robustly improving the classification results in coastal wetlands.
The rest of the paper is arranged as follows. Section 2 focuses on the methodological framework of the paper. Section 3 describes image preprocessing and the fusion datasets. Section 4 evaluates the fused images in terms of quantitative and qualitative metrics, and the classification accuracy is used to characterize the competitiveness of the fused images in realistic applications. Section 5 draws the discussion, and Section 6 presents the conclusions of our paper.

Framework Description
Multi-resolution analysis (MRA) framework can locate in multi-frequency dimensions with spatial and spectral separation, reducing the spectral distortion in the fusion of optical images [36,37]. It is widely used and promoted because of its efficiency and portability, such as the fusion of infrared and visible images [38]. More remarkably, the MRA-based fusion images can obtain a better signal-to-noise ratio, which is beneficial for classification. In addition, it also can be used for extensive area, in-orbit, and multi-modal fusion without bringing the computational load. That is, MRA makes it possible to achieve the best fusion performance in multiple modalities.
The remote sensing image can be processed by MRA segmentation to obtain lowfrequency and high-frequency components. The high-frequency component represents the texture details of the original image, mainly containing spatial information [39]. The low-frequency component is the base layer after filtering, which mainly contains spectral information [40]. The mathematical formula for MRA can be expressed as: where α is the scaling factor of pixel values, and β is the error term. I x ∈ R d x ×N represents the unfolded matrix of a three-dimensional digital image with d x channels by N pixels. X is the number of modalities, {I x } X x=0 . B x ∈ R d x ×N represents the low-frequency component of the filtered image, i.e., the base layer of the image; D x ∈ R d x ×N represents the highfrequency component of the original image after MRA segmentation, which is obtained by the difference between the original image and the low-frequency component.
The traditional MRA framework is not suitable for fusion tasks with multi-modal or heterogeneous images, which are highly dependent on correlation between modalities. To enhance the integration ability of multi-modal images and reduce information loss, we proposed a multi-resolution collaborative fusion algorithm. Figure 1 illustrates a general flowchart of the multi-modal MRA collaborative fusion (M2CF) method. The core idea of this fusion algorithm is to decompose the remote sensing image based on the generalized multi-resolution analysis framework by applying edge-preserving filtering. Among them, we locate the spatial and spectral dimensions of the multispectral image and the de-speckle SAR image, and decompose the image into low-frequency (spectral) and high-frequency (spatial) components. The histogram-matched low-frequency and high-frequency components are weighted, and the subimages are collaboratively fused and reconstructed according to the MRA inverter. Finally, we perform spectral compensation on the reconstructed image to obtain the final fusion image. Accordingly, we describe some important steps of M2CF in detail below.

Multi-Frequency Extraction of MS
Edge-preserving (EP) filters are widely used in natural image enhancement and fusion. EP filters can be constructed by simple linear or multiple nonlinear filters. Typical EP filters are fast bilateral filters (FBF), weighted least squares filters (WlsF), guided filters (GF), etc. The WlsF was first proposed by Farbman et al. [41] and has become increasingly extensively employed in natural image processing. Recently, the WlsF has also been applied to image denoising, image tone modulation, and image detail enhancement. It is characterized by the ability to multi-scale tone and edge detail manipulation when the image is decomposed.
In this paper, we use the WlsF-based optimization framework to decompose the multispectral images into smooth and edge parts, respectively. The principle is to ensure that the basic gradient of the image remains unchanged. The filtered and smoothed image B 1 is like the input MS image I 1 , in terms of spectral information and has distinct edge features. The filter model can be expressed as: where λ is a smoothing factor. f ∇ (I 1 ) is the image gradient of I 1 , obtained by second-order derivatives. Then, the filter model is also calculated in the least square: where the subscript k represents the spatial location of the pixel. The weight coefficients of the smoothing term are a z , a t , and the subscripts indicate direction. The first term of the function represents the difference between the input and output image. The second term is the regular term, which makes the output image smoother by minimizing the bias derivative. The definition of the weight coefficients is: where log( ) is the log value of the input image intensity band . The exponent δ (typically between 1.2 and 2.0) determines the sensitivity to the gradients of I 1 . ε is the residual difference. The filtered image B 1 can be expressed as a linear matrix equation.
where B 1 is obtained by applying a nonlinear operator f λ (·), which depends on the gradients of I 1 . I x 0 is an identity matrix. L k is a five-point spatially inhomogeneous Laplacian matrix obtained by minimizing the partial derivatives of I 1 .

Multi-Frequency Extraction of SAR
SAR images also have different bands, like optical images, such as vertical send and vertical receive (VV) polarization, vertical send and horizontal receive (VH) polarization, etc. However, the polarization modes of SAR are distinctly different from the optical bands.
In order to achieve complementary information between different polarization of SAR images while standardizing SAR images and reducing noise, polarization synthesis of SAR is required. Then, to deconstruct SAR and acquire continuous detail information, we use a guided filter (GF), which is more suitable for SAR multi-frequency extraction.

Polarization Synthesis
The polarization modes of SAR images are normalized in the range of m to n: where S min is the minimum value of pixels in a SAR polarization, and S max is the maximum value of pixels in a SAR polarization. The polarization synthesized SAR image S mg can be expressed by the following equation: where p 1 is the position weight map of the pixel vv ij with a stronger backscattering value in VV polarization above all polarization bands. Similarly, p 2 is the position weight map of the pixel vh ij with a stronger backscattering value in VH polarization above all polarization. S is the mean value of the polarization bands at a certain pixel position.
In addition, the SAR image is a grayscale image, which needs to be matched with the optical image histogram. The standard deviation σ 2 of the SAR matrix can be calculated from the pixel values: where the number of SAR pixels is N, S i is the i-th pixel of the SAR matrix i ∈ [1, N], and S mg denotes the pixel average of the SAR matrix. Histogram matching of the SAR image matrix is expressed as: Similarly, σ 3 is the standard deviation of the HS image and I 3 is the pixel average of the optical image. S * is the SAR matrix after histogram matching.

Guided Filter
The guided filter is a non-iterative filter, which belongs to the fast filter. In addition to the speed advantage, which can maintain the image gradient excellently [42], the important assumption of the guided filter is that there is a local linear relationship between the guided image and the output image within the filtering window. Meanwhile, according to the idea of MRA, the input image is composed of the base layer B 2 and details layer D 2 with spatial, sharp, texture, etc. Therefore, there are: where k is the midpoint of the local window, so the pixels belonging to the window ω k can be calculated using the corresponding pixels transformed by the coefficients of (a k, b k ). S * i indicates the value of the input SAR matrix at position i; G i is the optical guided map; B 2,i is the output matrix after guided filtering. Here, the output image can be considered as a local linear transformation of the guided map G i . In order to maintain the local linear model and minimize the value difference between the input and output images in the local window, the equation can be expressed as: Then, linear ridge regression is used with regularization parameters to fit the loss function: where, is the regularization parameter to prevent the value of a k from being too large. Solving the above equations yields the value of a and b at the guiding window as follows: where σ k is the standard deviation of the guiding image G k over the window ω k . S * k is the mean value of the SAR pixels within the filtering window. G k is the mean value of the guided image within the filtering window. After obtaining the above equation, a pair of (a k, b k ) coefficient can be calculated for each filtering window.
However, each pixel is contained in multiple operation windows, and thus multiple (a k, b k ) can be computed for each pixel position. Where the value of the output image is obtained by the mean value of the coefficients, and the above process can be described as: where a i and b i are the mean values of the coefficients calculated for all windows at the i-th pixel position, respectively.

Weight Configuration
Fusion weights of the spectral and spatial components are calculated separately. The hyperspectral image is selected for the primary spectral information after upsampling. It is modulated with two low-frequency components according to the correlation coefficient weight and the spectral transformation weight. For the high-frequency fusion branch, we apply pixel saliency gradient and spatial information ratio to fuse spatial information. The following subsections describe the details of the process.

Correlation Coefficient Weight
In the spectral information fusion branch, in order to avoid spectral distortion, it is necessary to find the fusion band of a certain MS image corresponding to the HS image. We find the corresponding fusion interval by calculating the correlation coefficient. The covariance matrix is expressed as C(·), and then the correlation coefficient weight R can be found at: where I x are the independent variable matrices. R ∈ (0, 1], the higher value of |R|, the closer relationship between the two spectral matrices.

Saliency Gradient Weight
The visual weight map (VWM) based on saliency extraction is mainly used for image fusion under multi-scale decomposition. The saliency gradient is used to design a visual weight map, which defines the visual importance of each pixel. Pixel locations with a larger VWM value usually correspond to the edge and texture areas, and these areas often contain richer spatial features. The intensity contrast between different pixels of the image is used to extract the saliency gradient. The pixel range of the image matrix I x is [m, n], then the saliency gradient is V x (q): where Cq denotes the number of pixels where the pixel value is q, ∀q, k ∈ [m, n]. The locations of the pixel with larger values on the visual weight map usually correspond to edge and texture regions. We sum the visual weight values of all pixels and take the mean value V x 1 . It can be approximated as the richness of the spatial information and is used to calculate the fusion weight of the image spatial detail.
where V x 1 denotes the average visual weight value of the image matrix with the number of N pixels, and V x is the x-th visual weight matrix. The saliency gradient weight of the image matrix x 1 is defined as Q x 1 .

Hyperspectral Band Weight
The spatial information ratio between different bands of the HS image is precisely proportional to its spatial optimization effect on the fused image. The intensity component of the HS image is calculated (the average of all bands is usually considered as the intensity value). Then, the ratio of the standard deviation of each hyperspectral band σ 3,i to the intensity band σ 3, is calculated. This ratio is used as the band weight W of HS images, it is defined as:

Spectral Transformation Weight
MS data can be considered as the spectral degradation of HS data. The multispectral image I 1 is correlated with the hyperspectral image I 3 by: where · F is the Frobenius norm of the matrix. T ∈ R d 1 ×d 3 denotes the spectral transformation matrix, e is the spectral residual. Each channel vector {T i } d 3 i=1 represents the spectral transformation matrix of the i-th HS band to multispectral image [43]. Then, the spectral transformation function can be fitted by the least square: 2.5. M2CF: Multi-Modal MRA Collaborative Fusion Table 1 gives the symbols and descriptions of the variables used in the proposed M2CF method. In the table below, the subscripts x ∈ 0, 1, 2, 3. I 1 , I 2 and I 3 represent the MS, SAR, and HS image, respectively. I 0 is a zero matrix of the same size as the hyperspectral image. I is the identity matrix.

Notation
Description Dimensions Obtained by the difference between the I x and B x .
The normalized SAR images, obtained by Equation (7). 1 × N S * SAR under synthesized and histogram matched.
Indicates upsampled I x image. d x × N Spatial penalty term to balance the equation.
The fusion coefficients of subimages.
The final fusion image after spectral compensation.
The modalities after multi-resolution decomposition are integrated according to the fusion weight based on the multi-resolution analysis inversion. Fusion rules are very important but difficult. Here, the low-frequency component F L and the high-frequency component F H of the fused image can be obtained by the weighting operation: where g l and g h are the fusion coefficients of low-frequency and high-frequency information, respectively. Q x is the saliency gradient weight. f up (·) indicates upsampling operation. I 0 is a zero matrix of the same size as the HS image after upsampling. Then, the preliminary fusion image Y can be represented as: where ∈ (0, 1] is the spatial penalty term to balance the importance of different terms. However, inverse reconstruction inevitably produces spectral residuals. Meng et al. introduced the concept of spectral compensation to improve the spectral fidelity [44]. Similarly, we obtain the spectral compensation matrix by calculating the difference between the original low-resolution hyperspectral image and the preliminary fused image, and then the spectral loss of the fused image is compensated. The spectral compensation process can be expressed as: where f MTF (·) and f down (·) denote MTF blurring and down-sampling, respectively. I 3 is the low-resolution hyperspectral image. I y denotes the final fused image.

Study Area
Owing to the complicated and heterogeneous distribution of wetland features, the coastal wetlands are also known for their typically complex and challenging surfaces. The study area includes the Yellow River Estuary, Yancheng, and the South Shore of Hangzhou Bay coastal wetlands (Figure 2). These wetlands are often used as study areas for hyperspectral image processing or wetland mapping [24,[45][46][47][48]. The Yancheng is in the east of Jiangsu Province, China. It is a typical silty mudflat coast, the largest coastal wetland on the edge of the Asian continent. The study area has formed a complex wetlands ecosystem and mainly contains salt marshes, culture ponds, and natural vegetation. The Yellow River Estuary, an alluvial delta plain formed by the accumulation of river sediment, is in the northeast of Shandong Province, China. There is an ecological protection area with a national nature reserve. It has abundant resources, and mainly contains paddy fields, potholes, and reeds. The Hangzhou Bay National Wetland (Hangzhou Bay for short) is in the northwestern part of Hangzhou Bay, Zhejiang Province, China. The wetland is rich in flora and fauna with numerous semiaquatic plants, such as Spartina alterniflora, Tamarix, Suaeda salsa, etc.

Preprocess and Datasets
In this fusion experiment, ZY-1 02D and GaoFen-5B (AHSI) hyperspectral images, Sentinel-2 and GaoFen-5B VIMI multispectral images, and Sentinel-1A SAR are multiresolution RS datasets. The European Space Agency (ESA) provides worldwide Sentinel Series RS data services (https://scihub.copernicus.eu (accessed on 1 February 2022)). Gaofen-5 provides 30 m ground sample distance, 5 nm spectral resolution with 150 bands in VNIR (400-1000 nm), and 10 nm spectral resolution with 180 bands in short-wave infrared (1000-2500 nm). The acquisition dates or satellite perigee passing time of Sentinel-1A SAR and optical images are nearly identical, which ensures that the ground type remains unchanged. Sentinel-1A SAR offers a dual polarization mode with short revisit times. We selected four 10 m spatial resolution bands in Sentinel-2A, including three visible bands and a near-infrared band.
SAR images are preprocessed in Sentinel Application Platform (SNAP) 6.0, mainly including thermal noise removal, terrain correction, and de-speckle. Among them, the SAR artificial de-speckle filter is used to suppress the noise in homogeneous regions and improve the edge features. Here, the Redefined Lee filter is chosen to remove the speckle noise with a window size of 5 × 5. The final resampled SAR image has 10 m spatial resolution with 1500 × 1500 or 1800 × 1800 pixels (Table 2). Optical images are preprocessed based on ENVI 5.3, mainly including radiometric calibration and the Fast Line-of-Sight Atmosphere (FLAASH) correction. The global digital elevation model (DEM) data at 30 m was used to correct the optical and SAR images (https://asterweb.jpl.nasa. gov/gdem.asp (accessed on 14 December 2021)). Then, the Global Digital Elevation Model (DEM) is applied to the surface reflectance products for orthorectification correction. HS images also require eliminating the bad bands, including overlapping bands, blurred or damaged bands, etc. Accurate registration, a process of geometric alignment of SAR and optical imagery, is the foundation of image fusion. Manual selection of control points on the SAR and optical images is laborious and time-consuming, and it is difficult to calibrate the corresponding points between different modalities [49]. To address the above issues, the following registration process is designed for cross-modal images: The preprocessed Sentinel-2 MS image is used as the reference image. The SAR and HS images are registered separately from the reference image. In this way, the multi-modal registration is reduced to optical registration and Sentinel series registration. Among them, the control points are artificially assisted to realize the coordinated registration. We selected the geometric angle and boundary as reference points and apply the registration module in the ENVI5.3 to alignment. The registration accuracy of the experimental datasets approaches the sub-pixel level (RSE ≈ 0.34).

Sample Selection and Distribution
To evaluate the performance of fused images for practical applications, we perform further classification experiments on three coastal wetlands. The random forest (RF) classifier as an ensemble learning algorithm and the support vector machines (SVM) classifier as a machine learning algorithm both have been widely used for RS image classification. Among them, the SVM and RF classifiers coded in MATLAB ® 2020a are used for the experiments. The SVM classifier adopts the radial basis function as the kernel function, and the variance parameter and penalization factor are estimated via cross-validation. The number of decision trees in the RF classifier is manually set to 500. The main reason for choosing two classifiers is that we expect to see a performance gain due to the fusion algorithm rather than the advanced classifier. We obtained the region of interest (ROI) for all feature classes by using high-resolution Google Earth, field sampling, and according to careful image interpretation (Figure 3). Except for image preprocessing and registration, all image processing steps were performed by MATLAB ® 2020a coding, including image fusion, evaluation metrics, and classification.

Experimental Results
In order to verify the effectiveness of the proposed method, we conducted fusion experiments on real RS images and compared quantitative metrics with several classical fusion algorithms. These include component substitution methods, model-based methods, hybrid methods, etc. All the fusion codes were run on a WIN10 computer with an Intel Core i9 processor and 128 GB RAM. The running time (in seconds) was recorded to evaluate the computational efficiency. Further classification experiments were conducted to characterize the practical application capability of the fused images.

Image Fusion Results
In this part, we use real RS data to evaluate the fusion algorithms in a more objective and specific way. The Yancheng wetlands are rich in spatial detail, and the Yellow River Estuary and Hangzhou Bay wetlands have complex spectral information. Figures 4-6 show the comparison of fusion results obtained by the proposed M2CF and other fusion algorithms. The main comparison methods include CS-based, MRA-based, tensor decomposition-based, hyperspectral super-resolution, unmixing-based, and hybrid methods [20,36,44,50]. In this, all fusion results are presented in true color with local zooms (Bottom right corner). The Wald protocol is addressed as a solution to the problem of reference image non-availability [51]. The consistency property indicates that any high-resolution fusion image should be as close as possible to the source image after down-sampling. Experiments using the original hyperspectral image as a reference and down-sampling the fused image to evaluate the spectral distortion. Tables 3-5 describe the quantitative metrics for image fusion methods, and the best one is shown in bold. Statistical quantitative evaluation metrics include the peak signal to noise ratio (PSNR), the spectral angle mapper (SAM), the correlation coefficient (CC), the root mean square error (RMSE), and the dimensionless global error in synthesis (ERGAS).       Figure 4 shows that all fusion images are brighter than the original HS images in visualization. Traditional component substitution (CS) methods have improved spatial resolution compared to the reference image (ZY-1 02D). However, CS-based methods, such as IHS, GS, and Brovey have severe spectral distortions. Specifically, the muddy water in the upper right corner of the GS, Brovey, and PRACS has been distorted to purple on the true-color images. CNMF and GSA-Hysure (hyperspectral super-resolution) methods have severe spatial distortion. Constrained by the speckle noise of SAR, CNMF shows loss of texture information, while GSA-Hysure shows obvious radar noise. Although the combination of CS and MRA resulted in little spectral distortion, the GSA-BDSD and GSA-ATWT fusion images still retained noticeable speckle noise. The proposed M2CF method has minimal spectral distortion and significant spatial optimization compared to others. Table 3 shows the robustness of the proposed method and GSA-ATWT in the spectral fidelity with spectral angle map (SAM) less than 5 and peak signal-to-noise (PSNR) ratio greater than 30. The conventional CS-based method has superior PSNR and short consumption time, but the spectral error is large. The CNMF and GSA-Hysure methods are not satisfactory. Although the SAM of the NLSTF fused result is smaller, the fusion image is blurry. In contrast, the SARF method has poorer spectral metrics, but it has better visualization compared to CS-based fusion. The M2CF achieves the minimum loss and has the best spectral fidelity with the effect of SAR noise cancellation while improving the spatial resolution. Figure 5 shows the original images and fusion results of all ten methods; Table 4 lists the quantitative metrics. This region is smooth with fewer spectral differences in features. The fusion results generally emerge from the polarimetric backscattering properties of SAR. Some bare lands are bright in the fused images, which have low spectral reflectance in the optical images, and the water potholes look darker. Traditional CS-based methods have severe spectral distortions, and hybrid methods produce spectral distortions on the individual feature class, except for the GSA-BDSD and GSA-ATWT. CS-based methods have some effects on cloud removal, such as the GS and IHS, where the intensity component from the upsampled HS data is unselectively replaced by the SAR, resulting in large differences in spectral information.

Fusion Results of Yellow River Estuary
The fusion image of CNMF loses some optical details. SARF is not suitable for image fusion with a large number of bands. The reason for this is that SARF injects spatial details into HS data intensity components to maintain spectral fidelity. The fusion image generated by NLSTF is rather noisy since less spatial information is considered in models. GSA-Hysure performs image interpolation several times during the fusion process due to the mechanism of the algorithm, which involves more strange pixels and causes more significant spatial distortion. Fusion images of the hybrid method have less spectral distortion, but have less spatial improvement, which is poorly visualized.
The Yellow River Estuary RS image is smoother so that the PSNR of the fusion results is lower, and the spatial enhancement is generally poorer. Among the evaluation metrics of the fused images, the CS-based methods have the lowest PSNR, and the SAM is the largest. For the MRA-based GSA-BDSD, to ensure the robustness of the spectral fidelity, the injection weight of the high-frequency information is rationed less, and the injected spatial details are insufficient, which limits the spatial enhancement. The performance of the hybrid method is relatively good, and the M2CF method is the best.

Fusion Results of Hangzhou Bay
The spectral information (GF-5B) used in Hangzhou Bay is richer compared to the other two study areas, and the spatial resolution is progressive between modalities. Figure 6 shows that the spectral distortion of fusion images is smaller than in the previous areas, and the spatial optimization is weaker. It is noted that there is pseudo-noise at the intersection of ocean and land on the original SAR image. The CS-based methods and CNMF have large spectral distortions, and the pseudo-noise belonging to the SAR image is still clearly visible, mainly due to their inherent processing mechanism. SARF and hybrid methods reduce the approximation error while eliminating the speckle noise. GSA-Hysure, CNMF, and NLSTF generate new noise in the fusion process of pixel operations, and such methods are not suitable for the fusion of SAR and optical images. The fusion results of the hybrid method are darker than the original GF-5 HS image. Among them, although GSA-BDSD has no obvious SAR image artifacts, it still has a significant amount of speckle noise belonging to the SAR image. Figure 6 shows the robustness of GSA-ATWT, GSA-BDSD, and NLSTF in terms of spatial fidelity and underlines the potential of the proposed method. Table 5 describes the quantitative metrics of eleven fused images in Hangzhou Bay. The CNMF method has the most severe spectral distortion. In addition, the running time of GSA-Hysure and NLSTF leave much to be desired. SARF has good visualization but poor quantitative metrics, which inject radar details into HS data with the variable gain coefficient.
According to the results from the three study areas, there is a basically identical trend in both quantitative and qualitative comparisons. CS-based methods have the least operation time but are highly dependent on the correlation between the source images. However, there are spectral mismatches between different sensors and divergent properties between modes, which can cause significant spectral distortion. The disadvantage of the PRACS method is that the dimensionless global error is large. The observations in Tables 3-5 show the robustness of GSA-ATWT, GSA-BDSD, and M2CF in terms of spectral fidelity. It is interesting that PRACS has good spectral fidelity in Hangzhou Bay, whereas it suffers serious spectral distortion in the Yancheng and Yellow River Estuary. Methods based on unmixing and hyperspectral super-resolution are prone to spatial texture detail distortion with a poor signal-to-noise ratio. NLSTF is prone to artifacts and a heavy computational burden. Apparently, hybrid methods are better than CS-based methods. Nevertheless, hybrid methods are not stable, and the algorithms are complex with long procedures. The flexibility and accuracy of the proposed M2CF are much better than that of hybrid methods. The proposed M2CF has the best fusion results; it promotes spatial resolution with high spectral fidelity. Experimental results show that the proposed method is more proper for fusing SAR, MS, and HS images.

Spectral Profiles Comparison
The reflectance at different wavelengths can be expressed in the image cube as a spectral profile. The pure pixel spectral profiles of three typical features are compared before and after the fusion (Figure 7), namely water, vegetation, and dry land/salt marsh. It can be found that the overall trend of the spectral profiles is correct and fits the original reference properly, especially in Hangzhou Bay. However, the profiles of the CS-based methods are mismatched in all three regions. Some fusion images even show negative values in the water profiles, such as the GS, Brovey, and NLSTF. The unmixing-based (CNMF) profile is mostly lower than the reference profiles in salt marshes and water. There is no regularity in the spectral profiles of the hybrid methods, and it fluctuates within a certain small range. Fortunately, including M2CF, the spectral profiles of GSA-BDSD and GSA-ATWT are high-fidelity on all four typical features, indicating that the spectral dimension was preserved well during the fusion process. Once again, the spectral fidelity of the proposed method is relatively robust.

Classification Results
To test the specific performance of the fusion methods, the fused images are employed for classification, and its application ability is characterized by the accuracy rate. The RF and SVM classifiers were trained 75 times, and then we computed three widely-used metrics for classification results, namely Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient. The classification accuracy was the mean value of 20 times to make a quantitative performance comparison. Classification metrics for HS images and optical (HS + MS) fusion images were included in the comparison to characterize the improvement in classification accuracy.

Classification Results of Yancheng
As can be observed, ZY-1 02D HS has the worst classification result, mainly due to the lower spatial resolution (Figure 8). In other words, HS has complete classification blocks, but the detail differentiation is weaker. The classification accuracy of most fusion images with high spatial and spectral resolution was enhanced. However, PRACS and CNMF fusion images are limited by the similarities in spectral reflectance. In the classification of the MRA-based hybrid fusion images, culture pond and salt marsh were easily misclassified. Among them, the BDSD fused image has many misclassified pixels in dry land and paddy field, and the boundaries in spartina alterniflora and reed are unclear.  Table 6 depicts the OA, AA, and Kappa of the classification in Yancheng. In quantitative metrics, the accuracy of the fusion images mostly exceeds that of images based on single ZY-1 02D data. However, CNMF has lower classification accuracy than optical fusion images in two classification experiments. Because of the spectral distortion results in CNMF fusion, misclassification occurs when combining spatial information. Among the accuracy metrics of the random forest classifier, the overall accuracy of Brovey, SARF, and GSA-ATWT is lower than optical fusion image, and all other methods improved to different degrees. PRACS has better accuracy in RF Classifier but is not as precise as ZY-1 02D in SVM Classifier. Collectively, the proposed M2CF's OA and Kappa were the highest among the metrics in the Yancheng, and the AA in the RF classification experiments improved about 9% over the classification results using HS alone; the OA and Kappa in the SVM experiments improved by 3.80% on average, further demonstrating that image fusion can improve the classification accuracy of the coastal wetlands. This reinforces the conclusion that the fusion of optical and radar data is indeed able to provide valuable synergistic information. Figure 9 shows the RF and SVM classification results in the Yellow River Estuary, and Table 7 lists quantitative metrics of the classification accuracy. Similarly, the M2CF method had better classification results. The SVM classifier shows PRACS, the hybrid method, and the proposed method visualize better, which is more consistent with the classification results of the Yancheng wetlands.  Visually, the HS image (ZY-1 02D) of the Yellow River Estuary has blurred classification on salt marshes and culture ponds in SVM classification experiments. It is observed that for the concerned application, both optical (MS + HS) and other fusion images are promising as compared to the ZY-1 02D HS, except for GSA-Hysure and PRACS. With the improvement of spatial resolution, the ability to distinguish the details was enhanced. The classification results of CS-based methods (e.g., Brovey and GS) are quite fragmented, mainly because the fused images retain the image details and speckle noise of the SAR. PRACS has no significant ability to make a distinction between paddy fields and culture ponds in both sets of experiments. The IHS and GS are misclassifications of the Yellow River in both classifiers, distinctly. For CNMF and GSA-Hysure methods, they fail to classify the materials well due to their sensitivity to complex noise (Figure 9). For the M2CF and three hybrid methods, the classification of salt marshes and culture ponds is differentiated well, which has similar spectral characteristics to optical images because SAR images are sensitive to soil moisture and geometric structure. However, the MRA-based hybrid methods are prone to speckle misclassification in Pothole.

Classification Results of Yellow River Estuary
The quantitative metrics of classification accuracy show that GSA-Hysure has lower classification accuracy than HS alone. It illustrates that inappropriate fusion algorithms can even reduce classification accuracy ( Table 7). The classification accuracy of NLSTF and SAR is roughly the same as the optical fusion image (HS + MS). The best classification accuracies were achieved for the fused images obtained by the M2CF method, and the OA was improved by about 5% over the hyperspectral images alone. Again, proving the superiority of our proposed fusion method over other fusion methods. The classification experiments further illustrate that pixel-level fusion of SAR and optical images can improve classification accuracy, especially in the identification of water bodies, bare land, and vegetation. It indicates the importance of image fusion for coastal wetland mapping.

Classification Results of Hangzhou Bay
In general, the pixels are usually mixed on the low-resolution HS in Hangzhou Bay; it is observed that multi-modal image fusion improves the wetlands classification to some extent. Concretely, CNMF has mass misclassification of ocean and pothole, which pollutes the maps. Moreover, due to the noise interference, the performance of CS-based fusion image is not improved or even negative. The NLSTF is unsuitable for identifying the subspaces of data in complicated areas. Hybrid methods perform well with clearer classification maps, especially the textures between photovoltaic panels and artificial trenches. In RF Classifier, the accuracy of optical fusion (MS + HS) is higher than that of CNMF, SARF, and GSA-Hysure. The quality of image fusion is directly related to the accuracy and visualization of classification results. The classification accuracy is improved through the hybrid fusion methods (Table 8). In both RF and SVM classification images of M2CF, the boundaries of the features are well defined and have continuity of distribution ( Figure 10). To sum up, the proposed M2CF achieves superior performance for coastal wetlands mapping.

Discussion
The classification and identification of coastal wetlands have long been interesting but challenging research in remote sensing. The increasing availability of data brings rapid advancement in the fusion of optical and radar images [4,52,53]. Optical images, especially for HS data, can provide rich and continuous spectral information. Radar microwaves can distinguish differences in roughness and moisture, particularly for capturing continuous water surfaces. The increase in the number of hyperspectral image bands at this stage has brought great challenges to the fusion algorithm [54]. It is necessary to focus on the actual production application of the algorithm, striving to achieve a balance between effectiveness and efficiency. Moreover, the polarized backscatter of SAR is sensitive to the size, density, and dielectric constant of the vegetation. Therefore, an efficient cross-modal fusion of hyperspectral and SAR images is timely and critical for further research [55].
This research will fill a gap that focuses on developing a cross-modal, fast, and robust fusion method of the HS, MS, and SAR images for coastal wetlands mapping. The experiments confirm that the classification accuracy of cross-modal fusion images mostly exceeds those based on single HS or optical fusion (MS and HS) images. It is consistent with many fusion studies; that is, the appropriate fusion algorithm can improve the classification accuracy [17,20,45,56]. Besides the results presented in the experiments, the following points should be further explored. The CS fusion frameworks have advantages in high efficiency and spatial enhancement. This is the reason why CS-based methods are most widely used in SAR and optical image fusion. Hyperspectral super-resolution or unmixingbased methods are better compared to CS-based methods, but they have poor performance in cross-modal data. Hybrid methods are an available fusion option. Multi-resolution analysis can strike a balance between spectral fidelity and spatial enhancement. It avoids instability due to the uncertainty of a step-by-step framework. Among the experimental results, the proposed M2CF achieves the smallest spectral loss while obtaining the highest classification accuracy. Unfortunately, M2CF may not be suitable for fusion at large spatial resolution ratios or mountainous topography. It is mainly limited to radar shadows caused by foreshortening and layover. Joint classification using cross-modal RS data fusion for wetlands mapping is promising [2,14,16,30]. Recent trends suggest that research in crossmodal fusion is progressing towards deep learning [57]. Because of the nonlinear correlation and inherent uncertainties in data sources, the above fusion results may not be as excellent as iteratively optimized deep learning algorithms.
Taking Yancheng as an example, Table 9 reports more cases as the supplement to the ablation experiment of M2CF in Yancheng. As a multi-frequency extraction experiment, Gaussian and Low-pass filters are incorporated into the method ablation. M2CF is based on MRA models, while M2CF is composed of edge-preserving filters. It is observed that the SAM and RMSE of the proposed module will increase significantly when using Gaussian and Low-pass filters to extract multi-frequency information. Specifically, the spectral metrics of the fusion module are more distorted without the spectral compensation; it further illustrates that the components work better together. In this work, two filter parameters need to be set manually, namely the smoothing factor λ of Equation (2) and the guiding window ω of Equation (15). We applied controlled variable experiments to verify the sensitivity of the parameters. The factor of the WlsF smoothing term depends on the input image. When the gradient of the input image is relatively large, we want the constraint to be smaller to preserve the structural information of the image; when the gradient of the image is small, this detailed information is considered unimportant, and the constraint can naturally be larger. For the guided filter, each pixel is contained in multiple filter windows, and the window size is directly related to the edge-preserving of the output image. Figure 11 shows the correlation between SAM and parameter settings. For the experiment process, the guide window (GW) is 8, and the smoothing factor (SF) is 1.2, separately. In addition, current studies often lack ground-truth and benchmark datasets at larger spatial scales. Variations in spatial registration and radiation mismatch between the SAR and optical are also major challenges. Periodic tide levels make remote sensing of coastal wetlands still challenging, which also increases the variability of data fusion [58]. Therefore, progress in this field still requires improvements in more robust fusion techniques and systematic procedures to assess the benefits of fusion.

Conclusions
In this paper, a hyperspectral-multispectral-SAR image fusion algorithm, namely multi-modal MRA collaborative fusion (M2CF), is offered. The proposed model improves generalized MRA and allows homogeneous data to be simultaneously integrated with physically heterogeneous radar radiation. It not only utilizes spectral-spatial information of the optical images, but also injects geometric and polarimetric properties of SAR. Fusion yields steady visible benefits, achieving the minimum spectral loss with high PSNR. Compared to step-by-step fusion frameworks, M2CF is more adjustable and robust. The classification experiments also illustrate that M2CF fused images bring about +3.2% OA improvements compared to optical fused images. Finally, we discuss two parameter settings and the ablation experiment for reference. Optical and SAR image fusion still has great potential; in the future, we will develop fusion algorithms using deep learning or feature learning.
Author Contributions: Y.Y. and W.S. analyzed the data, performed the experiments, and wrote the draft of the manuscript; X.M. designed the framework of this study, gave comments, and significantly revised the manuscript; G.Y., L.W., J.P. and Y.W. gave comments and funding acquisition. All authors have read and agreed to the published version of the manuscript.