A Fusion Method of Optical Image and SAR Image Based on Dense-UGAN and Gram–Schmidt Transformation

Abstract: To solve problems such as obvious speckle noise and serious spectral distortion that arise when existing fusion methods are applied to optical and SAR images, this paper proposes a fusion method for optical and SAR images based on Dense-UGAN and the Gram–Schmidt transformation. Firstly, a dense connection with a U-shaped network (Dense-UGAN) is used in the GAN generator to deepen the network structure and obtain deeper source image information. Secondly, according to the particularity of the SAR imaging mechanism, an SGLCM loss for preserving SAR texture features and a PSNR loss for reducing SAR speckle noise are introduced into the generator loss function. Meanwhile, in order to keep more of the SAR image structure, an SSIM loss is introduced into the discriminator loss function so that the generated image retains more spatial features. In this way, the generated high-resolution image has both optical contour characteristics and SAR texture characteristics. Finally, the GS transformation of the optical and generated images retains the necessary spectral properties. Experimental results show that the proposed method preserves the spectral information of optical images and the texture information of SAR images well, while also reducing the generation of speckle noise, and its metrics are superior to those of other algorithms that currently perform well.


Introduction
With the development of remote sensing imaging technology, the advantages of various remote sensing images, such as resolution and readability, have been greatly improved. However, images from a single source inevitably encounter problems such as a single imaging mode and narrow applicable scenes, which make them difficult to utilize well [1]. Synthetic Aperture Radar (SAR) is an active side-looking radar system [2] that has high spatial resolution, can image all day and in all weather, and is sensitive to target ground objects, especially land, water, and buildings. SAR images contain rich texture characteristics and detailed information [3]. Optical imaging depends on the reflection of sunlight from the surface of ground objects; it directly reflects spectral information and contour features and has excellent visual effects. Therefore, fusing optical and SAR images can combine their effective information, so as to accurately depict the scene and display the ground features from multiple angles. This has important applications in military reconnaissance and target detection [4].
At present, fusion methods for optical and SAR images can be roughly classified into two categories: transform domain methods and spatial domain methods [5]. The transform domain method is an image fusion method based on traditional multi-scale transformation theory. Firstly, the source image is decomposed, then the decomposed sub-images are fused with appropriate fusion rules; finally, the sub-images are reconstructed to

GAN
In 2014, Goodfellow et al. [19] proposed an adversarial generation model based on a two-player zero-sum game. The original GAN consists of two parts: a generator G and a discriminator D. The generator is used to capture the data distribution and describe how the data is generated. The discriminator is used to distinguish the data generated by the generator from real data. The model is widely used in image generation, style transfer, data augmentation, and other fields. In this network, the input of the generator is random noise z; after being processed by the generator, the output data G(z) is input into the discriminator D for judgment, and D outputs a true-or-false judgment result, namely D(G(z)), which indicates the probability that G(z) is close to real data. When the output probability is close to 1, the generated data G(z) is close to real data; otherwise, G(z) is false data. Therefore, in the training process, the goal of the generator is to generate data as close to the real data as possible, while the discriminator tries to discriminate as accurately as possible that the data generated by the generator is fake. The generator and the discriminator constantly play a game. When the data generated by the generator can "pass as real", that is, it cannot be discriminated by the discriminator, the network reaches the "Nash equilibrium". The target loss function can be expressed as:

min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]

where E(·) represents the mathematical expectation of the distribution function; P_data represents the distribution of real data; z represents random noise, that is, the input of generator G; P_z represents the distribution of random noise z; P_g represents the data distribution generated by the generator; G represents the generator; and D stands for the discriminator.
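The minimax objective above can be sanity-checked numerically. Below is a minimal numpy sketch (the function name `gan_value` and the `eps` smoothing are our own additions, not part of the paper) that estimates V(D, G) from sampled discriminator outputs on real and generated data:

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of the GAN value function
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].
    d_real: discriminator outputs on real samples, in (0, 1).
    d_fake: discriminator outputs on generated samples, in (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # eps avoids log(0) for saturated discriminator outputs.
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
```

At the Nash equilibrium, D outputs 0.5 everywhere and the value is log(1/2) + log(1/2) ≈ −1.386; a perfect discriminator (D(x) = 1, D(G(z)) = 0) drives the value toward its maximum of 0.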

Gram-Schmidt (GS) Transformation
GS transform is a common method in linear algebra and multivariate statistics. It eliminates redundant information by orthogonalizing matrices or multidimensional images, similar to the PCA transform. Figure 1 is the flow chart of GS fusion. The method first combines the multi-spectral bands according to certain weights to obtain a gray image, which is regarded as GS 1 [20]. Then, GS 1 is used to perform the forward GS transformation with the multi-spectral bands. Next, the mean and standard deviation of GS 1 and the generated image are calculated, respectively, and histogram matching is performed on the generated image to simulate GS 1 [18]. Finally, the matched generated image is used to replace GS 1 in the inverse GS transformation, and the fused image is obtained.
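The steps above can be sketched in numpy. This is a simplified injection-gain formulation of GS pan-sharpening under our own assumptions (equal band weights by default, mean/std histogram matching, covariance-based injection gains); the paper's exact implementation may differ:

```python
import numpy as np

def gs_fuse(ms, pan, weights=None):
    """Simplified Gram-Schmidt pan-sharpening sketch.
    ms:  (B, H, W) multispectral bands.
    pan: (H, W) high-resolution (here: GAN-generated) image.
    Returns the fused (B, H, W) stack."""
    ms = ms.astype(float)
    pan = pan.astype(float)
    B = ms.shape[0]
    if weights is None:
        weights = np.full(B, 1.0 / B)
    # Step 1: simulate the low-resolution gray image GS1 as a weighted band sum.
    gs1 = np.tensordot(weights, ms, axes=1)
    # Step 2: match the pan image's mean/std to GS1 (histogram matching proxy).
    pan_m = (pan - pan.mean()) * (gs1.std() / pan.std()) + gs1.mean()
    # Steps 3-4: per-band injection gain g = cov(band, GS1)/var(GS1), then
    # replace GS1 by the matched pan and invert the transform.
    fused = np.empty_like(ms)
    var = gs1.var()
    for k in range(B):
        g = np.mean((ms[k] - ms[k].mean()) * (gs1 - gs1.mean())) / var
        fused[k] = ms[k] + g * (pan_m - gs1)
    return fused
```

A useful consistency check: if the pan image already equals the simulated GS1 band, the fusion leaves the multispectral bands unchanged.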


Overall Structure
The Dense-UGAN network structure proposed in this paper includes a generator and a discriminator. When training the network, the registered optical and SAR images are fed into the generation network in the form of one image with multiple channels. Then, the generated single image and the label images (optical image and SAR image) are introduced into the discrimination model to complete the two-classification task. Finally, the high-quality generated image is output. After the network training is completed, we only need to import the registered optical and SAR cascade images into the trained generator, and then perform the GS transformation on the output images and optical images to obtain the final fusion result. Figure 2 is the framework of the proposed fusion method.

Network Architecture of Generator
The purpose of the generator is to extract more and deeper image features to generate a new fused image. However, a traditional convolutional network or fully connected network inevitably loses or wastes information when transmitting it. At the same time, when there are too many layers, problems such as gradient vanishing or gradient explosion occur, making the network unable to train. Therefore, based on the traditional GAN model, this paper combines dense connections and a U-shaped network to reconstruct the generator network structure. As shown in Figure 3, the generator is composed of a dense connection network module, mainly consisting of four convolution layers, and a U-shaped network module, mainly consisting of six convolution layers and five deconvolution layers. The dense connection network establishes cross-layer connections between the earlier and later convolution layers, which improves network performance and potential through feature reuse, produces a compressed model that is easy to train and parameter-efficient, and mines deep features efficiently. In the U-shaped network module, the kernel sliding step size of the C2 and C4 convolutional layers is set to 2 to downsample the input, halving the size of the feature map accordingly. In the decoding process, two deconvolution modules are used to upsample and recover the feature image. In addition, the U-shaped network in this paper alternates 6 convolution layers with a step length of 1 and 4 convolution layers with a step length of 2, which reduces the compression of the feature map and preserves the feature completeness of the supplied image.
Each convolution layer and deconvolution layer in the generator has a BatchNorm layer [21] and the activation function LeakyRelu [22]. The last layer of the deconvolution layer uses Tanh as the activation function to normalize the results to the interval of (−1,1), thus realizing the output normalization.
For the feature map obtained from each layer in the U-shaped network decoder, we used skip connection, which is equivalent to sequentially introducing the feature maps of the original images with the same resolution and containing intuitive low-level semantic information during the upsampling process. The feature map is superimposed with the feature map obtained by upsampling and then convolved to perform cross-channel information integration, which can help the decoder part of the network recover the image information with the lowest cost.


Network Architecture of Discriminator
The purpose of the discriminator is to distinguish whether the target image is an image generated by the generator or a real image, and then classify the target image by feature extraction. Its network structure is shown in Figure 4. It can be seen that the discriminator is a 5-layer convolutional neural network, which is a 3 × 3 filter from the first layer to the fourth layer, and the stride is set to 2. The first four layers all use BatchNorm to normalize the data, and use LeakyRelu as the activation function. The last layer is a linear layer for classification.


Loss Function
Loss function can be used to measure the gap between network output results and data labels. The loss function in this method consists of two parts: the loss function L G of the generator and the loss function L D of the discriminator; the ultimate goal is to minimize the loss function and obtain the best training model.

Generator Loss Function
In the fusion process of optical and SAR images, it is desirable to preserve the contour information of the optical image and the texture details of the SAR image. Therefore, the loss function of the generator is mainly considered in four parts, which can be expressed as:

L_G = L_GAN(G) + λ·L_L1(G) + µ·L_SGLCM(G) + η·L_PSNR(G)

in which L_GAN(G) is the adversarial loss, L_L1(G) is the content loss, L_SGLCM(G) is the texture feature loss, and L_PSNR(G) is the peak signal-to-noise ratio loss, which will be described in detail below. λ, µ, and η are the weight coefficients for balancing the four loss terms.
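The composition of L_G is a plain weighted sum; a trivial sketch follows, with the default weights taken from the parameter settings reported later in the paper (λ = 100, µ = 2000, η = 100) — the function name is our own:

```python
def generator_loss(l_gan, l_l1, l_sglcm, l_psnr, lam=100.0, mu=2000.0, eta=100.0):
    """L_G = L_GAN(G) + lam*L_L1(G) + mu*L_SGLCM(G) + eta*L_PSNR(G).
    Default weights follow the training configuration in Section 4."""
    return l_gan + lam * l_l1 + mu * l_sglcm + eta * l_psnr
```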

• Adversarial loss
L_GAN(G) is the adversarial loss between the generator and the discriminator, which can be expressed as:

L_GAN(G) = (1/N) Σ_{n=1}^{N} (D_θd(I_f^n) − c)²

where I_f represents the generated image, N represents the number of fused images, D_θd(I_f^n) represents the result of classification, and c is the value that the generator wants the discriminator to believe for fake data.

• Content loss
The luminance information of the optical image is characterized by its pixel intensity, while the texture detail information of the SAR image can be partially characterized by gradients. Therefore, to obtain a fused image with intensity similar to the optical image and gradients similar to the SAR image, we use L_L1(G) to express the content loss of the image in the generation process:

L_L1(G) = (1/(H·W)) ( ‖I_f − I_v‖²_F + ξ‖∇I_f − ∇I_s‖²_F )

where H and W represent the height and width of the input image, respectively, ‖·‖_F represents the Frobenius norm of a matrix, ∇ represents the gradient operator, and ξ controls the weight between the two terms [12].
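A minimal numpy sketch of this content loss follows. The forward-difference gradient operator here is an assumption (the paper does not specify the exact operator), and the function names are our own:

```python
import numpy as np

def grad(img):
    """Forward-difference gradients (an assumed choice of gradient operator).
    Last row/column are padded by replication so shapes match the input."""
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return gx, gy

def content_loss(fused, opt, sar, xi=5.0):
    """L_L1 = (1/(H*W)) * ( ||I_f - I_v||_F^2 + xi * ||grad I_f - grad I_s||_F^2 ).
    xi defaults to 5, the value used in the paper's training setup."""
    h, w = fused.shape
    intensity = np.sum((fused - opt) ** 2)          # match optical intensity
    fx, fy = grad(fused)
    sx, sy = grad(sar)
    gradient = np.sum((fx - sx) ** 2 + (fy - sy) ** 2)  # match SAR gradients
    return (intensity + xi * gradient) / (h * w)
```

Note that shifting the fused image by a constant leaves the gradient term at zero, so only the intensity term penalizes it, as intended.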
• Texture feature loss
SAR images do not contain spectral color information, so texture feature analysis is particularly important. The texture essence of a SAR image is a phenomenon in which specific gray levels appear repeatedly at spatial positions. The gray level correlation between two pixels at a certain distance in image space represents the correlation characteristics of image texture. The gray level co-occurrence matrix is defined as the joint distribution probability of pixel pairs, which reflects the relevant indexes of the image by counting the frequency distribution of two gray levels in a specified spatial distribution. It not only reflects the comprehensive information of the image gray level in adjacent directions, adjacent intervals, and change amplitude, but also reflects the position distribution characteristics among pixels with the same gray level. It is the basis of calculating texture features. Therefore, in order to make the generated image and the supplied SAR image have similar texture features, this paper introduces the L1 norm of the gray level co-occurrence matrices of the generated image and the SAR image as a measure of texture similarity:

L_SGLCM(G) = ‖GLCM(I_f) − GLCM(I_s)‖_1

where GLCM(·) represents the gray level co-occurrence matrix of the image. In addition, contrast measures the distribution of matrix values and the amount of local change in the image, reflecting the clarity of the image and the depth of the texture; energy measures the local variation of image texture, reflecting the degree of dispersion between image pixel values and the mean value; homogeneity measures the similarity of image gray levels in the row and column directions, reflecting the local gray correlation. Therefore, in order to make full use of the texture features of SAR images, this method mainly introduces four texture features: contrast, energy, variance, and homogeneity.
• Peak signal-to-noise ratio loss
When fusing optical and SAR images while trying to minimize the loss of texture details and edges of the SAR image, it is easy to produce speckle noise. The peak signal-to-noise ratio is based on the error between corresponding pixels and can be used to measure the noise level in an image. Therefore, to make the generated image contain less noise and reduce image distortion, this paper introduces the peak signal-to-noise ratio loss to improve image quality. The final calculation formula is

PSNR = 10 · log10(MAX² / MSE)

where MSE is the mean square deviation between the two images, and MAX represents the maximum value of the image point color, which is 255 for 8-bit sampling points.
For the weight between PSNR(I_f, I_v) and PSNR(I_f, I_s), we use the pixel normalization method. We locate the pixel points v and s in the pixel histograms of the optical and SAR images, where the area difference around the highest pixel intensity value in each histogram is 0.5 (frequent and continuous), and the weight ratio between them is obtained. The specific results are shown in Figure 5.
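The two ingredients of these losses can be sketched in numpy. The single (0, 1) displacement and the 8-level quantization in the GLCM are our own simplifying assumptions (the paper does not state its displacement set), and the `L_SGLCM` form follows the L1-norm definition above:

```python
import numpy as np

def glcm(img, levels=8):
    """Normalized gray-level co-occurrence matrix for the (0, 1) displacement.
    img: 2-D array with values in [0, 255]; quantized to `levels` gray levels."""
    q = np.clip((np.asarray(img, float) * levels / 256.0).astype(int), 0, levels - 1)
    m = np.zeros((levels, levels))
    # Count horizontally adjacent gray-level pairs.
    np.add.at(m, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1.0)
    return m / m.sum()

def sglcm_loss(fused, sar, levels=8):
    """L_SGLCM: L1 distance between the co-occurrence matrices of the two images."""
    return float(np.abs(glcm(fused, levels) - glcm(sar, levels)).sum())

def psnr(x, y, max_val=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE), with MSE the mean square deviation."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Identical images give a zero GLCM distance, and two images differing by the full dynamic range (255 at every pixel) give a PSNR of 0 dB.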


Discriminator Loss Function
In practice, even without a discriminator, a fused image containing some information from the optical and SAR images can be obtained with this method; however, the result is not particularly good. Therefore, in order to improve the image generated by the generator, we introduce the discriminator and establish an adversarial game between the generator and discriminator to adjust the generated image. Formally, the loss function of the discriminator includes two parts: one is the adversarial loss L_GAN(D) between the generator and discriminator; the other is the structural similarity loss L_SSIM(D), which will be described in detail below. This can be expressed as:

L_D = L_GAN(D) + δ·L_SSIM(D)

where δ is the weight coefficient.
• Adversarial loss
The adversarial loss of the discriminator can be expressed as:

L_GAN(D) = (1/N) Σ_{n=1}^{N} (D_θd(I_v^n) − b)² + (1/N) Σ_{n=1}^{N} (D_θd(I_f^n) − a)²

where a and b respectively represent the labels of the generated image I_f and the optical image I_v, and D_θd(I_f) and D_θd(I_v), respectively, represent the classification results of the generated image and the optical image.
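A minimal numpy sketch of this least-squares discriminator loss follows (the function name is our own, and the least-squares form is assumed from the label-based formulation above; δ defaults to the paper's 0.1):

```python
import numpy as np

def discriminator_loss(d_fused, d_opt, a, b, l_ssim, delta=0.1):
    """L_D = mean((D(I_v) - b)^2) + mean((D(I_f) - a)^2) + delta * L_SSIM(D).
    a, b are the soft labels for generated and optical images."""
    adv = (np.mean((np.asarray(d_opt, float) - b) ** 2)
           + np.mean((np.asarray(d_fused, float) - a) ** 2))
    return adv + delta * l_ssim
```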

• SSIM loss
When eyes observe an image, they actually extract the structural information of the image, not the error between image pixels [23]. The peak signal-to-noise ratio loss function improves image quality based on error sensitivity and does not take into account the visual characteristics of the human eye. Structural similarity is an evaluation criterion based on structural information to measure the degree of similarity between images, which can overcome the influence of texture changes caused by illumination changes and is more consistent with human subjective visual perception. By calculating the structural similarity loss between the generated image and the SAR image, the generated image can retain richer texture features of the SAR image and generate edge details consistent with the human visual system:

L_SSIM(D) = 1 − SSIM(I_f, I_s)

where SSIM(I_f, I_s) represents the structural similarity (SSIM) index of image blocks in the generated image and SAR image, which can be calculated as:

SSIM(x, y) = [(2µ_x µ_y + C_1)(2σ_xy + C_2)] / [(µ_x² + µ_y² + C_1)(σ_x² + σ_y² + C_2)]

In the formula, µ_x and µ_y represent the average gray levels of images x and y, σ_x² and σ_y² represent their variances, and σ_xy represents the covariance between image x and image y; C_1 and C_2 are non-zero constants introduced to avoid system instability when µ_x² + µ_y² or σ_x² + σ_y² is close to 0. The value range of the SSIM function is [0,1]. The larger the value, the smaller the image distortion and the more similar the two images are.
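A global-window numpy sketch of the SSIM index follows (the paper computes it per image block; treating the whole image as one window is our simplification, and the constants use the conventional k1 = 0.01, k2 = 0.03):

```python
import numpy as np

def ssim(x, y, L=255.0, k1=0.01, k2=0.03):
    """SSIM(x, y) over a single global window.
    L is the dynamic range; C1 = (k1*L)^2, C2 = (k2*L)^2 stabilize the ratio."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()  # covariance
    return (((2 * mx * my + c1) * (2 * cxy + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

An image compared with itself yields SSIM = 1; any structural deviation lowers the index.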

Dataset and Parameter Settings
The research site is located in Nanjing, Jiangsu Province and its vicinity. The SAR images were collected by Canada's RADARSAT-2 satellite with a resolution of 5 m on 11 April 2017. The optical images are several 5 m-resolution images taken by Germany's RapidEye satellite in April 2017.
First, we randomly select 60 pairs of optical and SAR images with a resolution of 256 × 256 from the dataset as the experimental training set to train the network. In order to get a better model, we set the sliding window step to 14 to clip each image [13], fill the cut sub-blocks to a size of 132 × 132, and then input them into the generator. After that, the size of the generated image is 128 × 128. Next, we introduce the generated image and the optical and SAR pairs into the discriminator and use the Adam optimizer [24] to continuously improve network performance until the maximum number of training iterations is reached. Finally, we select another four pairs of images in the dataset for qualitative and quantitative analysis.
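The sliding-window clipping step can be sketched as follows. We assume 128 × 128 crops zero-padded by 2 pixels per side to reach the 132 × 132 generator input; the paper does not state the pre-padding crop size, so this is illustrative:

```python
import numpy as np

def extract_patches(img, patch=128, stride=14, pad_to=132):
    """Slide a patch x patch window with the given stride over a 2-D image and
    zero-pad each crop to pad_to x pad_to (the assumed generator input size)."""
    h, w = img.shape[:2]
    p = (pad_to - patch) // 2
    out = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            crop = img[top:top + patch, left:left + patch]
            out.append(np.pad(crop, ((p, p), (p, p))))
    return np.stack(out)
```

On a 256 × 256 training image this yields 10 window positions per axis, i.e., 100 padded sub-blocks.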
Our training parameters are set as follows: the batch size is set to 64, the number of training iterations is set to 10, and the training step k of the discriminator is set to 2. λ is set to 100, η is set to 100, µ is set to 2000, δ is set to 0.1 (the parameter settings will be discussed later), ξ is set to 5, and the learning rate is set to 10^(−4). Label a of the generated image is a random number ranging from 0 to 0.3, label b of the optical image is a random number ranging from 0.7 to 1.2, and label c is also a random number ranging from 0.7 to 1.2. Because labels a, b, and c are not specific numbers, they are called soft labels [25].

Valuable Metrics
To avoid the inaccuracy of subjective evaluation, we use several objective measures to calculate the corresponding values of the fused images, such as information entropy [26], average gradient [27], peak signal-to-noise ratio [28], structural similarity [29], spatial frequency [30], and spectral distortion. These evaluation indexes assess the fused image in terms of energy, spectrum, texture, and contour, reflecting the quality of the fused image with specific values.

• Entropy (EN)
The entropy of the image reflects the amount of information contained in the image; the greater the entropy, the better the image fusion effect:

EN = −Σ_{i=0}^{L−1} p_i log₂ p_i

where p_i is the probability of the i-th grayscale value and L represents the number of gray levels in the image.
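A direct numpy computation of this index from the gray-level histogram (function name ours):

```python
import numpy as np

def entropy(img, levels=256):
    """EN = -sum_i p_i * log2(p_i) over the gray-level histogram of an integer image."""
    hist = np.bincount(img.ravel().astype(np.int64), minlength=levels)[:levels]
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

A constant image carries no information (EN = 0), while two equally frequent gray levels give exactly 1 bit.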

• Average Gradient (AG)
Assuming that the size of the image is M × N and G(m, n) represents the gray value of the image at point (m, n), the value of AG reflects the image's ability to express local details; the larger the value, the clearer the image:

AG = (1/((M−1)(N−1))) Σ_m Σ_n sqrt( ( (∂G/∂m)² + (∂G/∂n)² ) / 2 )

• Spatial Frequency (SF)
SF can be used to detect the total activity of the fused image in the spatial domain and indicates the ability to contrast small details. The larger SF is, the richer the edges and textures of the fused image:

SF = sqrt( RF² + CF² )

where RF² = (1/(MN)) Σ_m Σ_n [G(m, n) − G(m, n−1)]² represents the row frequency and CF² = (1/(MN)) Σ_m Σ_n [G(m, n) − G(m−1, n)]² represents the column frequency.
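Both indexes reduce to simple first differences; a numpy sketch (function names ours, with differences aligned on a common interior grid for AG):

```python
import numpy as np

def average_gradient(img):
    """AG: mean of sqrt((dx^2 + dy^2) / 2) over the (M-1) x (N-1) interior grid."""
    g = img.astype(float)
    dx = g[:, 1:] - g[:, :-1]          # horizontal first differences
    dy = g[1:, :] - g[:-1, :]          # vertical first differences
    dx, dy = dx[:-1, :], dy[:, :-1]    # align to a common (M-1) x (N-1) grid
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2) with row/column mean-squared first differences."""
    g = img.astype(float)
    rf2 = np.mean((g[:, 1:] - g[:, :-1]) ** 2)  # row frequency squared
    cf2 = np.mean((g[1:, :] - g[:-1, :]) ** 2)  # column frequency squared
    return float(np.sqrt(rf2 + cf2))
```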

• Spectral Distortion (SD)
Spectral distortion mainly reflects the loss of spectral information between the fused image and the source image.
Because the spectral characteristics of optical images are more consistent with the visual observation of human eyes on ground objects in remote sensing images, the spectral distortion in this paper is calculated between fused images and optical images. The smaller the SD, the better the spectral features remain.
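A common realization of spectral distortion is the mean absolute difference between the fused and optical images; the paper does not reproduce its exact formula, so the sketch below is an assumption consistent with the description above:

```python
import numpy as np

def spectral_distortion(fused, opt):
    """SD: mean absolute pixel difference between fused and optical images.
    Smaller values indicate better-preserved spectral features."""
    return float(np.mean(np.abs(fused.astype(float) - opt.astype(float))))
```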

Results and Analysis
In this experiment, we compare the fusion performance of different methods on images of different scenes through subjective and objective evaluation. Figure 6 shows the selected images. These images come from optical and SAR image pairs of different scenes in the dataset, including land, water, and buildings, which are the scenes mainly considered in the process of image fusion.


In order to avoid the problems of gradient disappearance and gradient explosion in the GAN, this paper uses the Dense-UGAN network as the main structure of the generator to realize image feature extraction. Additionally, we compare the fusion results with generative adversarial networks based on DCGAN, U-Net, and skip connection [31] to illustrate the effectiveness of Dense-UGAN in the fusion of optical and SAR images. The original GAN loss function is used for training [13], and the results are shown in Table 1. It can be seen from Table 1 that no matter which network is used for image fusion, the final result is better than the original GAN network; that is, all objective evaluation parameters are generally improved. Secondly, when the Dense-UGAN network is used as the main structure of the generator for image fusion, the EN, AG, SSIM, and SF increase by 7.13%, 43.77%, 0.62%, and 67.79%, respectively, compared with the original GAN. Among them, the SF is 13.636, which is 26.69% higher than SC-GAN, indicating that the proposed structure achieves better fusion performance than other excellent networks.
Therefore, we combine the generative adversarial network based on Dense-UGAN and Gram-Schmidt transform to achieve optical and SAR image fusion.
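The Gram-Schmidt step substitutes the high-resolution generated image for the simulated low-resolution component and injects the resulting detail into every optical band. The following is a minimal NumPy sketch of this substitution idea, assuming band-average simulation of the low-resolution component and mean/std histogram matching; the exact GS variant used in the paper may differ:

```python
import numpy as np

def gs_fusion(ms, pan):
    """Gram-Schmidt-style fusion sketch.
    ms:  (H, W, B) optical/multispectral bands, float
    pan: (H, W) high-resolution generated image, float
    Returns an (H, W, B) fused image."""
    H, W, B = ms.shape
    # Step 1: simulate the low-resolution pan component as the band average.
    gs0 = ms.mean(axis=2).ravel()
    bands = ms.reshape(-1, B).T                      # (B, H*W)
    # Step 2: regression coefficient of each band on the simulated component.
    g = gs0 - gs0.mean()
    coeffs = np.array([np.dot(b - b.mean(), g) / np.dot(g, g) for b in bands])
    # Step 3: histogram-match the generated image to the simulated component.
    p = pan.ravel()
    p = (p - p.mean()) / (p.std() + 1e-12) * gs0.std() + gs0.mean()
    # Step 4: substitute and inject the detail into every band.
    detail = p - gs0
    fused = bands + coeffs[:, None] * detail
    return fused.T.reshape(H, W, B)
```

When the generated image carries no extra detail beyond the band average, the injection term vanishes and the optical bands pass through unchanged, which is how the spectral properties are preserved.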

Loss Function Analysis
In this part, we first evaluate the fusion effect of networks with different loss functions. Then, the weight parameters λ, µ, η, and δ in the loss function of generator and discriminator are discussed, so as to fine-tune the model to the best setting.
The second row of Table 2 gives the experimental results of the Dense-UGAN network using the original loss function in [13]. We use it as the baseline for ablation experiments against the results of introducing different loss terms. It can be seen that, compared with the original loss function, after introducing the texture feature loss L_SGLCM(G) into the generator loss function, the objective evaluation indicators EN and STD increase by 5.88% and 34.77%, respectively, indicating that the texture feature loss helps improve the performance of the image in local details. Additionally, after introducing the structural similarity loss L_SSIM(D) into the discriminator loss function, several texture feature indexes also improve. Finally, comparing the complete loss function of this article with the original loss function, EN, STD, PSNR, SSIM, and SD increase by 5.15%, 25.41%, 2.06%, 31.64%, and 49.9%, respectively. This shows that the loss terms proposed in this paper are effective for the optical and SAR image fusion task and urge the fused image to contain more spectral and spatial characteristics.
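The metric-derived loss terms all follow the same pattern: compute a quality metric and negate it (or subtract it from one) so that minimizing the loss maximizes the metric. As a hedged illustration, a plain-NumPy PSNR and the corresponding L_PSNR-style term; this is an illustrative sketch, not the paper's training implementation:

```python
import numpy as np

def psnr(ref, img, data_range=255.0):
    """Peak signal-to-noise ratio between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

def psnr_loss(fused, sar, data_range=255.0):
    """L_PSNR-style term: maximizing PSNR equals minimizing its negative."""
    return -psnr(sar, fused, data_range)
```

The SGLCM and SSIM terms are built the same way, with a gray-level co-occurrence texture statistic and the structural similarity index in place of the PSNR.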
For the discussion of weight parameters, there are four parameters, and coupling might exist between the different loss terms, so the strategy is to add the loss terms one by one in order of magnitude [32]. Firstly, we fix the weight λ of the content loss L_L1 in the generator loss function as 100 [33], then determine the weight parameter η of the peak signal-to-noise ratio loss L_PSNR, and finally determine the weight µ of the texture feature loss L_SGLCM. Similarly, the weight coefficient δ of the structural similarity loss L_SSIM in the discriminator loss function is discussed in the same way. To quantitatively evaluate the results, we use the average value of each objective evaluation index over the four selected groups of source image pairs to compare different weight models. The experimental results are shown in Figure 7. It can be seen from the experimental results that when the weight parameter η of L_PSNR is set to 1 (×100), the weight µ of L_SGLCM is set to 20 (×100), and the weight coefficient δ of L_SSIM is set to 0.1, the objective index results of the fused image are the best overall, and the amount of information is the largest.
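The one-at-a-time tuning strategy can be sketched as a small helper. Here `train_and_score` is a hypothetical callback standing in for training the model with a given weight setting and returning a scalar quality score (higher is better); it is not part of the paper's code:

```python
def tune_sequentially(train_and_score, defaults, search_space):
    """Fix all weights at their defaults, then tune them one at a time
    in the given order, keeping the best value found for each weight
    before moving on to the next (the 'add loss terms one by one'
    strategy described above)."""
    weights = dict(defaults)
    for name, candidates in search_space:      # ordered (name, values) pairs
        best_val, best_score = weights[name], float("-inf")
        for v in candidates:
            trial = dict(weights, **{name: v})
            s = train_and_score(trial)
            if s > best_score:
                best_val, best_score = v, s
        weights[name] = best_val               # freeze before the next stage
    return weights
```

This greedy procedure avoids a full grid search over all four weights, at the cost of ignoring any residual coupling between the loss terms.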

Different Algorithms Comparison
To effectively evaluate the proposed optical and SAR image fusion method, this paper compares it with five other representative image fusion methods: the multi-scale weighted gradient fusion method (MWGF) [34], the wavelet transform-based fusion method (DWT) [35], the fast filter-based fusion method (FFIF) [36], the non-subsampled contourlet transform domain-based fusion method (NSCT) [18], and the Fusion-GAN fusion method [13]. Among them, MWGF and FFIF belong to the spatial domain, DWT and NSCT are representative transform-domain methods (the fusion rule adopted for NSCT is "Select-Max"), and Fusion-GAN is a deep-learning-based method. Different methods have different fusion effects. The results of the four scenes selected in the experiment are shown in Figure 8.

From the subjective fusion results, it can be seen that the fused images obtained by the FFIF, DWT, and MWGF methods inherit the spatial characteristics of the SAR images but do not inherit the spectral characteristics of the optical images well. The fused image using the NSCT transform inherits the spectral features of the optical image and retains the spatial features of the SAR image, but it contains more speckle noise. The fused image obtained by the original GAN method is not suitable for normal human visual perception because only the lightness component of the optical image is considered in the fusion process.
However, the fusion results obtained by the method in this paper inherit the spectral features well: the gap between the fusion results and the optical images is smaller, the image definition is higher, and less texture detail and contour information is lost.
In addition, in order to further compare the detail performance of the fused image and the optical image, we intercepted an area from the experimental results that includes water and residential regions. Then, we used the Canny algorithm to detect the edges of the optical image and the fused image. The experimental results are shown in Figure 9. As shown in Figure 9, the edge of the middle building is well reflected in the fusion results, and other regions also show more texture details, which means that the proposed method performs well in preserving the details of the source images, and the fused image contains more content.
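Edge-based comparison of this kind can be reproduced with any edge detector. As a self-contained stand-in for the Canny detector used in Figure 9, the following sketch thresholds the Sobel gradient magnitude (omitting Canny's non-maximum suppression and hysteresis steps); it is an illustration, not the paper's exact procedure:

```python
import numpy as np

def edge_map(img, thresh=50.0):
    """Binary edge map from thresholded Sobel gradient magnitude,
    a simplified stand-in for the Canny detector."""
    img = img.astype(np.float64)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)   # horizontal Sobel response
            gy[i, j] = np.sum(patch * ky)   # vertical Sobel response
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return mag > thresh
```

Counting edge pixels in the optical and fused maps gives a rough quantitative check of how much contour and texture detail the fusion retains.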
Further processing the fusion results of the Group1 images under the different methods yields the objective evaluation results shown in Table 3. It can be seen from Table 3 that the performance of this method is better than the other methods regarding AG, PSNR, SSIM, SF, and SD. For AG and SF, compared with the MWGF method, which performs best among the others, it is improved by 45.2% and 30.42%, respectively. Additionally, compared with the NSCT method, which performs best among the others for PSNR, SSIM, and SD, it is improved by 1.74%, 11.59%, and 34.83%, respectively. In a word, although the proposed method cannot achieve the best result in every index, the spectral distortion of the fused image is improved, and the objective indexes perform well.
Moreover, we have also conducted fusion experiments on the other three groups of original image pairs in the selected dataset. Figure 10 shows line charts of the objective results. From the objective data, it can be seen that the method proposed in this paper can be well applied in heterogeneous image fusion.

Discussion and Conclusions
Firstly, this paper presents the theory of generative adversarial network and Gram-Schmidt transform, then introduces the Dense-U network into the GAN generator to obtain deeper semantic information and comprehensive features of optical and SAR images. At the same time, the loss function of the generative adversarial network is constructed. The PSNR and SGLCM loss are introduced into the generator loss function, and the SSIM loss is introduced into the discriminator to optimize the network parameters and obtain the best network model. Finally, the cascaded source image pairs are input into the trained generator to obtain a generated image, and the generated image is GS-transformed with the optical image to obtain the final fusion result. The experimental results show that the fusion image obtained by this method can well retain spectral characteristics of the optical image and texture details of the SAR image, while reducing the generation of coherent speckle noise, and can be well applied in the pixel-level fusion of heterogeneous images.