Edge-Preserving Convolutional Generative Adversarial Networks for SAR-to-Optical Image Translation

Abstract: With the ability for all-day, all-weather acquisition, synthetic aperture radar (SAR) remote sensing is an important technique in modern Earth observation. However, the interpretation of SAR images is highly challenging, even for well-trained experts, because of the SAR imaging principle and high-frequency speckle noise. Image-to-image translation methods have been used to convert SAR images into optical images that are closer to what we perceive with our eyes. These methods have two weaknesses: (1) they are not designed for the SAR-to-optical translation task and therefore overlook the complexity of SAR images and the speckle noise; (2) a standard convolution layer applies the same convolution filters to the whole feature map, which ignores the details of the SAR image in each window and yields images of unsatisfactory quality. In this paper, we propose an edge-preserving convolutional generative adversarial network (EPCGAN) to enhance the structure and aesthetics of the output image by leveraging the edge information of the SAR image and implementing content-adaptive convolution. The proposed edge-preserving convolution (EPC) decomposes the content of the convolution input into texture components and content components and then generates a content-adaptive kernel to modify standard convolutional filter weights for the content components. Based on the EPC, the EPCGAN is presented for SAR-to-optical image translation. It uses a gradient branch to assist in the recovery of structural image information. Experiments on the SEN1-2 dataset demonstrate that the proposed method outperforms other SAR-to-optical methods by recovering more structures and yielding superior evaluation indexes.

operator and is regarded as the texture component. In fact, we can also extract other information from the image, such as the texture features and the curvature. EPC decomposes $X_c$ into a content component and a texture component based on the obtained texture component, and then convolution is performed on the content component $\hat{X}_c$. The convolution kernel $w$ is modified by the edge-preserving kernel $k$ that is generated from each window in the content component $\hat{X}_c$; ⊗ denotes the convolution operation.


Introduction
With the continuous development of remote sensing technology, optical remote sensing data and synthetic aperture radar (SAR) remote sensing data have been widely leveraged in disaster monitoring, environmental monitoring, resource exploration, agricultural planning, etc. [1][2][3][4]. Optical remote sensing images are closer to what we observe with the naked eye and contain rich spectral information, but their capture depends heavily on the clarity of the environment. Heavy clouds and bad weather seriously reduce the quality of optical remote sensing images, and light conditions limit observation times, resulting in limited use of optical remote sensing data [5]. By relying on microwave-band electromagnetic waves, SAR can acquire data in all weather and all light conditions. However, the interpretation of SAR images is a difficult task for people without professional training. It is therefore necessary to leverage technical means to increase the readability of SAR images. In the past few decades, methods have been proposed to enhance the readability of SAR images based on the ideas of image enhancement and image colorization. SAR image enhancement aims to make the target in an SAR image more obvious through processing [10][11][12]. Odegard et al. [13] presented a method to reduce the speckle noise in SAR images based on the wavelet transform, but it may increase the amount of natural clutter. An adaptive processing method that combines filtering, histogram truncation and equalization steps was developed in [14]; an example application, the generation of a flood image, demonstrated the validity of the method. SAR image colorization tries to make SAR images resemble optical imagery by encoding the pixels in the SAR images [15][16][17]. These methods are mainly for single-pol SAR images, as single-pol SAR images are single-channel images that are visually close to grayscale images. Image colorization is a process of entropy increase that strongly depends on the chosen model; therefore, performance degradation may occur in actual use. The SAR images processed by the above methods have improved visual features and perform better in feature detection and recognition. However, unlike optical remote sensing images, these processed images are only suitable for expert recognition, and untrained people still cannot recognize the features in the images [18].
Deep learning is a field of machine learning that handles complex tasks by building neural network models, and it has developed rapidly in recent years with improvements in computing power [19]. Deep learning can be used for image-to-image translation tasks, which are regression tasks [20][21][22][23]. Several methods for the SAR-to-optical image translation task have been presented. These convert the more readily available SAR images into optical images that are more compatible with human visual perception [5,18,24,25,26,27,28,29]. They are mainly based on generative adversarial networks (GANs), as GANs can produce images in line with real data distributions even when there is a big difference between the SAR image and the optical image. These methods can generate grayscale or RGB optical images from SAR images by slightly adjusting a network built for general image-to-image translation. Because they often do not take into account the special nature of this conversion and the network structure is not specifically made for SAR images, the optical images obtained often lose the structural information in the SAR images and may contain conversion errors. Some work has been performed to improve them. A feature-guided method combined with a loss function based on the discrete cosine transform (DCT) was developed in [30]. Zhang et al. [31] focused on the influences of edge information and polarization on the recovery process of SAR-to-optical image translation. However, the network structure in these works is still not specially designed for SAR-to-optical image translation. In addition, in a standard convolution layer the whole feature map is convolved with the same convolution filter, a design that reduces the parameters and complexity of neural networks. As a result, the details and structural information of SAR images are ignored: the content of each window is different, but the filter is the same. It can also be understood that the parameters of the convolution kernel are globally optimal in an ideal situation, but are only sub-optimal for the contents of each window. This can degrade the quality of the generated image, especially in a difficult task such as SAR-to-optical image translation. Some methods try to predict convolutional filter weights at each pixel with a separate sub-network [32,33], but they increase the number of parameters, leading to more memory usage, longer training time and a need for correspondingly labeled datasets.
In this paper, we propose an edge-preserving convolutional generative adversarial network (EPCGAN) to enhance the structural information and visual clarity in the generated optical image. Inspired by decomposition theory utilized in traditional image enhancement methods and the pixel-adaptive convolution (PAC) [34], edge-preserving convolution (EPC) is proposed to perform content-adaptive convolution on feature maps while preserving the structural information. We first decompose the content of the feature map based on structural information extraction, and then perform content-adaptive convolution on the obtained content components, which combines the decomposition theory of traditional reinforcement methods with deep learning theory. The filter weight in the content-adaptive convolution is obtained by multiplying the weights of standard convolution kernel and the weights of edge-preserving kernel generated from the content component in each sliding window. Combined with the proposed EPC, EPCGAN, which has a gradient branch to assist the recovery of structural information, is proposed for the SAR-to-optical translation task. The gradient branch continuously receives the content information from the backbone network to simulate the gradient of a real optical image and finally feeds back the gradient information to the backbone network to assist in the image generation, which aims to make full use of the structural information in the SAR image for the SAR-to-optical image translation. In order to verify the effectiveness of our proposed edge-preserving convolution and edge-preserving convolutional generative adversarial networks, we conducted comparative experiments and ablation studies on the SEN1-2 dataset. Experimental results prove that our proposed method can obtain better visual properties with more defined texture and better evaluation indexes than other methods for SAR-to-optical image translation.
Specifically, the major contributions of this paper are as follows:
1. Edge-preserving convolution (EPC) is proposed for SAR-to-optical image translation. It performs content-adaptive convolution on a feature map while preserving structural information according to decomposition theory, leading to good structure in the generated optical images.
2. For the situations in which SAR image interpretation is difficult, a novel edge-preserving convolutional generative adversarial network (EPCGAN) for SAR-to-optical image translation is proposed, which can improve the quality of the structural information in the generated optical image by utilizing the gradient information of the SAR image and the optical image as a constraint.
3. Experiments on a training set selected from the SEN1-2 dataset [35], containing multi-modal data (forests, rivers, waters, plains, mountains, etc.), demonstrate the superiority of the proposed algorithm. Ablation studies are also given.
The organization of the remainder of the paper is as follows. Section 2 gives a comprehensive review of related methods. The proposed edge-preserving convolutional generative adversarial network for SAR-to-optical image translation is introduced in Section 3. We present the experiment results on the SEN1-2 dataset in Section 4 and comprehensive analyses in Section 5. Finally, the conclusions are illustrated in Section 6.

Image-to-Image Translation
Image-to-image translation refers to the conversion of an image into another type of image, and it has become one of the top research topics in deep learning. Examples of translation include converting sketches to real pictures and realistic images to anime images [36][37][38][39][40]. Calculating the loss only through a content loss function, such as the L1-norm or L2-norm loss, leads to outputs with poor visual quality, which limited the results of early image-to-image translation work. Generative adversarial networks have been widely applied in image-to-image translation, since the generator in a GAN can generate images with excellent visual properties. The conditional generative adversarial network (cGAN) is a widely used framework for image-to-image translation tasks due to its ability to generate images based on not only content but also style [41]. Isola et al. presented a network named Pix2pix for image-to-image translation based on the cGAN framework, where the generator is based on U-Net [20]. Then, a high-resolution network, Pix2pixHD, was developed in line with Pix2pix, which can realize high-resolution image-to-image translation and semantic editing [22]. Pix2pix and Pix2pixHD have shown excellent conversion capabilities in sketch-to-real image conversion and style transfer experiments, but they need a large amount of paired data from different domains, which is sometimes hard to acquire. Based on the ideas of symmetry and circulation, the networks named CycleGAN and DualGAN were proposed, which can utilize unpaired datasets for training [21,23]. Both Pix2pix and CycleGAN aim for one-to-one conversion, that is, the conversion from one domain to another domain. When multi-domain images need to be converted, it takes a long time to retrain a model for each domain translation. Choi et al. presented a network named StarGAN, which can realize multi-domain image translation and only requires one training period [42]. Some methods also try to control certain features in the output image through encoded variables [43]. Lightweight networks for image-to-image translation have also been proposed [44,45]. SAR-to-optical translation is also a part of image-to-image translation. However, there are huge differences between SAR images and optical images due to the datasets and speckle noise, so this case differs from most image-to-image translation tasks. When a network designed for "ordinary" image-to-image translation tasks is applied to SAR-to-optical translation, the outcome is poor. A method designed specifically for SAR-to-optical translation is therefore meaningful.

Deep Learning-Based Methods for SAR Data
Deep learning has been used in SAR image optimization for a variety of purposes. Based on the boundary equilibrium generative adversarial network (BEGAN) proposed in [46], a generative adversarial network for SAR image generation was developed, and it was demonstrated that synthetic data generated by the proposed network could improve the accuracy of classification [47]. Chierchia et al. [48] presented a deep learning-based method to remove the speckle noise in SAR images, using a network based on the residual network presented in [49]. The results came close to those of some state-of-the-art denoising methods for SAR images, which shows the potential of deep learning-based methods for SAR images. In order to enhance the quality of SAR images, the dialectical generative adversarial network (Dialectical GAN) was proposed to generate TerraSAR-X data with a ground-range resolution of 2.9 m from Sentinel-1 data with a ground-range resolution of 20 m, which is similar to the effect of super-resolution in computer vision [50]. In addition, researchers have discussed the possibility of SAR-to-optical image translation to enhance the utilization of SAR images. Most solutions are based on the cGAN framework. Merkle et al. [25] proposed a method for optical and SAR image matching by converting single-pol SAR images to optical images with a U-Net architecture and cGAN. Wang et al. [26] developed the SAR-GAN network consisting of two sub-networks to perform the despeckling task and coloring task, respectively; however, the two-step design ignores the different imaging principles of SAR images and optical images. Multi-temporal SAR data have also been considered: He et al. [51] developed a method that can generate optical images based on a meticulously designed residual network and cGAN. Some methods first convert SAR images into optical images and then fuse the translated images with cloudy images and SAR images to obtain cloud-free images, which contain RGB information [29,52] or hyperspectral information [28]. Schmitt et al. [35] published the SEN1-2 dataset, containing 282,384 pairs of corresponding image patches, which provides sufficient training data for the SAR-to-optical image translation task. cGAN requires strictly corresponding datasets, and the quality of the datasets seriously affects the training results. Mario et al. [5] leveraged the unsupervised learning network CycleGAN [21] for SAR-to-optical image translation and discussed the fundamental limitations affecting SAR-to-optical image translation. Wang et al. [18] presented the supervised cycle-consistent adversarial network (S-CycleGAN) based on Pix2pix and CycleGAN to keep both the land cover and structural information. Furthermore, some methods that consider SAR image characteristics have been proposed. Zhang et al. [30] developed a feature-guided method with DCT loss, and Zhang et al. [31] utilized edge information to assist with SAR-to-optical image translation. However, these methods are usually simply modified versions of networks for general image-to-image translation that were not designed for SAR-to-optical image translation. In SAR-to-optical translation, we hope to recover an optical image with well-defined lines. However, SAR images contain strong speckle noise, the edges of the image may be ignored in the standard convolution, and the weight of the convolution kernel is content-independent, so the output image has poor definition and blurred structural edges. Unlike the previously described methods, the proposed EPC and EPCGAN were designed for SAR-to-optical translation based on the characteristics of optical images and SAR images.

Methods
In this section, we first introduce the edge-preserving convolution. Then we present the details of edge-preserving convolutional generative adversarial networks and loss functions, accordingly.

Edge-Preserving Convolution
The convolutional neural network, a pioneering achievement, is described in [53]. It is one of the most widely used network structures in deep learning. In a standard convolution layer, the feature maps are convolved with a convolution kernel of a specified size. The standard convolution layer has far fewer parameters and a far lower computational load during training than a fully connected layer, which makes deeper networks practical and decreases the difficulty of training. The weights of the convolutional layer are spatially shared but also content-insensitive. Formally, the standard convolution from image features $X$ with $c$ channels to image features $X'$ with $c'$ channels can be written as:
\[ X'(p) = \sum_{p' \in \Phi(p)} W[p - p']\, X(p') + b, \tag{1} \]
where $W \in \mathbb{R}^{c' \times c \times k \times k}$ are the weights of the convolution kernel, $p$ are pixel coordinates in the image features, $\Phi(\cdot)$ is the $k \times k$ neighborhood around the pixel coordinate of the input and $b \in \mathbb{R}^{c'}$ denotes the biases. With a slight abuse of notation, we use $[p - p']$ to denote the indexing of the spatial dimensions of an array with 2D spatial offsets. It can be seen from Equation (1) that the weight multiplying each pixel in the convolutional layer depends only on its position. Once a convolutional neural network is trained, the same convolutional filter bank is applied to all images and all pixels, regardless of their content. Therefore, the structural information and details of the image are ignored, which limits the quality of the output image from the network.
To address this limitation, we draw on traditional edge-aware decomposition methods to improve the convolution operation. Image decomposition techniques are widely used in traditional edge-aware image operators to achieve image enhancement [54][55][56], and they are also used for the processing of SAR images [57,58]. Traditional decomposition methods can be summarized as:
\[ X' = g(\hat{X}) + f(\tilde{X}), \]
where $\hat{X} = E(X)$ is the content component; $E(\cdot)$ is the operation of extracting content from an image, which is usually an edge-aware filter; $\tilde{X} = X - \hat{X}$ is the texture component, which is the difference between the image and the content component; and $g(\cdot)$ and $f(\cdot)$ are different processes for the content component and texture component, which may be non-linear functions. These traditional edge-aware decomposition methods leverage edge-aware filters to obtain the content component, which is usually considered to consist of the low-frequency components of the image, and the texture component, which is usually considered to consist of the high-frequency components of the image. Applying different modifications to the content component changes the contrast and tone of the image, and the image can be sharpened by enhancing the texture component. Because the gradient of the image is considered to contain the texture information of the image, we first extract the gradient of the image as the texture component and keep the texture component unchanged, and then perform a convolution operation on the content component. Since the module we designed is not intended to change the number of channels in the feature map, the number of channels is fixed to $c$ in what follows.
Following Equation (1), the standard convolution of the content component $\hat{X}_c$ can be defined as:
\[ \hat{X}'_c(p) = \sum_{p' \in \Phi(p)} W[p - p']\, \hat{X}_c(p') + b. \]
Inspired by the PAC algorithm in [34], an edge-preserving kernel $k[p - p']$ is proposed to modify the standard convolutional filter weights adaptively, according to the features in the content component. The edge-preserving kernel $k[p - p']$ is generated from the difference between the value of $\hat{X}_c(p)$ and the surrounding pixel values, which provides the amplitude of the edge and small-scale detail. The kernel defined in Equation (6) is a modified Gaussian function: $\sigma$ is the standard deviation, which controls the degree of edge retention in the convolution, and $\upsilon(p)$ is a regularization parameter added to limit the range of differences, defined in Equation (7). Combined with Equation (6), the processing of the content component in edge-preserving convolution can be defined as:
\[ \hat{X}'_c(p) = \sum_{p' \in \Phi(p)} k[p - p']\, W[p - p']\, \hat{X}_c(p') + b. \tag{8} \]
The kernel $W[p - p']$ is the same as the standard convolution kernel, whose purpose is to learn the correspondence between the SAR image and the optical image through training. The edge-preserving kernel $k[p - p']$, which is generated from the content of each convolution window, keeps the edges in the content component by decreasing the influence in Equation (8) of pixels whose amplitudes differ from that of the center pixel.
After the content component undergoes the operation in Equation (8), we merge the texture component and the processed content component to obtain the new features. The resulting edge-preserving convolution (Figure 2) is given by Equation (9). This operation can be used in image restoration or translation; it preserves edges in each convolution and implements content-adaptive enhancement.
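The steps above can be illustrated with a minimal single-channel NumPy sketch. It is not the authors' implementation. The assumptions are: the texture component is supplied by the caller, the content component is the input minus the texture, the edge-preserving kernel is a plain Gaussian of the difference to the center pixel (the regularization term is omitted), and the two components are merged by addition. The function name `edge_preserving_conv` and its parameters are hypothetical.

```python
import numpy as np

def edge_preserving_conv(x, texture, w, sigma=1.0, bias=0.0):
    """Sketch of a content-adaptive convolution on one channel.

    x       : 2D input feature map
    texture : 2D texture component (e.g. a gradient map), kept unchanged
    w       : k x k filter weights (learned in a real network)
    sigma   : width of the assumed Gaussian edge-preserving kernel
    """
    x = np.asarray(x, dtype=float)
    texture = np.asarray(texture, dtype=float)
    k = w.shape[0]
    r = k // 2
    content = x - texture                      # assumed decomposition
    padded = np.pad(content, r, mode="edge")
    out = np.empty_like(content)
    for i in range(content.shape[0]):
        for j in range(content.shape[1]):
            window = padded[i:i + k, j:j + k]
            diff = window - content[i, j]      # difference to the center pixel
            # Assumed kernel form: pixels that differ from the center get small weights.
            kernel = np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
            # Cross-correlation indexing, as in common deep learning frameworks.
            out[i, j] = np.sum(kernel * w * window) + bias
    return texture + out                       # assumed merge by addition

# A vertical step edge filtered with a 3 x 3 averaging kernel.
step = np.zeros((6, 6))
step[:, 3:] = 1.0
w = np.ones((3, 3)) / 9.0
result = edge_preserving_conv(step, np.zeros_like(step), w, sigma=0.1)
```

With a small `sigma`, pixels across the step contribute almost nothing, so the dark side of the edge stays near zero instead of being blurred toward 1/3 as a plain averaging filter would do. The weights are not renormalized in this sketch, so the bright side is attenuated; in a trained network the learned filter would absorb that scaling.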

Network Framework
In addition to speckle noise, there always exist great differences between optical images and SAR images of the same scene, which are mainly due to their different imaging concepts [5]. The physical properties of the objects' surfaces will be highlighted in the SAR image, but the optical image provides more structural details; hence, the design of network is a problem that needs careful consideration. cGAN is an effective choice that can enhance the visual likeness of output image by GAN and the intensity constraint of the conversion process with the pixel loss between the output image and the target image. However, while some obvious features in the SAR image or optical image may not be obvious at all in the other, the loss of strong constraints would make the network unstable and produce blurry results with missing structural information of some objects. CycleGAN is another choice, which does not rely on the strong constraint loss function between the output image and the target image. CycleGAN can preserve the structural information well, but some land cover information is lost and translation errors may occur without a strong constraint loss function.
The edge-preserving convolutional generative adversarial network was designed based on the CycleGAN framework, but a strong constraint loss between the input image and output image is added. To reduce the negative effect of this strong constraint loss, we add several other losses. In addition, our network has a branch structure to make better use of the structural information in the input image. The overall framework of EPCGAN is shown in Figure 3.

Generator
Based on the proposed EPC, we designed a generator that contains a gradient branch. The backbone network utilizes the proposed EPC to extract features and merges the information provided by the gradient branch to output the converted image. The gradient part takes the gradient of the input image as the input, continuously integrates the auxiliary information provided by the backbone network, and finally feeds back to the backbone network for the final image reconstruction. Detailed information is shown in Figure 4.
The backbone network first applies a 7 × 7 convolution and the proposed EPC for effective feature extraction from the image. After that, the size of the feature map is reduced through a convolution layer to reduce the network parameters, which has been proven to be effective in image-to-image translation [18]. We incorporate the feature maps from the 3rd, 6th and 9th blocks into the gradient branch as auxiliary information and introduce the residual-in-residual dense block (RRDB) proposed in [59] to fuse the feature map of the backbone network with the output of the gradient branch.
The goal of the gradient branch is to estimate the conversion of the gradient map between the SAR image and the optical image. The gradient branch first obtains the gradient map from the input image, just as the proposed EPC does. The gradient map can be obtained by calculating the differences between pixels, which can be realized by a convolutional layer with a fixed kernel. We denote by $\alpha(\cdot)$ the operation that extracts the gradient map and by $z = (x, y)$ the pixel coordinates in image $F$. The gradient branch continuously combines the feature maps from the backbone network in order to restore the gradient map, which implicitly reflects whether the recovered regions should be sharp or smooth. In the generator, we provide the feature map generated by the penultimate layer of the gradient branch to the backbone network. At the same time, we use these feature maps as input to generate the output gradient map through a 1 × 1 convolution layer.
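As an illustration of extracting a gradient map with fixed convolution kernels, the sketch below uses central differences followed by the gradient magnitude. This particular operator is an assumption (one common choice), not a form confirmed by the text; the names `KX`, `KY`, `correlate3x3` and `gradient_map` are hypothetical.

```python
import numpy as np

# Fixed (non-trainable) 3 x 3 kernels for horizontal and vertical central differences.
KX = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]], dtype=float)
KY = KX.T

def correlate3x3(img, kernel):
    """Apply a 3 x 3 kernel by cross-correlation with zero padding."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="constant")
    out = np.zeros_like(img)
    rows, cols = img.shape
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * padded[a:a + rows, b:b + cols]
    return out

def gradient_map(img):
    """Gradient magnitude from the two fixed difference kernels."""
    gx = correlate3x3(img, KX)   # F(x+1, y) - F(x-1, y)
    gy = correlate3x3(img, KY)   # F(x, y+1) - F(x, y-1)
    return np.sqrt(gx ** 2 + gy ** 2)

# A horizontal ramp has a constant horizontal difference in its interior.
ramp = np.tile(np.arange(5, dtype=float), (5, 1))
g = gradient_map(ramp)
```

Because the kernels are constants, the same operation can be expressed in a deep learning framework as a convolution layer whose weights are frozen, which matches the "fixed kernel" description above.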

Discriminator
The discriminator adopts the PatchGAN architecture, consisting of five convolutional layers, which has effective discrimination ability with fewer parameters [20]. Each convolutional layer is followed by a leaky ReLU, and a sigmoid output layer is placed at the end for classification. The advantage of this design is that local image patches, rather than the whole image, are discriminated, so that local detail can be judged better.

Loss Function
There are two generators and two discriminators in the proposed EPCGAN to learn the translation between the SAR image domain $X$ and the optical image domain $Y$ with paired data. The generator $G_1$ attempts to generate an image $G_1(x)$ that looks similar to the optical image based on the input SAR image $x$, and the discriminator $D_Y$ aims to distinguish the real optical image $y$ from the generated optical image $G_1(x)$. In the same way, the generator $G_2$ generates an image $G_2(y)$ that looks similar to the SAR image from the input optical image $y$, and the discriminator $D_X$ is designed to distinguish the real SAR image $x$ from the generated SAR image $G_2(y)$. The adversarial losses $L_{GAN}(G_1, D_Y)$ and $L_{GAN}(G_2, D_X)$ are defined for these two generator-discriminator pairs.

Cycle consistency loss was proposed in CycleGAN [21]. Each input SAR image $x$ is converted to $G_1(x)$ by the generator $G_1$ and then converted to $G_2(G_1(x))$ by the generator $G_2$. The input $x$ is expected to be consistent with $G_2(G_1(x))$, and the cycle-consistency loss $L_{cyc}$ penalizes the difference between them.

Most SAR-to-optical image translation methods optimize well-designed networks through a common pixel loss, which can reduce the average pixel difference between generated optical images and real optical images, but may lead to blurry results with loss of structural information. We also use a pixel loss to accelerate convergence and improve SAR-to-optical performance. Since there are two generators in our network, the pixel loss $L_{pix}$ contains one term for each generator.

In order to improve the perceptual quality of the generated image, the concept of perceptual loss was proposed in [60]. The features containing semantic information are extracted from a pretrained VGG network. The perceptual loss $L_{per}$ minimizes the Euclidean distances between the features of the input images and those of the generated ones, where $\phi_i(\cdot)$ denotes the $i$th layer output of the pretrained VGG model.
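For orientation, the four losses described above can be written in the forms that are standard in the GAN and CycleGAN literature. These are assumed forms given as a sketch: the choice of the L1 norm for the cycle and pixel terms, the log-likelihood form of the adversarial loss, the inclusion of both cycle directions, and the pairing of feature maps in the perceptual term are assumptions rather than details confirmed by the text.

```latex
\begin{align}
L_{GAN}(G_1, D_Y) &= \mathbb{E}_{y}\big[\log D_Y(y)\big]
                   + \mathbb{E}_{x}\big[\log\big(1 - D_Y(G_1(x))\big)\big], \\
L_{GAN}(G_2, D_X) &= \mathbb{E}_{x}\big[\log D_X(x)\big]
                   + \mathbb{E}_{y}\big[\log\big(1 - D_X(G_2(y))\big)\big], \\
L_{cyc} &= \mathbb{E}_{x}\big[\lVert G_2(G_1(x)) - x \rVert_1\big]
         + \mathbb{E}_{y}\big[\lVert G_1(G_2(y)) - y \rVert_1\big], \\
L_{pix} &= \mathbb{E}_{x,y}\big[\lVert G_1(x) - y \rVert_1\big]
         + \mathbb{E}_{x,y}\big[\lVert G_2(y) - x \rVert_1\big], \\
L_{per} &= \sum_{i} \mathbb{E}_{x,y}\big[\lVert \phi_i(G_1(x)) - \phi_i(y) \rVert_2\big].
\end{align}
```

In each adversarial term the discriminator is trained to increase the value and the generator to decrease it; the remaining terms are minimized by the generators only.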
If the model is only optimized by L1 loss or MSE loss in the image space, we usually obtain images with blurry edges even when the ground truth has sharp edges. In order to enhance the structural information of the generated optical image as much as possible, we propose a gradient loss $L_{grad}$ that is calculated from the gradient of the generated image and the gradient of the target image, where $\alpha(\cdot)$ denotes the operation of gradient extraction.

In the proposed EPCGAN, the output of the generator includes the output $G_1(x)$ of the backbone network and the output $G_1^{branch}(x)$ of the gradient branch. The function of the gradient branch in the generator is to extract effective structural information from the input image to assist with image translation. In order to constrain the gradient branch, we use the distance between the output of the gradient branch and the gradient map of the target image as a branch loss $L_{branch}$ that constrains the updating of the gradient branch parameters.

In summary, we have two discriminators $D_X$ and $D_Y$, which are optimized with $L_{GAN}(G_1, D_Y)$ and $L_{GAN}(G_2, D_X)$. For the generator, $L_{GAN}$ and $L_{cyc}$ are used to improve the visual realism of the output image while maintaining the structures. $L_{pix}$ and $L_{per}$ provide corresponding constraints based on the pixel distance between the generated image and the target image. Gradient loss and branch loss cooperate with each other to improve the structural information of the output image according to the pixel distance between the generated image and the target image. The overall objective combines these terms, where $\lambda_{cyc}$, $\lambda_{pix}$, $\lambda_{per}$, $\lambda_{grad}$ and $\lambda_{branch}$ denote the weight parameters of the different losses.
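A minimal sketch of how the gradient-based terms and the weighted generator objective could be combined is given below. The use of a mean absolute (L1) distance for the gradient and branch losses, the unit weight on the adversarial term, and the function names are assumptions; the default weights of 10 follow the values reported in the Training Details section.

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute difference between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

def generator_objective(l_gan, l_cyc, l_pix, l_per, l_grad, l_branch,
                        lam_cyc=10.0, lam_pix=10.0, lam_per=10.0,
                        lam_grad=10.0, lam_branch=10.0):
    """Weighted sum of the generator losses (adversarial weight assumed to be 1)."""
    return (l_gan
            + lam_cyc * l_cyc
            + lam_pix * l_pix
            + lam_per * l_per
            + lam_grad * l_grad
            + lam_branch * l_branch)

# Gradient loss: distance between gradient maps of generated and target images.
# Branch loss: distance between the branch output and the target gradient map.
target_grad = np.array([[0.0, 1.0], [1.0, 0.0]])
generated_grad = np.array([[0.0, 0.5], [1.0, 0.0]])
branch_output = np.array([[0.0, 1.0], [0.0, 0.0]])

l_grad = l1_distance(generated_grad, target_grad)
l_branch = l1_distance(branch_output, target_grad)
total = generator_objective(1.0, 0.1, 0.1, 0.1, l_grad, l_branch)
```

The gradient maps here are tiny hand-written arrays for illustration; in training they would come from applying the gradient extraction operator to the generated and target images.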

Experiments
In order to prove the effectiveness of the proposed method, we conducted experiments with some methods for SAR-to-optical translation based on the same training set and test set, which were selected from the SEN1-2 dataset [35].

Dataset
The selection of the experimental dataset is a very important issue when proving the robustness of any method. SEN1-2, containing 282,384 paired image blocks collected from across the globe and throughout all meteorological seasons, has been proven to be usable for SAR-to-optical translation tasks. These image blocks were obtained by medium-range clipping from multiple paired SAR and optical images, and the size of each image block is 256 × 256 pixels. A common method of selecting datasets is to take some image blocks from a picture as the training set and some other image blocks as the test set, under the condition of ensuring that there are no overlapping pixels in the two kinds of image blocks. When the paired data resources for the task are difficult to obtain, this method is indeed reasonable. However, there is always a high degree of similarity between image blocks that come from the same large picture. When the network is trained with image blocks from the same picture as the test set, the model will perform better on the test set than it should, and the robustness of the model cannot be reflected in such experimental results.
We selected 1551 pictures from SEN1-2 as the training set, which contained multiple terrain types, including forests, lakes, mountains, rivers, buildings, farmlands, roads and bridges. At the same time, we selected pictures to form four test sets, Test_1, Test_2, Test_3 and Test_4, to evaluate the model. Test_1 contained 289 image blocks with various terrains, which were used to evaluate the performance of the model. The image blocks in Test_1 that came from the same large pictures as image blocks in the training set were collected into a separate set named Test_2. In addition, we added Test_3 and Test_4, which contained 62 image blocks and 111 image blocks, respectively. These two datasets show mountains and urban suburbs with complex layouts, and their image blocks came from large pictures that did not participate in the training of the model; therefore, Test_3 and Test_4 were completely unseen datasets. They were used to evaluate the robustness of our method. Details of each dataset are tabulated in Table 1.
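The leakage concern discussed above, where blocks cut from the same large picture end up in both training and test data, can be avoided by splitting at the level of source pictures rather than individual blocks. The sketch below illustrates that idea only; it is not the selection procedure used for the datasets in this paper, and the function name `split_by_scene` and the `(patch_id, scene_id)` input format are hypothetical.

```python
import random
from collections import defaultdict

def split_by_scene(patches, test_fraction=0.2, seed=0):
    """Split patches so that no source scene appears in both train and test.

    patches: list of (patch_id, scene_id) pairs.
    Returns (train_patch_ids, test_patch_ids, test_scene_ids).
    """
    by_scene = defaultdict(list)
    for patch_id, scene_id in patches:
        by_scene[scene_id].append(patch_id)

    scenes = sorted(by_scene)                 # deterministic order before shuffling
    random.Random(seed).shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_fraction))

    test_scenes = set(scenes[:n_test])
    train = [p for s in scenes[n_test:] for p in by_scene[s]]
    test = [p for s in scenes[:n_test] for p in by_scene[s]]
    return train, test, test_scenes

# Twenty patches drawn from five source scenes.
patches = [(f"p{i}", f"s{i % 5}") for i in range(20)]
train, test, test_scenes = split_by_scene(patches)
```

A test set built this way plays the role of a completely unseen set such as Test_3 or Test_4, whereas a block-level random split would resemble Test_2.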

Training Details
We trained and tested EPCGAN and the other SAR-to-optical methods on the same dataset. For each model, we used the same preprocessing method, and random rotation and flipping were utilized to avoid overfitting. The Adam optimizer [61] with β1 = 0.5 and β2 = 0.999 was used for the optimization of EPCGAN. In particular, the two generators in EPCGAN shared one Adam optimizer, and the two discriminators shared another. EPCGAN was trained for 200 epochs with a batch size of 1. We set the learning rates to 2 × 10^-4 for both the generator and the discriminator, and linearly reduced them to zero starting from epoch 100. As for the loss weights, λ_cyc was set to 10 following the settings in [21], and λ_pix, λ_per, λ_grad and λ_branch were set to 10 to balance the contributions of the different losses. All the experiments were implemented in PyTorch and trained on NVIDIA RTX 2080 Ti GPUs.
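The learning-rate schedule described above can be sketched as a multiplicative factor applied to the base rate. This is a minimal sketch rather than the training code used in the experiments: the function names, the zero-based epoch indexing and the exact endpoints of the decay (full rate at epoch 100, zero at epoch 200) are assumptions; only the base rate, the 200-epoch length and the epoch at which the decay starts come from the text.

```python
BASE_LR = 2e-4        # learning rate stated in the text
TOTAL_EPOCHS = 200    # training length stated in the text
DECAY_START = 100     # epoch at which the linear decay begins


def linear_decay_factor(epoch: int) -> float:
    """Multiplier for BASE_LR: 1.0 before DECAY_START, then linear to 0.0.

    Assumption: the rate reaches exactly zero at TOTAL_EPOCHS.
    """
    if epoch < DECAY_START:
        return 1.0
    remaining = TOTAL_EPOCHS - epoch
    return max(0.0, remaining / (TOTAL_EPOCHS - DECAY_START))


def learning_rate(epoch: int) -> float:
    """Learning rate used during the given (zero-based) epoch."""
    return BASE_LR * linear_decay_factor(epoch)


if __name__ == "__main__":
    for epoch in (0, 100, 150, 199, 200):
        print(epoch, learning_rate(epoch))
```

In PyTorch, a function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with one scheduler attached to the optimizer shared by the generators and one to the optimizer shared by the discriminators.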

Results and Analysis
To evaluate the proposed EPCGAN quantitatively, we applied peak signal-to-noise ratio (PSNR), mean square error (MSE) and structural similarity (SSIM) for comparison. MSE represents the average squared difference between corresponding pixels. To make the results easy to read, we first rescaled the pixel values from the range 0-255 to the range 0-1 and then calculated the MSE. The smaller the MSE, the smaller the distortion. The PSNR is based on the MSE between corresponding pixels in the reconstructed optical image and the real optical image. The higher the PSNR, the smaller the distortion. Because PSNR treats each pixel equally, its score often deviates from the visual quality perceived by human eyes. Considering the human visual system, we also used SSIM to evaluate the similarities in brightness, contrast and structure. The higher the SSIM, the smaller the distortion. It is worth noting that our framework can convert optical images into SAR images as well as SAR images into optical images. We only discuss the translation from SAR images to optical images here.
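The relationship between MSE and PSNR after rescaling can be illustrated with a short sketch. It assumes single-channel images stored as nested lists of 0-255 values, and the function names `mse` and `psnr` are illustrative; SSIM is not covered here because it involves local window statistics.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[float]]  # rows of pixel values in the range 0-255


def mse(generated: Image, reference: Image) -> float:
    """Mean square error after rescaling pixel values from 0-255 to 0-1."""
    total, count = 0.0, 0
    for row_g, row_r in zip(generated, reference):
        for g, r in zip(row_g, row_r):
            diff = (g - r) / 255.0
            total += diff * diff
            count += 1
    return total / count


def psnr(generated: Image, reference: Image) -> float:
    """PSNR in dB; the peak value is 1.0 because pixels are rescaled to 0-1."""
    error = mse(generated, reference)
    if error == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(1.0 / error)


if __name__ == "__main__":
    generated = [[0, 255], [255, 0]]
    reference = [[0, 255], [255, 255]]
    print(mse(generated, reference), psnr(generated, reference))
```

Because PSNR is a monotonic function of MSE, a method that minimizes a pixel-wise error during training tends to score well on both, regardless of how sharp the result looks.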
We compare the proposed method quantitatively with several methods for SAR-to-optical translation, including Pix2pix [20], CycleGAN [21] and S-CycleGAN [18]. Pix2pix and CycleGAN are well-known methods for image-to-image translation that have been shown to be feasible for SAR-to-optical translation in previous works [5,24]. S-CycleGAN was proposed in [18] for SAR-to-optical translation and combines pixel loss with cycle-consistency loss. The PSNR and SSIM values are presented in Table 2. In each row, the best results are highlighted in bold. On all the testing datasets, the proposed EPCGAN achieved the best PSNR and SSIM performance. Pix2pix obtained good PSNR values compared with the other methods and achieved the second highest PSNR on Test_3, second only to EPCGAN; however, the SSIM values acquired by Pix2pix were the lowest on all the testing datasets due to the L1 loss used in training. The L1 loss is calculated from the difference between the pixels of the generated image and the target image, which is similar to the calculation principle of MSE. Thus, Pix2pix behaves like a PSNR-oriented SAR-to-optical method, which tends to produce relatively blurry results with high PSNR values.
We also visually compare these SAR-to-optical methods. From Figure 5, we see that the results of EPCGAN have better structural information and visual effects than those of the other methods. For the first image, EPCGAN successfully restored the road, which is only vaguely visible in the input SAR image, indicating that our method is capable of capturing structural characteristics in SAR images. At the same time, the edges of the recovered port are more regular than those produced by the other methods, which suggests that EPCGAN can effectively constrain the edges of objects in the generated image through the gradient branch. Making full use of the features and structural information in the input SAR image, EPCGAN generates results with better texture in the second and fourth images and more natural and realistic results in the third image.
CycleGAN can generate images with good structural information, but locally incorrect translations often appear in its results (such as the port in the first image, the building in the fourth image and the additional artifacts in the third image) because no pixel loss between the generated image and the target image is used during training. Besides the adversarial loss, Pix2pix uses only the L1 loss to update the network during training, which leads to disappointing visual effects when Pix2pix is applied to SAR-to-optical translation. We cannot distinguish the river in the result generated by Pix2pix for the fourth image, as it includes a number of undesirable artifacts. The first and fourth images were from Test_3 and Test_4, whose image blocks did not come from any large picture used for the training set; this suggests that Pix2pix may have insufficient robustness when applied to SAR-to-optical translation. S-CycleGAN can generate better results than Pix2pix and CycleGAN, but the structural information and edges in its generated images are often not recovered well. The visual comparison indicates that our proposed method can better utilize and maintain the structural information in the SAR image thanks to the gradient branch and the proposed EPC, which helps to generate optical images that are easier to detect and recognize.

A Comparison of Textural and Structural Information
The gradient information of an image reflects its texture and structure well. To demonstrate the effectiveness of our method in restoring image texture and structure, we computed gradient maps of the images generated by the different methods for the last example in Figure 5; the results are shown in Figure 6. We can see that there are great differences in textural information between the SAR image and the optical images, and that speckle noise in the SAR image seriously pollutes the texture information. Pix2pix had poor visual results on the unseen dataset. CycleGAN and S-CycleGAN can reduce the influence of speckle noise, but the structures of roads and buildings cannot be restored well. The proposed EPCGAN created the image with the best textural and structural information. It is worth noting that there was still a gap between the textural information of the images generated by EPCGAN and that of the optical images, which was due to the limited information contained in SAR images; the reasons were discussed in our introduction. Higher-resolution SAR data are expected to reduce the impacts of these factors.
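A gradient map of the kind compared in Figure 6 can be sketched as follows. The text does not state which operator was used to produce the figure, so the Sobel operator here is an assumption, as are the function name `gradient_map`, the single-channel nested-list input and the choice to leave border pixels at zero.

```python
import math
from typing import List

# Sobel kernels for horizontal and vertical intensity changes (assumed operator).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def gradient_map(image: List[List[float]]) -> List[List[float]]:
    """Gradient magnitude of a grayscale image; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    grad = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    pixel = image[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * pixel
                    gy += SOBEL_Y[dy + 1][dx + 1] * pixel
            grad[y][x] = math.hypot(gx, gy)
    return grad


if __name__ == "__main__":
    # A vertical step edge: the response is concentrated along the edge.
    step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    for row in gradient_map(step):
        print(row)
```

Flat regions give a zero response while edges give a large one, which is why speckle noise, with its strong pixel-to-pixel variation, shows up so prominently in the gradient map of a SAR image.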

Model Complexity Analysis
In this section, the influence of the proposed EPCGAN on model complexity is studied. We summarize the parameters and floating-point operations (FLOPs) of the proposed EPCGAN and the other SAR-to-optical translation methods compared in Section 4.2. The number of model parameters refers to the number of parameters in the network that need to be updated during training, which affects the network's demand on GPU memory. Generally, researchers hope to obtain better performance with fewer parameters, since a model with fewer parameters can be deployed more easily in industrial scenarios. FLOPs are used to measure the computational complexity of the model. Since the SAR-to-optical translation is realized by the generator after the network is trained, we only calculated the parameters and FLOPs of a single generator. Figure 7 illustrates the PSNR values, SSIM values and parameter numbers of the different methods on Test_1. Compared with the other methods, the proposed EPCGAN had a smaller model and better performance. It should be noted that the proposed EPC achieves edge-preserving and content-adaptive convolution without introducing extra parameters; its parameter count equals that of a standard convolutional layer with the same specifications. Table 3 lists the training time and FLOPs of the different methods. While EPCGAN achieved the best results with good structure and texture information, its FLOPs were higher and more training time was required due to the use of RRDB and the gradient branch. It is worth noting that EPC introduces the calculation of multiple variables during back propagation, which extends the network training time. We are considering optimizing this part in future work.
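As a rough illustration of how the two complexity measures are obtained for a single layer, the following sketch counts the parameters and FLOPs of a standard 2D convolution. The layer sizes in the example are hypothetical and are not taken from EPCGAN, and counting one multiply-accumulate as two FLOPs is one common convention among several.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int,
                bias: bool = True) -> int:
    """Number of trainable parameters in a standard 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv_flops(kernel_size: int, in_channels: int, out_channels: int,
               out_height: int, out_width: int) -> int:
    """Approximate FLOPs of one forward pass (bias ignored).

    Assumption: one multiply-accumulate counts as two FLOPs.
    """
    macs_per_output = kernel_size * kernel_size * in_channels
    return 2 * macs_per_output * out_channels * out_height * out_width


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 -> 64 channels, 256 x 256 output.
    print(conv_params(3, 64, 64))            # 36928
    print(conv_flops(3, 64, 64, 256, 256))   # 4831838208
```

The parameter count depends only on the kernel shape, whereas the FLOPs also scale with the spatial size of the feature map, so a layer that modulates its weights per window can keep the same parameter count as a standard convolution while costing more computation.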

Ablation Experiment
Our method uses the proposed EPC and the gradient branch, both of which play unique roles. To verify their effectiveness, we conducted an ablation study. It should be noted that in EPCGAN (without EPC), we only remove the EPC from the network structure, and all loss functions are retained for training. For EPCGAN (without gradient branch), we remove the gradient branch, and the gradient loss is removed during training. We trained on the same training set and tested on the four test sets.
EPCGAN achieved the highest SSIM values on all test sets in Table 4, indicating that the complete method had better results in terms of structure and vision. Both the EPC and the gradient branch could effectively improve the quality of translated images, but due to the difficulty of the task, the complete EPCGAN achieved only a modest improvement over EPCGAN (without EPC) and EPCGAN (without gradient branch). We also performed a visual comparison in the ablation experiment. For the second image in Figure 8, without the EPC and gradient branch, the bridge in the generated optical image was rendered with small irregular bends, which is inconsistent with the real scene. The bridge in the image generated by the model without the gradient branch is less obvious due to the lack of the overall gradient auxiliary information provided by the gradient branch. Additionally, without the content-adaptive EPC, buildings that are not obvious enough in the gradient map are also not obvious enough in the generated optical image. The complete EPCGAN generated the optical image with the best visual effects and good structure.

Goals and Difficulties for SAR-to-Optical Translation
SAR-to-optical translation is a difficult task due to the huge differences between SAR images and optical images. In most image-to-image translation tasks that transform images from one domain to another, the images in the two domains are different but often have a strong connection. For example, converting a character photo into an anime picture is an image-to-image translation task in which the contours of the characters provide a reliable basis for generating the anime picture, and cGANs can then be utilized to generate visually realistic corresponding pictures. However, SAR-to-optical translation is different for multiple reasons.
First of all, as we mentioned in Section 1, there exists a big gap between an SAR image and its corresponding optical image due to the imaging principle. Some features in the optical image will not be reflected in the SAR image. Accordingly, some obvious targets in the optical image may be indistinguishable from the surrounding environment in the SAR image. The SAR image can provide detailed surface characteristics of the object, so the different coverage information will be obvious in the SAR image. However, the same kind of coverage information may have many different appearances in an optical image; for instance, deep water and shallow water often appear almost the same in an optical image, and optical images of the same place obtained under different weather and lighting conditions are very different. All these factors create great difficulties for SAR-to-optical translation.
Secondly, there is severe speckle noise and possible geometric distortion in SAR images. Speckle noise masks the useful information and hampers feature extraction, and distortion seriously affects the translation, usually resulting in a distorted generated optical image.
Finally, unlike other image-to-image translation tasks whose goal is to produce an overall visually realistic effect, we hope to obtain a sufficiently realistic optical image through SAR-to-optical translation. However, due to the different resolutions of remote sensing and the reasons mentioned above, it is difficult to recover optical images with excellent visual effects.
Based on the points we discussed above, we can understand that SAR-to-optical translation is a unique and difficult task. This leads to serious performance degradation when many image-to-image translation methods are directly applied to SAR-to-optical translation, and it is very necessary to design the network structure, loss function and preprocessing method according to the characteristics of the SAR image. In addition, because the optical image does not match the information in the SAR image, the goal of this conversion should be to use as much information in the SAR image as possible to output an optical image with a better structure. Our method was designed based on this.

Comparative Analysis of PAC and EPC
Pixel-adaptive convolution (PAC), proposed in [34], modifies the weight of the filter according to the content in each window and achieves excellent performance. However, the weight modification is not effectively restricted in PAC, and the content may have an excessive influence on the weight of the filter. Therefore, in the experiments in [34], the feature map used to influence the weight of the filter usually has a very small coefficient, obtained through multiple adjustments, to avoid excessive influence. At the same time, the coefficient may not be optimal for each image due to differences between images. In EPC, we effectively restrict the weight modification through regularization. In addition, the structural information in SAR images may be lost or blurred when PAC modifies the filter weight, whereas our method can effectively retain the texture information of the SAR image. To verify the effectiveness of EPC, we conducted a comparative experiment in which the only difference between EPCGAN and EPCGAN (PAC) was whether EPC or PAC was used as the convolution module.
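The sensitivity of PAC to its guidance features can be seen in a sketch of the response at a single window. It assumes the Gaussian adapting kernel commonly used with PAC, a scalar guidance feature per pixel, a single channel and no bias; the names `pac_window_response`, `values`, `guide` and `weights` are illustrative. The regularization used in EPC is not shown, because its exact form is not given in this section.

```python
import math
from typing import List

Window = List[List[float]]  # one k x k window


def pac_window_response(values: Window, guide: Window, weights: Window) -> float:
    """Response of a pixel-adaptive convolution at the centre of one window.

    Each filter weight is multiplied by a Gaussian kernel of the difference
    between the guidance feature at the centre and at that position.
    """
    k = len(weights)
    centre = guide[k // 2][k // 2]
    response = 0.0
    for row in range(k):
        for col in range(k):
            adapt = math.exp(-0.5 * (guide[row][col] - centre) ** 2)
            response += adapt * weights[row][col] * values[row][col]
    return response


if __name__ == "__main__":
    ones = [[1.0] * 3 for _ in range(3)]
    uniform_guide = [[0.0] * 3 for _ in range(3)]
    outlier_guide = [[100.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(pac_window_response(ones, uniform_guide, ones))  # standard convolution
    print(pac_window_response(ones, outlier_guide, ones))  # one weight suppressed
```

With a uniform guide the result equals that of a standard convolution, while a large difference in the guide drives the corresponding weight towards zero. This is the unrestricted influence discussed above, and it explains why the guidance features need a small, hand-tuned coefficient.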
Our method achieved the highest SSIM and PSNR values on all test sets (see Table 5), indicating that the EPC method had better results for the mean square error, structure and vision. We also provided a visual comparison in Figure 9. EPC can achieve clearer texture edges and better visual effects.

Network Structure and Loss Function for SAR-to-Optical Translation
For the overall network framework, cGAN is currently the most effective solution for SAR-to-optical translation, as it leverages GAN to enhance the visual quality of the generated images. Nevertheless, the loss function needs to be designed according to the characteristics of SAR-to-optical translation. The generated image will be blurry with poor structure if only the pixel loss between the generated image and the target image is used in training. Perceptual loss, DCT loss and some other losses are effective options; they first transform the output image and the target image before calculating the loss. In our method, multiple loss functions are used, which have different effects. To justify our choice, we conducted an ablation study to show the effects of the different loss functions. For the combination of multiple loss functions, when one loss was removed, its influence was clearly reflected in the translation results.
An image quality assessment and a visual comparison are given in Table 6 and Figure 10. When the MSE loss was not used, we obtained a poor PSNR value, and the translation errors seen in CycleGAN also appeared, which indicates that the MSE loss can effectively constrain the translation process. It is worth noting that the MSE loss can be replaced with the L1 loss; both are calculated from the error between pixels. When the VGG loss was not used, the generated image was blurred, and the key target had unsatisfactory visual quality. The images generated with our full method had good structural edges due to the use of the gradient loss. When the gradient loss was not used, we found that the edge of the port was very irregular and blurred, which indicates that our gradient loss helps the recovery of image edges. These observations support the rationality and effectiveness of our loss function.
It is effective to extract additional information from SAR images to assist with generating optical images. The edge information of SAR images contains a lot of information, which is helpful for the generation of optical images. The proposed method provides auxiliary information for the reconstruction of the optical image according to the gradient map of the input image through the gradient branch, which is proved to be effective in the experiment. In addition, owing to the limited information possessed by single-channel SAR images, the use of multi-pol SAR images for image restoration is also a direction worth exploring.

Conclusions
In this paper, we summarized the difficulties and goals in SAR-to-optical translation based on a discussion of the characteristics of optical images and SAR images. After that, we proposed the EPCGAN and EPC for SAR-to-optical translation and conducted comparative experiments and ablation studies that demonstrated the excellent performance of EPCGAN against the other methods for SAR-to-optical translation. A trained standard convolution is content agnostic, which causes the model to ignore some of the content features that we hope to see reflected in the generated optical image when facing complex SAR images. By combining traditional decomposition methods, we developed a novel EPC to perform content-adaptive convolution on SAR images while maintaining the texture characteristics of the SAR image. The EPC decomposes the content of convolution windows based on the texture component extracted by the edge extraction operator and achieves content-adaptive convolution by multiplying convolutional filter weights with an edge-preserving kernel generated from the content component in each window. Based on the proposed EPC, a new model, EPCGAN, was introduced for SAR-to-optical translation tasks. EPCGAN has two generators and two discriminators based on the CycleGAN framework, which can learn SAR-to-optical translation and optical-to-SAR translation at the same time. Since an SAR image contains very rich structural information, we designed the gradient branch in the generator of EPCGAN to leverage the edge information in an SAR image, which contains abundant useful information and basic features of the target structure. The introduction of edge information through the gradient branch and the proposed EPC effectively improves the structural quality of the generated image. The objects in the generated image have clearer edges, and the generated image appears more realistic and natural.
At the same time, EPCGAN has excellent robustness and can handle SAR images with complex terrain, since EPC is content-adaptive. In addition, we discussed the impact of the loss function and the specific network structure on SAR-to-optical translation. These findings provide an important reference for the design of the network structure and loss function in SAR-to-optical translation tasks. Our scheme also provides ideas for how to improve the structural information and visual quality of optical images and how to make full use of the complex information in SAR images. Since the design of EPCGAN is based on network structure, the proposed EPCGAN has the potential to be used in other tasks and become a general method for GANs. In the future, we will consider conducting experiments to explore the possibility of creating a general method for GANs and constructing datasets for detection experiments to demonstrate the utility of our method in practical applications.

Data Availability Statement:
The SEN1-2 dataset can be downloaded free of charge from the library of the Technical University of Munich according to the link in [35].

Conflicts of Interest:
The authors declare no conflict of interest.