Underwater Image Translation via Multi-Scale Generative Adversarial Network

Abstract: Underwater image translation helps generate rare images for marine applications. However, such translation tasks remain challenging because of scarce data, insufficient feature extraction ability, and the loss of content details. To address these issues, we propose a novel multi-scale image translation model based on style-independent discriminators and attention modules (SID-AM-MSITM), which learns the mapping between two unpaired images for translation. We introduce Convolution Block Attention Modules (CBAM) into the generators and discriminators of SID-AM-MSITM to improve its feature extraction ability. Moreover, we construct style-independent discriminators whose discriminant results are not affected by the style of images, so that content details are retained. Ablation and comparative experiments show that the attention modules and style-independent discriminators are well justified and that SID-AM-MSITM performs better than multiple baseline methods.


Introduction
Underwater images play critical roles in diverse marine-related military and scientific applications, such as seabed sediment classification [1], submarine cable detection [2], and mine recognition [3]. However, the complex underwater environment limits the use of camera devices, including Kinect units [4] and binocular stereo cameras [5], which makes it difficult to obtain real underwater images. Intuitively, image translation [6] provides a viable way to obtain such scarce data. Specifically, image translation methods translate source-domain non-underwater images into target-domain underwater ones, which reassigns particular attributes of underwater images. The translated underwater images are valuable for advanced visual tasks such as target detection, 3D reconstruction, and target segmentation [7,8].
Existing image translation methods are generally categorized into two groups, i.e., conventional methods and Generative Adversarial Network (GAN)-based [9] methods. First, conventional methods [10,11] extract low-level features to transfer the texture of input images, or devise Convolutional Neural Network (CNN)-based [12] approaches, such as image style transfer [13], that make use of semantic content for image translation. Second, GAN-based methods, such as supervised models including Pix2Pix [14] and Pix2PixHD [15], and unsupervised models including StyleGAN [16] and StarGAN [17], use generators to translate images from one image domain to another. By comparison, GAN-based methods do not require researchers to design complex loss functions, which saves manual effort.
In view of the immense potential of GANs, researchers have attempted to apply them to underwater image translation tasks. Li et al. [18] propose an unsupervised WaterGAN that uses in-air RGB images and depth maps to generate corresponding realistic underwater images. Wang et al. [19] use an unsupervised image translation method that also takes in-air RGB-D images to generate realistic underwater images. Li et al. [20] generate images with underwater style using in-air RGB images.
Although existing methods have achieved a certain success, they still face multiple challenges. (1) Scarce data. The necessary paired in-air and depth images are too scarce to train translation models. In the absence of data, it is difficult to achieve good results using general image translation methods. (2) Insufficient feature extraction ability. Underwater optics images and underwater sonar images present pronounced colors, which reduce the visibility of objects in translated images and reduce the quality of the translated images. These translated images limit the performance of subsequent advanced computer vision tasks [21]. Therefore, the colors of the underwater images pose challenges to the feature extraction ability of image translation models. (3) Loss of content details. GAN models are sensitive to the style of images (such as color and texture) [22], which makes image translation models ignore the content information of images. Similar to the lack of feature extraction ability, the loss of content details also limits the performance of subsequent advanced computer vision tasks.
With the aim of addressing the above challenges, we propose a multi-scale underwater image translation model based on style-independent discriminators and attention modules (SID-AM-MSITM). Specifically, our contributions mainly include the following:
1. In response to the scarce-data challenge, we construct the backbone model of SID-AM-MSITM based on a fundamental image translator, TuiGAN [23]. TuiGAN conducts image translation tasks based on only two unpaired images, and we thus make further improvements on its encoders and decoders.
2. In response to the challenge of insufficient feature extraction ability, we propose to apply Convolution Block Attention Modules (CBAM) [24] to the generators and discriminators of SID-AM-MSITM. CBAM assigns weights to feature maps along the channel and spatial dimensions and increases the weight of important features, so that SID-AM-MSITM pays attention to meaningful information.
3. In response to the loss of content details, we further improve SID-AM-MSITM by constructing style-independent discriminators. The discriminators give similar results when discriminating images with the same content and different styles, so that SID-AM-MSITM focuses on the content information instead of the style information.
4. We conduct systematic experiments based on multiple datasets, including submarine, underwater optics, sunken ship, crashed plane, and underwater sonar images. Compared with multiple baseline models, SID-AM-MSITM improves the ability to access effective information and retain content details.
The rest of this paper is organized as follows: Section 2 details the methodology of SID-AM-MSITM. Section 3 presents our ablation and comparative experiments and their corresponding analysis. Section 4 concludes this work.

Methodology
In this section, we present our proposed SID-AM-MSITM in detail. With TuiGAN as the backbone architecture, SID-AM-MSITM improves its generators and discriminators by introducing CBAM and makes further improvements by devising style-independent discriminators. Figure 1 shows the architecture of SID-AM-MSITM and the process of translating a non-underwater image into an underwater image.
In the underwater image translation task, the source-domain image is the non-underwater image $I_X$, and the target-domain image is the underwater image $I_Y$. We use SID-AM-MSITM to translate non-underwater images into underwater images, also referred to as generating underwater images. Using SID-AM-MSITM, we also reconstruct translated underwater images into non-underwater images. The original image denotes the initial image that has not been processed by SID-AM-MSITM. The discriminators $\{D^0_Y, D^1_Y, \ldots, D^N_Y\}$ learn the distribution of the target domain using a variety of loss functions for model training, including the WGAN-GP adversarial loss [25], cycle-consistency loss, identity loss, and total variation (TV) loss [26]. Among them, the cycle-consistency loss helps SID-AM-MSITM avoid mode collapse, the identity loss helps it align colors and textures, and the TV loss smooths the generated images. Finally, the translated underwater image $I^0_{XY}$ is obtained at the highest scale. In Figure 1, each discriminator processes the images translated by the generator of the same color.
To obtain the generators $\{G^0_{YX}, G^1_{YX}, \ldots, G^N_{YX}\}$, we also train these generators and the discriminators $\{D^0_X, D^1_X, \ldots, D^N_X\}$ in a similar way.
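As a rough illustration of two of the loss terms named above, the following sketch computes a cycle-consistency loss and a total variation loss on tiny grayscale images stored as nested lists. The function names, the L1 form of the cycle-consistency term, and the anisotropic TV variant are assumptions made for illustration; the text does not give the exact formulas or the loss weights.

```python
def l1_distance(img_a, img_b):
    """Mean absolute difference between two equally sized 2-D images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def cycle_consistency_loss(original, reconstructed):
    """Assumed L1 form: distance between an image and its reconstruction."""
    return l1_distance(original, reconstructed)


def total_variation_loss(img):
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    height, width = len(img), len(img[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:            # vertical neighbour
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < width:             # horizontal neighbour
                tv += abs(img[i][j + 1] - img[i][j])
    return tv


flat = [[0.5, 0.5], [0.5, 0.5]]     # perfectly smooth image
noisy = [[0.0, 1.0], [1.0, 0.0]]    # checkerboard, maximally non-smooth
print(total_variation_loss(flat))            # 0.0
print(total_variation_loss(noisy))           # 4.0
print(cycle_consistency_loss(flat, noisy))   # 0.5
```

A smooth image gives a TV value of zero, so minimizing this term pushes the generator towards outputs with small differences between adjacent pixels.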

Generators and Discriminators with CBAM Modules
We start by presenting the generators and discriminators with CBAM modules. CBAM enables SID-AM-MSITM to focus on critical features in a given image, so as to improve the quality of generated images [27].
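To make the attention mechanism concrete, the sketch below applies a CBAM-style block to a small C × H × W feature stored as nested lists: channel attention first (global average and max pooling, a shared two-layer MLP, a sigmoid), then spatial attention (channel-wise average and max maps, a convolution, a sigmoid). All weights are hypothetical placeholders rather than learned values, and a 3 × 3 kernel is used for brevity, whereas CBAM is commonly described with a 7 × 7 kernel. How the block is wired into the generators and discriminators of SID-AM-MSITM is not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer MLP with ReLU, shared by the avg- and max-pooled descriptors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """One weight in (0, 1) per channel, from global avg and max pooling."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    a, m = shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(feat, kernel):
    """One weight in (0, 1) per pixel, from channel-wise avg and max maps."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(feat[k][i][j] for k in range(c)) for j in range(w)]
          for i in range(h)]
    pad = len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for m_idx, pooled in enumerate((avg, mx)):    # two input maps
                for di in range(-pad, pad + 1):
                    for dj in range(-pad, pad + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:   # zero padding
                            acc += kernel[m_idx][di + pad][dj + pad] * pooled[ii][jj]
            out[i][j] = sigmoid(acc)
    return out


def cbam(feat, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to a C x H x W feature."""
    ca = channel_attention(feat, w1, w2)
    feat = [[[ca[k] * v for v in row] for row in ch] for k, ch in enumerate(feat)]
    sa = spatial_attention(feat, kernel)
    return [[[sa[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in feat]


# Toy input: 2 channels of size 2 x 2, with hypothetical weights.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.5], [0.5, 1.0]]]
w1 = [[0.5, -0.25]]            # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]           # 1 hidden unit -> 2 channels
kernel = [[[0.1] * 3 for _ in range(3)], [[0.05] * 3 for _ in range(3)]]
out = cbam(feat, w1, w2, kernel)
print(len(out), len(out[0]), len(out[0][0]))   # 2 2 2
```

Because both attention maps lie in (0, 1), the block only rescales the input feature: the shape is preserved, and channels and positions with larger weights are attenuated less.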
As illustrated in Figure 2, each CBAM contains two modules, a channel attention module [28] and a spatial attention module. Specifically, the channel attention module enables SID-AM-MSITM to focus on critical features in a given image, so as to obtain accurate weights of channel features. The spatial attention module performs max pooling and average pooling on the spatial dimension of the compressed features.
The yellow blocks represent the input features of the source domain, and the blue blocks represent the input features of the target domain. $G^n_{XY}$ and $G^n_{YX}$ share the same architecture but have different weights. The working principle of $G^n_{XY}$ is as follows, where $0 \le n < N$ and $\otimes$ represents pixel-level multiplication. At the lowest scale $N$, $I^{n+1}_{XY}$ is replaced with an image with pixel values of 0. First, SID-AM-MSITM uses the encoder $\varphi$ to preprocess $I^n_X$ into $I^n_{XY,\varphi}$. Then, SID-AM-MSITM uses the encoder $A^n$ to generate the mask $X^n$. Finally, SID-AM-MSITM uses a linear combination to obtain the output $I^n_{XY}$.
Similarly, the translation $I_Y \to I_{YX}$ at scale $n$ is implemented in the same way with the roles of the two domains exchanged, where $0 \le n < N$. At the lowest scale $N$, $I^{n+1}_{YX}$ is also replaced with an image with pixel values of 0.
In this way, the generators focus on the regions that synthesize the details of the current scale in the images. Meanwhile, the previously learned global structures remain unaffected.
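The three steps of one generator scale can be sketched as follows. The gated form $I^n_{XY} = X^n \otimes I^n_{XY,\varphi} + (1 - X^n) \otimes I^{n+1\uparrow}_{XY}$ used here is an assumption that follows the attention-guided blending of TuiGAN-style generators; the text itself only states that the output is a linear combination of the mask, the preprocessed image, and the upsampled previous-scale image. The functions `phi` and `attention` are hypothetical stand-ins for the learned CBAM-and-convolution sub-networks.

```python
def phi(img):
    """Hypothetical stand-in for the learned encoder: a fixed brightness shift."""
    return [[min(1.0, p + 0.2) for p in row] for row in img]


def attention(img_phi, img, prev_up):
    """Hypothetical stand-in for the mask network; returns values in [0, 1].

    The real model concatenates the three inputs along the channel axis and
    applies CBAM plus convolutions. This toy version ignores `img` and makes
    the mask larger where the preprocessed image differs more from the
    upsampled previous-scale output.
    """
    return [[min(1.0, abs(a - c)) for a, c in zip(ra, rc)]
            for ra, rc in zip(img_phi, prev_up)]


def generator_step(img, prev_up):
    """One scale of translation: mask-weighted blend of two candidate images."""
    img_phi = phi(img)                          # step 1: preprocess
    mask = attention(img_phi, img, prev_up)     # step 2: attention mask
    return [[m * a + (1.0 - m) * c              # step 3: gated combination
             for m, a, c in zip(rm, ra, rc)]
            for rm, ra, rc in zip(mask, img_phi, prev_up)]


source = [[0.1, 0.4], [0.6, 0.9]]
zeros = [[0.0, 0.0], [0.0, 0.0]]    # lowest scale: previous output is all zeros
coarse = generator_step(source, zeros)
print(coarse)
```

With a mask in [0, 1], every output pixel lies between the preprocessed pixel and the previous-scale pixel, which is one way to read the statement that the generator adds current-scale details while leaving the coarser structure in place.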

Style-Independent Discriminators
Next, we present our proposed style-independent discriminators, which focus on the content information of images rather than their style information, so that SID-AM-MSITM avoids losing the content details of non-underwater images.
When two sets of images share the same content but have different styles, the discriminators should ideally give similar discriminant results for them. Thus, SID-AM-MSITM uses instance-level as well as vector-level style difference losses to train style-independent discriminators.
Figure 4 illustrates style and content differences. As illustrated in the figure, the first two generated images share the same content information, while the latter two share the same style information. $s_x$ and $s_y$ represent the style information of images $x$ and $y$, respectively. Next, we describe how style-independent discriminators eliminate style differences at the instance level and at the vector level. Instance-level style difference refers to the style difference between an image obtained by stylizing the pixels of the original image and the original image itself. Vector-level style difference refers to the style difference between an image obtained by stylizing the encoded vectors of the original image and the original image itself.

Instance-Level Style-Independent Discriminators
Instance-level style-independent discriminators use a special regularization term to reduce the style differences between the images obtained by stylizing pixels and the original images.
We first adjust the weight $\alpha$ to gradually increase the proportion of generated images in the mixed images across scales. Then, we train the discriminators to reduce the differences between their results for the original non-underwater or underwater images and for the final mixed images, which is enforced by a consistency loss. In this formulation, the instance-level mixed-style images $\hat{I}^n_X$ and $\hat{I}^n_Y$ are linear combinations, with weight $\alpha$, of the original images and the generated images at scale $n$, where $0 \le n \le N$ and $0 \le \alpha < 1$. $I^n_X$ and $I^n_Y$ represent the source-domain and target-domain images at the current scale, respectively. $I^n_{XY}$ and $I^n_{YX}$ denote the generated images at the current scale. $\alpha$ is the weight of the linear combination and gradually becomes smaller as the scale rises. $L_{con}$ applies an $L_1$ norm to compute the instance-level style-independent loss, and $D(\cdot)$ represents the discriminant results of the discriminators.
As the scale rises, the style of the instance-level mixed-style images $\hat{I}^n_X$ moves closer to that of the target-domain images and away from that of the source-domain images. The style of the instance-level mixed-style images $\hat{I}^n_Y$ moves closer to that of the source-domain images and away from that of the target-domain images. The discriminators penalize the distances between their outputs for the source-domain or target-domain images and their outputs for the mixed-style images.
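The instance-level mixing and the consistency loss described above can be sketched as follows. The exact combination $\hat{I}^n_X = \alpha I^n_X + (1 - \alpha) I^n_{XY}$, the linear decay of $\alpha$, and the value `alpha_max = 0.9` are assumptions chosen to match the description (the share of the generated image grows as the scale rises); `toy_discriminator` is a hypothetical stand-in for a trained discriminator.

```python
def mix_images(original, generated, alpha):
    """Pixel-wise linear combination: alpha * original + (1 - alpha) * generated."""
    return [[alpha * o + (1.0 - alpha) * g for o, g in zip(ro, rg)]
            for ro, rg in zip(original, generated)]


def toy_discriminator(img):
    """Hypothetical stand-in for D: one score per row (the row mean)."""
    return [sum(row) / len(row) for row in img]


def consistency_loss(disc, original, mixed):
    """L1 distance between discriminator outputs for original and mixed images."""
    out_o, out_m = disc(original), disc(mixed)
    return sum(abs(a - b) for a, b in zip(out_o, out_m)) / len(out_o)


def alpha_schedule(n_max, alpha_max=0.9):
    """Assumed linear schedule indexed by scale n (n = n_max is the lowest scale).

    alpha is largest at the lowest scale and shrinks to 0 at the highest
    scale n = 0, so that 0 <= alpha < 1 always holds.
    """
    return [alpha_max * n / n_max for n in range(n_max + 1)]


source = [[0.2, 0.4], [0.6, 0.8]]       # I_X at some scale
generated = [[0.0, 0.2], [0.4, 0.6]]    # I_XY at the same scale
alphas = alpha_schedule(5)
losses = [consistency_loss(toy_discriminator, source,
                           mix_images(source, generated, a)) for a in alphas]
for n, (a, loss) in enumerate(zip(alphas, losses)):
    print(f"scale n={n}: alpha={a:.2f}, L_con={loss:.3f}")
```

In training, this loss would be minimized with respect to the discriminator parameters, so that the discriminator responds similarly to an image and to its restyled mixture; the fixed toy discriminator here only shows how the quantity is computed.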
Figure 6 shows the training process of the instance-level style-independent discriminators $D_Y$. ↑ and ↓ indicate that a value rises or descends as the scale rises, respectively.

Vector-Level Style-Independent Discriminators
The above instance-level discriminators alone are not sufficient for image generation, since they are limited to style independence at the pixel level. Therefore, we devise vector-level style-independent discriminators that further mix the encoded vectors of the source-domain and target-domain images at each scale. We feed the mixed encoded vectors into a decoder and use its generated images together with the source-domain or target-domain images for model training.
First, we encode the source-domain images and the target-domain images using VGG 19 [29] and then process them using AdaIN [30]. The results are linearly combined with the encoded vectors of the source-domain or target-domain images. Then, we feed the results of the linear combinations into a decoder to obtain the vector-level mixed-style images at scale $n$, one for each domain. The decoder is a convolutional network that is symmetric to VGG 19 and upsamples the mixed encoded vectors into images. Finally, we use the discriminators to penalize the distances between their outputs for the source-domain or target-domain images and their outputs for the mixed-style images. In this formulation, $0 \le n \le N$ and $0 \le \alpha < 1$. $I^n_X$ and $I^n_Y$ represent the source-domain and target-domain images at the current scale, respectively. Encoder($\cdot$) is VGG 19 and Decoder($\cdot$) is symmetric to VGG 19. $\alpha$ is the weight coefficient, which becomes smaller as the scale rises. $L_{con}$ applies an $L_1$ norm to compute the vector-level style-independent loss. $D(\cdot)$ represents the discriminant results of the discriminators.
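The feature-space mixing can be illustrated with the standard AdaIN operation, which shifts and scales each channel of the content features so that its mean and standard deviation match those of the style features. In the sketch below, the VGG 19 encoder and the symmetric decoder are omitted and replaced by small hand-made feature maps, and the mixing direction (weight $\alpha$ on the original features, $1 - \alpha$ on the AdaIN output) is an assumption; only `adain` follows a formula that is well established.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Mean and (eps-stabilised) standard deviation of one H x W channel."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content_feat, style_feat):
    """AdaIN: align per-channel mean/std of content features to the style features."""
    out = []
    for c_ch, s_ch in zip(content_feat, style_feat):
        c_mean, c_std = channel_stats(c_ch)
        s_mean, s_std = channel_stats(s_ch)
        out.append([[s_std * (v - c_mean) / c_std + s_mean for v in row]
                    for row in c_ch])
    return out


def mix_features(feat, stylized, alpha):
    """Assumed mixing: alpha * original features + (1 - alpha) * stylized features."""
    return [[[alpha * f + (1.0 - alpha) * s for f, s in zip(rf, rs)]
             for rf, rs in zip(cf, cs)] for cf, cs in zip(feat, stylized)]


# Hand-made single-channel "encoded" features standing in for VGG 19 outputs.
content_feat = [[[0.0, 1.0], [2.0, 3.0]]]
style_feat = [[[10.0, 10.0], [20.0, 20.0]]]

stylized = adain(content_feat, style_feat)
mixed = mix_features(content_feat, stylized, alpha=0.5)
out_mean, out_std = channel_stats(stylized[0])
print(round(out_mean, 3), round(out_std, 3))
```

After AdaIN, the content features keep their spatial pattern but carry the channel statistics of the style features; the mixed features would then be passed to the decoder to produce the vector-level mixed-style image.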
Figure 7 shows the training process of the vector-level style-independent discriminators. ↑ and ↓ indicate that a value rises or descends as the scale rises, respectively.

Implementation
To implement SID-AM-MSITM, we use Adam [31] as its optimizer and LeakyReLU [32] as its activation function. The model uses 6 scales. At the lowest scale, the images are 100 × 100 pixels, and at the highest scale, they are 250 × 250 pixels.
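The text gives only the smallest and largest image sizes and the number of scales. As a sketch, the intermediate sizes can be filled in with a constant ratio between neighbouring scales; this geometric spacing and the function name `scale_sizes` are assumptions, not the reported configuration.

```python
def scale_sizes(min_size=100, max_size=250, num_scales=6):
    """Side lengths per scale, assuming a constant ratio between neighbours.

    Index 0 is the highest (finest) scale and index num_scales - 1 is the
    lowest (coarsest) scale, matching the indexing used in the text.
    """
    ratio = (max_size / min_size) ** (1.0 / (num_scales - 1))
    sizes = [round(min_size * ratio ** k) for k in range(num_scales)]
    return sizes[::-1]


sizes = scale_sizes()
for n, side in enumerate(sizes):
    print(f"scale n={n}: {side} x {side} pixels")
```

Under this assumption the ratio between neighbouring scales is about 1.2, so each step up adds a moderate amount of resolution on top of the structure learned at the coarser scale.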

Experiment and Result Analysis
In this section, we present our experiments and the corresponding analysis of the results. We first introduce the evaluation metrics and then present the ablation and comparative experiments, respectively.
(1) PSNR: Peak signal-to-noise ratio (PSNR) measures the pixel-wise difference between two images. We use PSNR to compare the source-domain images with the reconstructed images. A larger PSNR value indicates a smaller difference between the two images.
(2) SSIM: The structural similarity index (SSIM) measures the similarity of two images. The value of SSIM is between 0 and 1, and a larger SSIM indicates a better reconstruction effect, which reflects the translation quality of an image translation model.
(3) Entropy: Information entropy measures the complexity of an image. Larger information entropy indicates a more complex image that contains more information.
(4) SIFID: Single Image Fréchet Inception Distance (SIFID) is a special type of Fréchet Inception Distance (FID) [37]. It measures the deviation between the feature distributions of two single images, and a smaller SIFID indicates better generated images.
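Two of the metrics above are simple enough to sketch directly. The code below computes PSNR from the mean squared error and the Shannon entropy of the grey-level histogram for small integer images stored as nested lists. The peak value of 255 and the use of a grey-level histogram are assumptions about the evaluation setup, which the text does not spell out; SSIM and SIFID are omitted because they need windowed statistics and Inception features, respectively.

```python
import math
from collections import Counter


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    diffs = [(a - b) ** 2 for ra, rb in zip(img_a, img_b) for a, b in zip(ra, rb)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def entropy(img):
    """Shannon entropy (bits) of the grey-level histogram of an integer image."""
    pixels = [p for row in img for p in row]
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


original = [[0, 255], [255, 0]]
reconstructed = [[0, 255], [255, 10]]     # one pixel off by 10 grey levels
print(f"PSNR: {psnr(original, reconstructed):.2f} dB")

two_levels = [[0, 0], [255, 255]]         # two equally likely grey levels
print(f"Entropy: {entropy(two_levels):.1f} bits")   # 1.0 bits
```

A reconstruction closer to the original lowers the mean squared error and therefore raises PSNR, while an image that uses more grey levels more evenly has higher entropy.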

Ablation Experiment
We use five different datasets, including submarine, underwater optics, sunken ship, crashed plane, and underwater sonar datasets. The underwater optics images are from the URPC2020 dataset [38]. These images are difficult to collect and are not large in number. Specifically, the submarine, sunken ship, and crashed plane images are content categories, and the underwater sonar and underwater optics images are style categories. Meanwhile, the submarine, sunken ship, and crashed plane images are non-underwater images, and the underwater sonar and underwater optics images are underwater ones.
Figure 8 presents the utilized datasets, which are grouped into six combinations: (1) sunken ships with underwater sonars, (2) sunken ships with underwater optics, (3) crashed planes with underwater sonars, (4) crashed planes with underwater optics, (5) submarines with underwater sonars, and (6) submarines with underwater optics.
Table 1 presents the PSNR results of each model. The PSNR values of the images reconstructed using TuiGAN with CBAM modules are higher than those of TuiGAN on three combinations of the datasets, and the maximum difference reaches 8.63. The PSNR values of the images reconstructed using TuiGAN with style-independent discriminators are higher than those of TuiGAN on five combinations of the datasets, and the maximum difference reaches 4.7. The PSNR values of the images reconstructed using SID-AM-MSITM are not less than those of TuiGAN on all combinations of the datasets, and the maximum difference reaches 4.58. These promising results indicate that the combination of CBAM modules and style-independent discriminators significantly improves the effective information acquisition ability of the backbone TuiGAN and is suitable for all combinations of the datasets.
Table 2 presents the SSIM results of each model. SSIM is also a metric for the reconstruction quality of a model. The SSIM values of the images reconstructed using TuiGAN with CBAM modules are higher than those of TuiGAN on four combinations of the datasets, and the maximum difference reaches 0.17. The SSIM values of the images reconstructed using TuiGAN with style-independent discriminators are not less than those of TuiGAN on all combinations of the datasets, and the maximum difference reaches 0.09. The SSIM values of the images reconstructed using SID-AM-MSITM are higher than those of TuiGAN on all combinations of the datasets, and the maximum difference reaches 0.15. These promising results also indicate that CBAM modules and style-independent discriminators improve the ability to access effective information.
Table 3 presents the Entropy results of each model. The Entropy value of the images generated using TuiGAN with CBAM modules is higher than that of TuiGAN on only one combination of the datasets, and the difference reaches 0.21. The Entropy values of the images generated using TuiGAN with style-independent discriminators are higher than those of TuiGAN on four combinations of the datasets, and the maximum difference reaches 0.86. The Entropy values of the images reconstructed using SID-AM-MSITM are higher than those of TuiGAN on all combinations of the datasets, and the maximum difference reaches 0.78. These promising results indicate that style-independent discriminators improve the diversity of generated images. The style-independent discriminators improve TuiGAN's ability to retain content details and are suitable for all combinations of the datasets.
In summary, based on the above ablation results, our proposed SID-AM-MSITM achieves promising underwater image translation performance in terms of PSNR, SSIM, and Entropy. The ablation experiments indicate that CBAM modules enhance the feature extraction ability of the network and thereby its ability to access effective information. Moreover, the results show that style-independent discriminators improve the diversity of the generated images without weakening the reconstruction performance, which indicates that SID-AM-MSITM retains the content details of non-underwater images.

Comparative Experiment
The above ablation experiments demonstrate the overall effect of SID-AM-MSITM. In the following, we further compare it with multiple advanced image translation models, including CycleGAN [39], FUNIT [40], AdaIN [30], and SinDiffusion [41]. These models are selected as baselines since they present promising performance and cover general image translation models (CycleGAN, FUNIT, and AdaIN) as well as the emerging SinDiffusion.
(1) CycleGAN: CycleGAN is one of the most typical translation models using cycle consistency. The model assumes a potential correspondence between source-domain and target-domain images.
(2) FUNIT: FUNIT is an unsupervised few-shot image translation model that achieves satisfactory performance based on limited data.
(3) AdaIN: AdaIN is an image translation model that achieves real-time and arbitrary style transfer.
(4) SinDiffusion: SinDiffusion is a diffusion model that works on a single natural image.
Figure 15 presents the comparison between the images translated using SID-AM-MSITM and those translated using the baseline models. A visual comparison shows that SID-AM-MSITM has learned the style of the target-domain images and retains the content of the source-domain images. Moreover, the images translated using SID-AM-MSITM show little difference between adjacent pixels as well as excellent smoothness, whereas the images translated using CycleGAN show obvious differences between adjacent pixels after magnification. Compared with FUNIT and AdaIN, SID-AM-MSITM retains the content of the source-domain images and learns better texture and color information from the target domain.
Next, we use SIFID to quantitatively compare SID-AM-MSITM with these baselines. Table 4 shows the SIFID results. SID-AM-MSITM achieves the best (smallest) SIFID values. For example, the SIFID values of the images translated using SID-AM-MSITM are roughly $0.02 \times 10^{-2}$ to $0.058 \times 10^{-2}$ smaller than those of CycleGAN, $17.408 \times 10^{-2}$ to $17.55 \times 10^{-2}$ smaller than those of FUNIT, $9.31 \times 10^{-2}$ to $9.418 \times 10^{-2}$ smaller than those of AdaIN, $0.002 \times 10^{-2}$ to $0.018 \times 10^{-2}$ smaller than those of TuiGAN, and $1.118 \times 10^{-2}$ to $1.25 \times 10^{-2}$ smaller than those of SinDiffusion. This demonstrates that the images translated using SID-AM-MSITM are closer to the source-domain images and retain more content information than those of the other models.
In summary, SID-AM-MSITM is superior to multiple baseline models in improving the ability to access effective information and avoiding the loss of content details.

Conclusions
In this work, we propose a novel multi-scale image translation model with attention modules and style-independent discriminators (SID-AM-MSITM) to complete the underwater image translation task. We use a multi-scale generative adversarial network, TuiGAN, to construct a backbone architecture, which translates images from low scales to high scales. We introduce CBAM modules into the generators and discriminators at multiple scales and devise style-independent discriminators to improve the generative and discriminant effects. Based on systematic ablation and comparative experiments, we demonstrate that SID-AM-MSITM is able to acquire effective information and retain the content details of non-underwater images during the underwater image translation process, and that it requires only two unpaired images to complete the image translation.
However, some problems remain in the current research. In the style-independent discriminators, SID-AM-MSITM uses a weight between 0 and 1 in the linear combination to achieve the translation from the source domain to the target domain. We will continue to study whether there is a more appropriate interval for training style-independent discriminators. In addition, we only select several source-domain and target-domain images as the dataset, which has certain limitations. To measure the performance of the model comprehensively, we will use other underwater target images, such as those in the UIEB database [42], to verify the versatility of SID-AM-MSITM.

Figure 1. The architecture of SID-AM-MSITM (the process of translating a non-underwater image into an underwater image). As in TuiGAN, SID-AM-MSITM downsamples the two images into different scales $\{I^0_X, I^1_X, \ldots, I^N_X\}$ and $\{I^0_Y, I^1_Y, \ldots, I^N_Y\}$, and each scale corresponds to two generators $\{G^n_{XY}, G^n_{YX}\}$ and two discriminators $\{D^n_Y, D^n_X\}$. The generators $\{G^0_{XY}, G^1_{XY}, \ldots, G^{N-1}_{XY}\}$ use the downsampled source-domain images $\{I^0_X, I^1_X, \ldots, I^{N-1}_X\}$ as well as the upsampled images generated at the previous scale $\{I^{1\uparrow}_{XY}, I^{2\uparrow}_{XY}, \ldots, I^{N\uparrow}_{XY}\}$ to generate new images. At the lowest scale $N$, the upsampled image generated at the previous scale is replaced with an image with pixel values of 0. The discriminators $\{D^0_Y, D^1_Y, \ldots, D^N_Y\}$ learn the distribution of the target domain using a variety of loss functions for model training, including the WGAN-GP adversarial loss [25], cycle-consistency loss, identity loss, and total variation (TV) loss [26]. Among them, the cycle-consistency loss helps SID-AM-MSITM avoid mode collapse, the identity loss helps it align colors and textures, and the TV loss smooths the generated images. Finally, the translated underwater image $I^0_{XY}$ is obtained at the highest scale. In Figure 1, each discriminator processes the images translated by the generator of the same color. To obtain the generators $\{G^0_{YX}, G^1_{YX}, \ldots, G^N_{YX}\}$, we also train these generators and the discriminators $\{D^0_X, D^1_X, \ldots, D^N_X\}$ in a similar way.

Figure 2. The structure of CBAM modules. CBAM represents the Convolution Block Attention Module. Conv represents convolution operations.

Figure 3 shows the generator and discriminator of SID-AM-MSITM, where the generator $G^n_{XY}$ implements the translation of a source-domain image $I^n_X$ to a generated image $I^n_{XY}$. First, SID-AM-MSITM processes $I^n_X$ from the source domain $X$ to obtain an intermediate image $I^n_{XY,\varphi}$ through the CBAM module and convolution operations. Then, SID-AM-MSITM concatenates $I^n_{XY,\varphi}$, $I^n_X$, and the upsampled image generated at the previous scale, $I^{n+1\uparrow}_{XY}$, along the channel direction and obtains a mask image $X^n$ through the CBAM module and convolution operations. Finally, a generated image $I^n_{XY}$ is obtained as the linear combination of $X^n$, $I^n_{XY,\varphi}$, and $I^{n+1\uparrow}_{XY}$. $I^n_{XY}$ and the target-domain image $I^n_Y$ are input into the discriminator $D^n_Y$ to obtain discriminant results for model training, where $0 \le n < N$.

Figure 4. The illustration of style and content differences.

Figure 5 shows a style-independent discriminator $D^n_Y$ of SID-AM-MSITM, which requires a source-domain image $I^n_X$, a generated image at the current scale $I^n_{XY}$, an instance-level mixed-style image $\hat{I}^n_X$, and a vector-level mixed-style image to make the discriminator $D^n_Y$ style-independent.

Figure 6. The training process of instance-level style-independent discriminators.

Figure 7. The training process of vector-level style-independent discriminators.

Figure 14. Generated and reconstructed images using the submarines and the underwater optics.

Figure 15. The comparison results of the images generated using SID-AM-MSITM and other models.