Medium Transmission Map Matters for Learning to Restore Real-World Underwater Images

Underwater visual perception is essential for underwater exploration, archaeology, ecosystem research, and so on. Low illumination, light reflection, scattering, absorption, and suspended particles inevitably degrade underwater image quality severely, which makes it very challenging to recognize objects in underwater images. Existing underwater enhancement methods, which aim to improve underwater visibility, suffer heavily from poor image restoration performance and weak generalization ability. To reduce the difficulty of underwater image enhancement, we introduce the medium transmission map as guidance to assist in image enhancement. We formulate the interaction between the underwater image and the transmission map to obtain better enhancement results. Even with a simple and lightweight network configuration, the proposed method achieves 22.6 dB on the challenging Test-R90 set while running 30 times faster than existing models. Comprehensive experimental results demonstrate its superiority and its potential for underwater perception. Code is available at: https://github.com/GroupG-yk/MTUR-Net.


INTRODUCTION
With the development of science and technology, underwater research activities are increasing, such as underwater object detection and tracking [1], underwater robots [2], and underwater monitoring [3]. However, light reflection, scattering, absorption, and suspended particles inevitably result in poor visibility and inhomogeneous illumination in the collected underwater images. In detail, light is absorbed and scattered by suspended particles in the underwater setting, resulting in hazy effects in the images captured by cameras. Water also attenuates light as a function of its salinity, the light wavelength, and depth; red light is attenuated more because of its longer wavelength. In addition, light intensity decreases as water depth increases. Such properties reduce underwater visibility and hamper the applicability of computer vision methods. Early single-image underwater restoration work used traditional physical methods to directly change the pixel values of the image [2] [4]. However, these methods have limited capability when faced with diverse underwater environments. Recently, driven by the release of a series of paired training sets including [5] [6] [7], models based on deep convolutional neural networks (CNNs) have been proposed that learn the mapping between underwater images and restored images. Representative methods include WaterGAN by Li et al. [8] and Ucolor by Li et al. [9], which consider restoration in multi-channel spaces and obtain better results than the traditional physical designs, as shown in Fig. 5. However, the quality improvement is limited because other factors, such as distance-dependent attenuation and scattering, are ignored. Considering the underwater imaging process, these factors can be taken into account by utilizing the semantics contained in the medium transmission map [9], as in the design proposed in this paper. As the results in Fig. 5(e) show, the medium transmission map yields more visually pleasing results in terms of color, contrast, and naturalness.
In this work, our goal is to eliminate the influence of light scattering and attenuation on underwater images in real time to support intelligent underwater perception systems. Inspired by the depth-guided deraining model by Hu et al. [10], we introduce the medium transmission (MT) map and formulate an MT-guided restoration framework. Specifically, a multi-task learning network is designed to generate both the MT and restoration outputs jointly. A multi-level (including both feature level and output level) knowledge interaction mechanism is proposed for better mining the guidance from the MT learning space. Furthermore, to minimize the computational burden caused by the MT learning branch, parameters in some specific stages are shared across these two related tasks, thus enabling real-time processing of underwater images.
In summary, this work makes the following contributions:
• We re-examine how to better use the medium transmission map. We can get good results by relying on the RGB map alone using various preprocessing and color embedding, proving that the MT map is of great significance for learning a more powerful real-world underwater image restoration network.
• A multi-task learning framework is formulated for leveraging the MT map, and a novel multi-level knowledge interaction mechanism is proposed for better mining the guidance from the MT learning space.
• A comparative study on two real-world benchmarks demonstrates the superiority of our MTUR-Net over the state of the art in terms of both restoration quality and inference speed.
The rest of our paper is organized as follows. Section 2 briefly introduces the existing underwater image enhancement methods. Section 3 presents the proposed underwater enhancement algorithm. The experimental results are reported in Section 4, followed by the conclusion in Section 5.

RELATED WORK
Physical prior based methods. Early methods improved visual quality by directly adjusting pixel values, and physical model-based methods were soon widely adopted. They have obtained impressive results but still have shortcomings: most of them are slow and sensitive to different kinds of underwater images.
Recently, with the development of artificial intelligence, deep learning-based methods have achieved remarkable results. Underwater image enhancement frameworks are mostly based on convolutional neural networks (CNNs) or generative adversarial networks (GANs). For example, Li et al. [5] proposed a simple CNN model named Water-Net that uses gated fusion. Li et al. [11] proposed UWCNN, which is based on underwater scene priors. Li et al. [9] proposed an underwater image enhancement network that embeds multiple color spaces under medium transmission guidance.
J. Li et al. [8] used GANs and image formation models for supervised learning. To avoid requiring paired training data, a weakly supervised underwater color correction network (UCycleGAN) was proposed in [12]. A multiscale dense GAN for powerful underwater image enhancement was described in [13].
Like the research above, these underwater image enhancement models often overlook the most important point: focusing on the real underwater environment and serving data captured under real conditions. For instance, [12] uses the CycleGAN [14] network structure directly, and a simple multi-scale convolutional network is used in [5]. In [11], selecting the appropriate UWCNN model for a given input underwater image is challenging. The method in [9] is still not fully effective under real underwater conditions.
In contrast to the above, our method has the following characteristics: (1) we learn deep-guided non-local features and regress a residual mapping to produce a clear output image; (2) our method adopts end-to-end training and is adaptable and convenient for most underwater scenes; and (3) our method achieves strong performance on real underwater image datasets, better than recent state-of-the-art methods.

METHODOLOGY
Fig. 2 shows the overall architecture of our medium transmission map guided underwater image restoration network (MTUR-Net). The network takes underwater images as input and predicts the corresponding MT map and the enhanced underwater images as output in an end-to-end manner. In general, the network first uses a CNN with shared weights to extract semantics and generate feature maps. Two decoding branches then follow: (i) the MT prediction subnet, which uses an encoder-decoder network to regress a medium transmission map from the input; and (ii) the underwater image enhancement subnet, which, guided by the predicted MT map, predicts the enhanced image from the input underwater image.

MT Prediction Subnet
We review the haze removal method based on the dark channel prior [15], which is widely used in harsh visual scenarios such as fog, dust, and underwater scenes [16] [17] [18]. The image formation model can be expressed as [19]:
$$I(x) = J(x)\,T(x) + A\,\bigl(1 - T(x)\bigr)$$
This equation is defined on the three RGB color channels. I represents the observed image, A is the airlight color vector, J is the surface radiance vector at the intersection of the scene and the real-world light ray corresponding to the pixel x = (x, y), and T(x) is the transmission along that ray. Y.-T. Peng et al. [20] proposed a new Dark Channel Prior (DCP) algorithm that can effectively estimate ambient light and is suitable for enhancing foggy, hazy, sandstorm, and underwater images. Inspired by DCP and by the wide applicability of the transmission T(x), we use the medium transmission (MT) map T as our attention map. Its effectiveness is demonstrated in the ablation experiments. As noted in [20], a real input underwater image does not have a corresponding ground-truth medium transmission map, so it is difficult to train a deep neural network to estimate it. Therefore, the medium transmission map is estimated as:
$$\tilde{T}(x) = \max_{c,\; y \in \Omega(x)} \left( \frac{A^{c} - I^{c}(y)}{\max\left(A^{c},\, 1 - A^{c}\right)} \right)$$
where $\tilde{T}$ is the estimated medium transmission map, Ω(x) is a local patch centered at x, and c is the RGB channel, so that $A^{c}$ and $I^{c}$ denote channel c of A and I. The schematic diagram of the proposed module using the MT map is shown in Fig. 3. We use the MT map as a feature selector to weigh the importance of different spatial locations of the features. More weight is assigned to high-quality pixels (pixels with larger MT values); F and O denote the output and input features of this module, respectively. In detail, the MT prediction subnet uses four blocks to extract features. Each block has a convolution operation, a group normalization [21], and a scaled exponential linear unit (SELU) [22]. It then uses lateral connections to influence the detailed information decoded in the underwater feature map. Finally, another convolution operation followed by a sigmoid function regresses T under supervision (the input MT map in the training data).
Fig. 2. Schematic diagram of MTUR-Net. It consists of an encoder-decoder network for predicting the MT map (green), a set of dilated residual blocks (yellow) to generate local features, a convolutional layer (purple) for processing MT features before fusion, and a convolutional layer (blue) to upsample the feature map and generate the enhanced underwater images. ⊕ denotes pixel-wise addition.
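The patch-wise MT estimate and the MT-based feature weighting can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the estimate takes the generalized dark channel form, i.e. the maximum of (A^c − I^c(y)) / max(A^c, 1 − A^c) over the channels c and the patch Ω(x); it assumes the ambient light A is already known; and it assumes a residual weighting F = O ⊕ (O ⊗ T), which is only one plausible reading of the feature selector. The function names, the 15 × 15 default patch, and the edge padding are illustrative choices.

```python
import numpy as np


def estimate_medium_transmission(image, ambient, patch=15):
    """Estimate an MT map from an RGB image (hypothetical helper).

    image:   float array of shape (H, W, 3) with values in [0, 1].
    ambient: length-3 ambient light A, one value per RGB channel (assumed known).
    patch:   odd side length of the square local patch Omega(x) (assumed value).
    """
    image = np.asarray(image, dtype=np.float64)
    ambient = np.asarray(ambient, dtype=np.float64)
    # Per-pixel, per-channel term (A^c - I^c(y)) / max(A^c, 1 - A^c).
    ratio = (ambient - image) / np.maximum(ambient, 1.0 - ambient)
    # Maximum over the colour channels c.
    per_pixel = ratio.max(axis=2)
    # Maximum over the local patch Omega(x); borders use edge padding.
    r = patch // 2
    padded = np.pad(per_pixel, r, mode="edge")
    h, w = per_pixel.shape
    t = np.full((h, w), -np.inf)
    for dy in range(patch):
        for dx in range(patch):
            t = np.maximum(t, padded[dy:dy + h, dx:dx + w])
    # Keep the estimate in the valid transmission range [0, 1].
    return np.clip(t, 0.0, 1.0)


def mt_guided_features(features, t_map):
    """Assumed residual weighting F = O + O * T.

    features: array O of shape (C, H, W).
    t_map:    MT map T of shape (H, W), broadcast over the channels.
    """
    features = np.asarray(features, dtype=np.float64)
    return features + features * np.asarray(t_map)[None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))
    t = estimate_medium_transmission(img, ambient=[0.8, 0.9, 0.9])
    feats = rng.random((16, 32, 32))
    print(t.shape, mt_guided_features(feats, t).shape)
```

With this form, spatial positions where T is close to 1 roughly double the feature response, while positions where T is close to 0 pass the input features through unchanged.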

Underwater Image Enhancement Subnet
In the underwater image enhancement subnet, we first use convolution to reduce the resolution of the feature map, followed by 11 dilated residual blocks (DRBs) [23] to increase the size of the receptive field without reducing the resolution. Each DRB has a 3 × 3 dilated convolution [24], a ReLU nonlinearity, and another 3 × 3 dilated convolution, with a skip connection that adds the input and output feature maps. To avoid gridding issues, we set the dilation rates of these 11 DRBs to 1, 1, 2, 2, 4, 8, 4, 2, 2, 1 according to [25]. Moreover, a laterally connected convolution module adds the MT prediction features to the output feature map. After that, we use convolution to change the feature map to the size of the MT map and concatenate the two. Finally, a convolution operation scales the feature map to the size of the input image.
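To illustrate how dilation enlarges the receptive field without downsampling, the following sketch computes the receptive field of a stack of DRBs. It assumes stride-1 convolutions and that both 3 × 3 convolutions in a block share the block's dilation rate; neither detail is stated in the text. It uses the dilation list exactly as given, which contains ten rates although 11 blocks are mentioned. The function name is hypothetical.

```python
def receptive_field(dilations, kernel_size=3, convs_per_block=2):
    """Receptive field (in pixels, per axis) of stacked stride-1 dilated convs.

    Each convolution with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += convs_per_block * (kernel_size - 1) * d
    return rf


# Dilation rates as listed in the text.
DILATIONS = [1, 1, 2, 2, 4, 8, 4, 2, 2, 1]

print(receptive_field(DILATIONS))                  # 109
print(receptive_field([1] * len(DILATIONS)))       # 41, the same depth without dilation
```

Under these assumptions the dilated stack covers a 109 × 109 window, compared with 41 × 41 for the same number of undilated convolutions, at identical feature-map resolution.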

EXPERIMENTS
In this section, we first describe the parameter settings and then explain the setup of the entire experimental process. We then compare our model with several existing well-performing models and, at the end of this section, provide ablation experiments to study the effective parts of MTUR-Net.

Parameter Settings
To train the network, we chose the real underwater image dataset described in Li et al. [9], which contains 890 pairs of images from [5] and 1250 pairs of images from [11]. We trained our network on a single NVIDIA 3090 Ti GPU with a batch size of 8 and an initial learning rate of 1e-3, using the Adam optimizer.

Experiment Setup
To test the proposed model, we took the remaining 90 pairs of real data in UIEB, denoted Test-R90. For a more comprehensive evaluation, we also tested on the 60 challenging images in UIEB, denoted Test-C60.
To demonstrate the advantages of the proposed model, we compared our method with other state-of-the-art (SOTA) methods, including a physical model-based method and deep learning-based methods. The physical model-based method is Underwater Dark Channel Prior (UDCP) [2], which its authors developed as an extension of their previous work to handle underwater image restoration. The deep learning-based methods are Water-Net [5], a simple CNN model using gated fusion; Ucolor [9], a network that embeds multiple color spaces guided by medium transmission; FUnIE-GAN [26], a fully-convolutional conditional GAN-based model; and UGAN [27], a method based on generative adversarial networks (GANs). To control variables, we used the same training data and loss function as for MTUR-Net.

Comparative Study
In this experiment, we use two kinds of evaluation, visual and quantitative, to compare the effects of our model with those of other models.
Visual Evaluation. In open water, red light has the longest wavelength and is absorbed more than other wavelengths. Therefore, underwater images appear blue or green. To clearly show the effect of MTUR-Net processing, we provide a comparison of the results obtained by the different methods. Fig. 4 shows that the output of MTUR-Net has the best visual quality. Our solution corrects the chromatic aberration caused by different water areas, and the restored images reveal details in dark water and the texture of fish in muddy water.
Quantitative Evaluation.We provide full-reference evaluation and non-reference evaluation to quantitatively analyze the performance of different methods.
We conduct a full-reference evaluation using PSNR and SSIM, and additionally report FPS. Although the real-world environmental situation may differ from the reference image, the results of a full-reference evaluation using the reference image can provide some feedback on the performance of different methods. A higher PSNR means that the result is less distorted, a higher SSIM means that the result is structurally more similar to the reference image, and a higher FPS means that processing is more efficient. Table 1 shows that our method achieves the best PSNR and SSIM, while its FPS is also favorable. We then use UCIQE [28] and UIQM [29] for a non-reference evaluation. In principle, a higher UCIQE score indicates a better balance among the standard deviation of chroma, the contrast of luminance, and the average saturation, and a higher UIQM score indicates a result that is subjectively more visually pleasing. In Table 2, our proposed model obtains one of the best scores in UCIQE and UIQM. However, when we visually compared our images with those of the first-place method, UGAN, we found many small square artifacts in UGAN's images even though its scores were still very high, indicating that these evaluation metrics still need to be improved.
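For reference, PSNR follows directly from the mean squared error between the restored image and its reference. The sketch below uses the standard definition PSNR = 10 log10(MAX² / MSE); the function name and the assumption that images are float arrays scaled to [0, 1] are illustrative, and the paper's exact evaluation code may differ.

```python
import numpy as np


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (1.0 for float images
    in [0, 1], 255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of 20 dB.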
To further verify the effect of MTUR-Net, avoid the influence of the experimenters' subjective judgment on the visual results, and make our method more convincing, we also conducted a user study in addition to the quantitative evaluation. We prepared 420 images derived from the Test-C60 test set, with seven versions of each image (raw, MTUR, FUnIE-GAN, UGAN, Ucolor, UDCP, and WaterNet). We then invited 20 participants and asked them to compare the quality of the images in terms of chromatic aberration, visibility, clarity, etc., and to select the best one without knowing which method produced each image. The results are summarized in Table 3. As shown in the table, MTUR received the best rating for 42 of the 60 images in the Test-C60 test set; in particular, judging from the image features, MTUR generally recovers details in dark environments better.
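One simple way to turn such blind votes into per-method win counts is a plurality vote per image. The text does not state how the votes were aggregated, so the sketch below is only an assumed procedure; the function names are hypothetical and the votes are toy data, not results from the study.

```python
from collections import Counter


def best_method_per_image(votes):
    """Pick the most-voted method for each image.

    votes: dict mapping an image id to the list of method names chosen by
    the participants. Ties go to the method that was voted for first.
    """
    return {
        image_id: Counter(choices).most_common(1)[0][0]
        for image_id, choices in votes.items()
    }


def win_counts(winners):
    """Count how many images each method won."""
    return Counter(winners.values())


if __name__ == "__main__":
    # Toy votes from three participants on two images (illustrative only).
    toy_votes = {
        "img_01": ["MTUR", "MTUR", "UGAN"],
        "img_02": ["Ucolor", "MTUR", "Ucolor"],
    }
    winners = best_method_per_image(toy_votes)
    print(win_counts(winners))
```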

Ablation Study
We performed ablation experiments on Test-R90 to verify the effectiveness of each part of the network. The basic network architecture in the first row of Table 4 removes the entire medium transmission map module, so the network generates the enhanced image directly from the feature map produced by the dilated residual blocks (DRBs) in the underwater image enhancement subnet. The second row removes the skip connection between the two subnetworks. We then ran a comparative test that removes the concatenation and retains only the skip connection. The experimental results show that without the final concatenation operation, performance is greatly reduced. These three experiments show that the MT prediction subnetwork has a profound impact on image enhancement. After that, we reduced the convolution operations after concatenation and found that this also affects performance. In the last two ablation experiments, we tried to concatenate or add all DRBs together through skip connections to strengthen the connection between the shallow and deep layers of the network, but the results showed that this did not perform well.

CONCLUSION
In this paper, to address the pain points of current underwater image enhancement, we demonstrated the value of physical priors, in particular the medium transmission map, for restoring real-world underwater images. By formulating a very simple network that learns the prior and the restoration results jointly, and by encapsulating the knowledge interaction between these two tasks at both the feature and output levels, much better restoration features are learned, leading to much better results. Besides producing the best results on two real-world benchmarks, our model can also process underwater images at real-time speed, making it a potential framework for deployment in intelligent underwater systems.
In the future, we will explore the upper bound of the benefits of the medium transmission map, and continue to explore a better-suited knowledge interaction design for fusing the physical prior.

Fig. 1. Comparison of the results of different methods for processing a real underwater image. The results show that our method corrects the chromatic aberration and enhances the contrast.

Fig. 3. Medium transmission guidance module. The MT map T acts as a feature selector: T weighs the importance of the different spatial positions for F.

Fig. 4. Visual comparison of different images (from Test-R90) enhanced by state-of-the-art methods and our MTUR-Net.

Fig. 5. Visual comparison on Test-C60. Here we can see the difference between our image and the UGAN image. Our result shows no obvious blocky pixel artifacts, and the contrast and color difference of objects are better.

Table 1. Comparison with the state of the art using PSNR and SSIM on the Test-R90 dataset [5].

Table 3. Image quality evaluation results of the images generated by different methods on Test-C60.

Table 4. Component analysis. The basic model is MTUR-Net without the MT-guided non-local module.