A Novel Underwater Image Enhancement Using Optimal Composite Backbone Network

Continuous exploration of the ocean has made underwater image processing an important research field, and plenty of CNN (convolutional neural network)-based underwater image enhancement methods have emerged over time. However, the feature-learning ability of existing CNN-based underwater image enhancement is limited. The networks were designed to be complicated or embed other algorithms for better results, which cannot simultaneously meet the requirements of suitable underwater image enhancement effects and real-time performance. Although the composite backbone network (CBNet) was introduced in underwater image enhancement, we proposed OECBNet (optimal underwater image-enhancing composite backbone network) to obtain a better enhancement effect and shorten the running time. Herein, a comprehensive study of different composite architectures in an underwater image enhancement network was carried out by comparing the number of backbones, connection strategies, pruning strategies for composite backbones, and auxiliary losses. Then, a CBNet with optimal performance was obtained. Finally, cross-sectional research of the obtained network with the state-of-the-art underwater enhancement network was performed. The experiments showed that our optimized composite backbone network achieved better-enhanced images than those of existing CNN-based methods.


Introduction
Exploration of the ocean world has attracted more attention in recent years [1]. As a vital part of image processing for revealing and recognition underwater scenes, underwater image enhancement plays an import role in marine resources and marine military fields [2]. Unfortunately, because of differences in illumination conditions in the complicated underwater environments, the captured images are usually of low quality, with characteristics such as having inauthentic color, blurring, and noise, making the enhancement of such images difficult [3]. In order to accomplish this challenging task, many algorithms have been introduced.
In the early stages, the technical breakthroughs in image enhancement often relied on hardware performances [4]. However, advances in image enhancement not only abated the equipment costs but also indicated being more beneficial for accommodating complicated underwater environments. Researchers used mathematical and statistical analyses to modify the channel values of underwater images to obtain images with high visual quality [5]. Sometimes the enhancement effect for underwater images was regarded as an optics problem, and the corresponding algorithms were established on physical models, such as the dark channel prior (DCP) mechanism [6] and the Retinex model [7]. Some researchers regard underwater image enhancement as a bionic problem [8]. More recently, deep learning gradually became the main underwater image enhancement technology due to its strong modeling ability and efficient inference. Many studies have shown that neural networks obtained enhanced images with higher quality compared to images enhanced by traditional methods [9]. The convolutional neural network (CNN) [10] and generative adversarial network (GAN) [11] were proposed for enhancing underwater images. However, these algorithms are unsuitable because they consume time and space, making them inapplicable.
CBNet shows better performance [12] compared to single-chain networks and provides a broad space for improvement in the network architecture [13]. So herein, CBNet was introduced in the enhancement of underwater images. In order to minimize time and space consumption and meet the actual requirement of marine engineering, CNN models without other embedded algorithms were established. To explore the architectural features of CBNet, comprehensive enhancement research using it in different variants was carried out. An Optimal underwater image Enhancement Composite Backbone Network (OECBNet) was finally obtained, which we compared with several state-of-the-art CNN-based methods. Both full-reference indexes and non-reference indexes indicate that the enhancement results of our proposed network are better than recent state-of-the-art CNN-based underwater image enhancement methods. Our proposed network also performs well in real-time image enhancement, with one 350 × 350 underwater image taking only 2.3 × 10 −3 s to enhance.
The strategy of our research is as follows: Section 1 is the introduction to our research. Section 2 describes the related work, including underwater image enhancement methods, underwater image datasets, and CNN backbone networks. Section 3 describes some architectural details of our backbone network. Section 4 makes a comprehensive exploration of the architectural features of CBNet, and the optimal underwater image-enhancing composite backbone network is obtained from this section. Section 5 evaluates our network performance according to cross-sectional research with other methods. Section 6 makes a conclusion of this paper. Contributions in this paper are summarized as shown in the following three points:

•
We introduced a composite backbone into CNN-based underwater image enhancement. The composite backbone networks consist of several uniform end-to-end CNN backbones in specific connection strategies. Composite backbone networks were applied throughout the enhancement networks; • We conducted a comprehensive study about underwater image enhancement using CBNet in different variants by first investigating the impact of backbone number on the enhancement results. Then, the different connection strategies between backbones and some corresponding pruning strategies were introduced in our research. Moreover, the experiments were conducted in auxiliary losses; • We proposed an optimal underwater image-enhancing composite backbone network. The optimized composite backbone network consists of two backbones in full-connected composition (FCC), and two stages are pruned in the lead backbone.
The network accurately expresses objects as well as background colors and is well adapted to the engineering field.

Deep Learning-Based Underwater Image Enhancement
Underwater image enhancement methods based on deep learning are classified into five classes according to their network architecture [14]. One idea for constructing image enhancement networks is inspired by semantic segmentation, such as U-Net [15]. Yan et al. [16] proposed a cascaded U-Net for image restoration and achieved enhanced results with less training and inference costs. Moreover, Liu et al. [17] proposed a GAN with an encoder-decoder-based generator, which enabled multi-scale feature extraction and fusion. Other kinds of widely used networks are constructed with multiple branches. Xue et al. [18] enhanced images by predicting the coarse result, veil map, and compensation map according to a multi-branch network. Li et al. [19] introduced white balance (WB), gamma correction (GC), and histogram equalization (HE)-enhanced images in multiplebranch enhancement and conducted a gated fusion. Similarly, Wu et al. [20] proposed a GAN for enhancement and introduced WB, HE, and DCP enhanced images for multibranch training in the generator. Wang et al. [21] introduced HSV color space for image adjustment; on this basis, Chen et al. [22] conducted an enhancement fusion of RGB, HSV, and Lab color spaces. Some enhancement methods embedded several units for better feature learning in the networks. Li et al. [9] constructed an E-unit for underwater enhancement, and Chen et al. [23] applied the E-unit in a multi-branch network. Further, depth maps or transmission maps are integrated into some networks [24]. However, to implement these networks, special hardware such as deep cameras are required. Based on the excellent performance of GAN, many researchers tend to upgrade their methods by improving the GAN-based algorithm [25] by incorporating multiple generators [26] or multiple discriminators [27] to enhance underwater images.

Underwater Dataset
The dataset is a sensitive part when it comes to deep learning. Researchers use different equipment to try and collect images in a variety of marine environments. Liu et al. proposed a real-world underwater image enhancement dataset (RUIE) [28] with a large number of diverse light-scattering images and rich detection targets. Karen et al. created an underwater object tracking dataset (UOT100) [29] that includes 104 underwater videos for object detection. However, the lack of reference images was always a challenge for these datasets to achieve enhancement. To solve the problem, Anwar conducted research on underwater image synthesis and proposed a method for simulating different underwater scenes [9], which can be applied as full-reference enhancement result evaluations. While retaining the real underwater environment, Li et al. proposed an underwater image enhancement benchmark (UIEB) dataset [19] consisting of a reference subset and a challenging subset. The reference dataset included 890 underwater images as well as corresponding reference images, which partially solved the challenge of the lack of reference images and promoted innovations in underwater image enhancement methods.

Backbone Network
The backbone network plays a significant part in CNN models. Earlier networks were often designed to be as deep as possible [30] and connections as dense as possible [31], but their performance generally degenerated with increasing depth. To improve their quality, researchers added branches in different scales into the backbone networks. ResNet, ResNeXt, and Res2Net were proposed in succession [32]. In addition, Liang et al. proposed a recurrent convolutional neural network (RCNN) [33] which recurrently connected each convolution layer, and the contextual information was efficiently integrated. Meanwhile, Zhang et al. [34] proposed a lightweight backbone network called MobileNet, which greatly reduced the time complexity of the backbone network. In order to expand the sensory field of each pixel, many backbone networks embed multi-scale architectures [35], especially feature pyramid blocks [36]. Another inspiration was from U-Net [15], which introduced downsampling and upsampling in backbone networks [37]. Liu et al. proposed a composite backbone network (CBNet) [12], which brought a breakthrough in object detection. Thereafter, they continued to explore the architecture of CBNet and proposed a general framework with novel connection strategies (CBNetV2) [13]. CBNet has now been widely applied to counter the challenges of object detection and semantic segmentation [38]. Zhao et al. improved adjacent higher-level composition in CBNet, achieving the detection and recognition of images in low-quality underwater videos, which brought CBNet into underwater vision [39]. CBNet has been proven to be excellent for feature extraction and is gradually being applied in underwater image enhancement.

Global Architecture
The global architecture of CBNet in general is shown in Figure 1. K identical backbones in parallel and the Backbone i are compositely connected with backbones from Backbone 1 to Backbone i − 1. Output images from the lead backbone are sent to output layers, and output images are obtained from the output convolution block. In addition, each stage obtains images of the same height and weight as input images for pixel-level enhancement. The channel numbers of output images from the stages are constantly the same. mentation [38]. Zhao et al. improved adjacent higher-level composition in CBNet, achieving the detection and recognition of images in low-quality underwater videos, which brought CBNet into underwater vision [39]. CBNet has been proven to be excellent for feature extraction and is gradually being applied in underwater image enhancement.

Global Architecture
The global architecture of CBNet in general is shown in Figure 1. K identical backbones in parallel and the Backbone i are compositely connected with backbones from Backbone 1 to Backbone i − 1. Output images from the lead backbone are sent to output layers, and output images are obtained from the output convolution block. In addition, each stage obtains images of the same height and weight as input images for pixel-level enhancement. The channel numbers of output images from the stages are constantly the same.

Backbone Architecture
The backbone architecture of our proposed network is as shown in Figure 2, which skillfully ensures that output images are the same height and weight as input images without the shape of input images getting limited. The backbone is empirically constructed with two previous stages, three middle stages, and one last stage in series. Each stage contains a convolutional layer followed by a batch norm layer and a ReLU layer. During the previous stages, the convolutional layers are operated with a 7 × 7 convolution kernel and padding = 3, and the large convolution kernels expand the receptive field for each pixel, thus improving the global image enhancement effect. During the middle stages, the convolutional layers are operated with a 5 × 5 convolution kernel and padding = 2, and the introduction of more middle layers for convolution processing improves the learning ability of the network. During the last stages, the convolutional layers are operated with a 3 × 3 convolution kernel and padding = 1, and last stages are placed before output to accurately enhance the pixelwise detail.

Backbone Architecture
The backbone architecture of our proposed network is as shown in Figure 2, which skillfully ensures that output images are the same height and weight as input images without the shape of input images getting limited. The backbone is empirically constructed with two previous stages, three middle stages, and one last stage in series. Each stage contains a convolutional layer followed by a batch norm layer and a ReLU layer. During the previous stages, the convolutional layers are operated with a 7 × 7 convolution kernel and padding = 3, and the large convolution kernels expand the receptive field for each pixel, thus improving the global image enhancement effect. During the middle stages, the convolutional layers are operated with a 5 × 5 convolution kernel and padding = 2, and the introduction of more middle layers for convolution processing improves the learning ability of the network. During the last stages, the convolutional layers are operated with a 3 × 3 convolution kernel and padding = 1, and last stages are placed before output to accurately enhance the pixelwise detail.

Output Convolution Block
Output blocks play an important role in CNN-based underwater image enhancement for information fusion and channel values' normalization. However, the importance of output blocks was generally ignored in previous underwater image enhancement networks. Our proposed output convolution block incorporated a 3 × 3 convolutional layer with padding = 1, followed by a Sigmoid layer instead of the traditional ReLU layer. Application of the Sigmoid layer ensures that the pixel channel values of output images lie

Output Convolution Block
Output blocks play an important role in CNN-based underwater image enhancement for information fusion and channel values' normalization. However, the importance of output blocks was generally ignored in previous underwater image enhancement networks. Our proposed output convolution block incorporated a 3 × 3 convolutional layer with padding = 1, followed by a Sigmoid layer instead of the traditional ReLU layer. Application of the Sigmoid layer ensures that the pixel channel values of output images lie within the standard range of [0, 1].

Research on CBNet-Based Underwater Image Enhancement
In this section, we explore the architectural features of CBNet. First, we introduce some experiment details. In Section 4.2, we analyze the impact of backbone number on underwater image enhancement. In Section 4.3, we evaluate the underwater image enhancement performance of different variants. In Section 4.4, we implement pruning strategies in some variants. In Section 4.5, we construct auxiliary losses for partial special variants. Finally, we make a summary of composite backbone architecture and choose a network as the optimal composite backbone network for subsequent comparison.

Implementation Details
We applied the UIEB dataset [22] to our study. To construct a training set, we randomly chose 800 underwater images as well as their corresponding reference images from a reference subset. Other images in the UIEB dataset were treated as test images. All images were preprocessed by stretching to 350 × 350, and training set images were cut into 320 × 320 at random. Our loss function was a weighted sum of L1 loss and SSIM loss as follows: where ω l 1 and ω SSIM are weight parameters used to balance two loss components. Weight parameters are selected as ω l 1 = 2, ω SSIM = 1. The two loss components L l 1 and L SSIM are calculated as follows: where I P represents the predicted image and I R represents the reference image. t I P (i) and t I R (i) are the corresponding channel values in a certain pixel i. µ I P (i) and µ I R (i) are average channel values of a 11 2 square interval with center i. σ I P (i) and σ I R (i) represent corresponding standard deviations. σ I P I R (i) corresponds to the covariance, while parameters C 1 = 0.02 and C 2 = 0.03 are selected to stabilize the SSIM function. C, H, and W mean channels, height, and weight, respectively. We applied ADAM optimizer for network training and initialized the learning rate as 1 × 10 −3 and decreased it by 1 × 10 −5 in each epoch. We extracted the trained network models after 100 epochs, which means it takes nearly two hours to train each model. Pytorch as well as Anaconda were jointly used with Nvidia GTX 3080 GPU for programming.
We used the rest of the 90 pairs of images from the UIEB reference subset for network evaluation, and three classical full-reference indications, namely: mean square error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), were adopted. A lower MSE or higher PSNR score indicated that the pixel values of the predicted results were closer to references. A higher SSIM indicated that the predicted result was more similar to the reference. However, these indications could only evaluate images from a single channel. To make evaluation results more rigorous, we conducted evaluations from both gray-scale images and RGB images as follows: E gray (I, I ) = E(RGB2 gray(I), RGB2 gray(I )) (5) where E represents evaluation functions of indications such as MSE, PSNR, or SSIM. I is the enhanced image while I R , I G , and I B are corresponding maps in R, G, and B channels; I is the reference image while I R , I G , and I B are corresponding maps in R, G, and B channels.

Backbone Number Analysis
CBNet was a combination of K backbones with the same architecture. We define the backbones as Bone 1, Bone 2, Bone 3, . . . , Bone K. For any defined backbone, if k = 1, 2, . . . , K − 1, Bone k is considered an auxiliary backbone; if k = K, Bone k is the lead backbone. Generally, there are two approaches for information fusion between backbones, that is, residual fusion and concat fusion. Concat fusion is to amalgamate several images with the same shape in the specified dimension, and the channel number of the output image is the sum of input image channel numbers. Residual fusion is a simple image subtraction. The traditional same-level composition (SLC), as shown in Figure 4a, is applied in backbone number analysis; thus, the lth stage transformation of concat-fused information and residual-fused information are shown respectively: where Cat represents the concat fusion of images. F l C is the combinational function of concat-fused information in the lth stage, F l R is the combinational function of residual-fused information in the lth stage, while x l k is the output image from the lth stage in the kth backbone. When l = 7, F 7 C and F 7 R indicate the combinational function of the output block, while x 7 k indicates the output-enhanced image. We conducted analyses of underwater image enhancement from networks with different backbone numbers. The output channel number from each stage was set as 24 to meet the internal memory demand. Underwater image enhancement analyses were conducted, as shown in Figure 3, details of analyses from different backbone number as shown in Table S1. According to SSIM evaluations in Figure 3a, networks with concat fusion obtained the best SSIM score when K = 3, and networks with residual fusion obtained the best SSIM score when K = 2. According to MSE evaluations in Figure 3b, as the number of backbones increased, MSE gray indicators increased significantly, while MSE RGB from concat fusion networks obtained a weak downward trend. In our case, networks with concat fusion obtained better MSE scores than networks with residual fusion. The advantage of networks with concat fusion became evident according to RSNR evaluation in Figure 3c, and the best PSNR scores were obtained when K = 3. According to Figure 3d, the average runtime of concat fusion networks increased linearly with the number of backbones. Compared with concat fusion networks, the runtime of residual fusion networks decreased to some extent. According to the above results, increasing the number of backbones can increase the amount of information and thus increase the possibility of obtaining high-quality results. However, with the increase in information, the time complexity will increase, and the interference information will increase, too. What is more, the introduction of too-deep network enhancement will gradually distort output images. Thus, excessively increasing the auxiliary backbone is often counterproductive. Suitable backbone architectures would appear when K = 2 or 3.
decreased to some extent. According to the above results, increasing the number of back bones can increase the amount of information and thus increase the possibility of obtain ing high-quality results. However, with the increase in information, the time complexit will increase, and the interference information will increase, too. What is more, the intro duction of too-deep network enhancement will gradually distort output images. Thus excessively increasing the auxiliary backbone is often counterproductive. Suitable back bone architectures would appear when K = 2 or 3.

Connection Strategy Analysis
The analysis in Section 4.2 showed that concat fusion networks achieve relatively suitable underwater image enhancement results when K = 2 or 3. To minimize the time and space complexity, we employed CBNet with two concat-fused backbones for further exploration. To further explore the CBNet architecture, we constructed several variants of CBNet in different connection strategies besides the traditional SLC ( Figure 4). Details of different composite strategies are explained in Sections 4.3.1-4.3.5. The output channel number from each stage was set as 32.

Connection Strategy Analysis
The analysis in Section 4.2 showed that concat fusion networks achieve relatively suitable underwater image enhancement results when K = 2 or 3. To minimize the time and space complexity, we employed CBNet with two concat-fused backbones for further exploration. To further explore the CBNet architecture, we constructed several variants of CBNet in different connection strategies besides the traditional SLC ( Figure 4). Details of different composite strategies are explained in Sections 4.3.1-4.3.5. The output channel number from each stage was set as 32. To obtain stable image enhancement results, higher-level features can be taken to enhance the lower-level features. In the AHLC variant shown in Figure 4b, output information of the adjacent higher-level stage from the auxiliary backbone was sent to the lead backbone. The lth stage transformation of AHLC is shown in Equation (8): where l = 1, x 0 k indicates input image.

Adjacent Lower-Level Composition (ALLC)
Another line for stabilizing image enhancement results is to take lower-level features to enhance the higher-level features, which is just opposite to AHLC. As shown in Figure  4c, ALLC sent the adjacent lower-level stage output information to the lead backbone. The lth stage transformation of ALLC is shown in Equation (9)  To obtain stable image enhancement results, higher-level features can be taken to enhance the lower-level features. In the AHLC variant shown in Figure 4b, output information of the adjacent higher-level stage from the auxiliary backbone was sent to the lead backbone. The lth stage transformation of AHLC is shown in Equation (8): where l = 1, x 0 k indicates input image. Another line for stabilizing image enhancement results is to take lower-level features to enhance the higher-level features, which is just opposite to AHLC. As shown in Figure 4c, ALLC sent the adjacent lower-level stage output information to the lead backbone. The lth stage transformation of ALLC is shown in Equation (9):

Dense Same-Level Composition (DSLC)
As shown in Figure 4d, the output information of each stage from the lead backbone was fused with the output information of non-lower-level stages from the auxiliary backbone and sent to the next stage. The architecture of DSLC is similar to that of DenseNet [33]. The lth stage transformation of DSLC is as shown in Equation (10):

Dense Higher-Level Composition (DHLC)
As shown in Figure 4e, the output information of higher-level stages was sent to the lead backbone. The lth stage transformation of DHLC is as shown in Equation (11):

Full-Connected Composition (FCC)
To connect the auxiliary backbone to the leading backbone as compactly as possible, each auxiliary stage backbone output information was sent to all lead backbone stages, as shown in Figure 4f. The lth stage transformation of FCC is as shown in Equation (12): We then evaluated underwater image enhancement results from networks with different composite strategies described above. The output channel number from each stage was set as 32. Evaluations of the enhanced underwater images from different connection strategies are shown in Table 1, details as shown in Table S2. According to the evaluations, FCC achieved the best scores in SSIM indications, DHLC achieved the best scores in MSE and PSNR indications, and AHLC achieved the best scores in MSE gray and PSNR gray . The results revealed that FCC obtained enhancement images almost similar to reference images, DHLC obtained enhancement images closely resembling reference images, while AHLC obtained enhancement images closer to reference images from gray-scale than other connection strategies.
Further, Table 2 shows the runtime comparison among networks with different connection strategies. According to the results, due to the reduction in the connection between the lead backbone and the auxiliary backbone, ALLC completed underwater image enhancement within the shortest runtime (2.55 × 10 −3 s per image on average). Meanwhile, DHLC constructed pretty complex connections between the lead backbone and the auxil-iary backbone, thus completing underwater image enhancement in the longest runtime. Notably, FCC constructed similar concat connections before each stage, which simplified the network procedure; thus, FCC also enhanced images relatively fast, with an average runtime of only 2.59 × 10 −3 s.

Pruning Strategy Analysis
To reduce the excessive connection between network backbones, we simplified the architecture of CBNet to some extent by pruning. Pruning strategies were applied in CBNet when K = 2. As shown in Figure 5, p i represents pruning i stages from the lead backbone; two backbones share the first i stages. The i+1th stage transformation is as shown in Equation (13): where Cat(x 1 ) represents the connection from the auxiliary backbone to the lead backbone. When i = 0, the network is not pruned; when i = 6, the network is a single backbone network. According to Section 4.3, enhancement results from FCC and DHLC achieved excellent scores, and both FCC and DHLC variants were applied for subsequent studies. ( ( ( ), )),0 6 where 1 ( ) Cat x represents the connection from the auxiliary backbone to the lead backbone. When i = 0, the network is not pruned; when i = 6, the network is a single backbone network. According to Section 4.3, enhancement results from FCC and DHLC achieved excellent scores, and both FCC and DHLC variants were applied for subsequent studies. We conducted analyses of underwater image enhancement from networks with different pruning strategies, as shown in Figure 6, details of analyses from different pruning strategies as shown in Table S3. According to the SSIM evaluation in Figure 6a, FCC obtained better SSIM scores than DHLC. Both FCC and DHLC obtained the best enhancement result when pruning two stages. Moreover, FCC(p2) achieved SSIMgray = 0.9409 and SSIMRGB = 0.9119, which is a breakthrough in full-reference underwater image enhancement. According to the MSE evaluation in Figure 6b, the best MSE scores for FCC and DHLC were also obtained when pruning two stages, with MSE scores for FCC being slightly lower than those for DHLC. According to Figure 6c, the best PSNR scores for DHLC were obtained in p2. On the contrary, the best PSNR scores for FCC were obtained in p3 and p4, respectively. According to a runtime evaluation in Figure 6d, the runtime of FCC had an inverse relationship with pruning stages, with runtime decreasing smoothly with an increase in pruning stages. The runtime of FCC remained in the range of 1.4 × 10 −3~2 .6 × 10 −3 s. The runtime of DHLC from p0 to p3 was excessively long, and there was a sudden decline in runtime between p3 and p4. In conclusion, FCC(p2) obtained the best underwater image enhancement results, and DHLC(p2)-enhanced results obtained the best PSNR scores. These architectures were applied in subsequent studies. We conducted analyses of underwater image enhancement from networks with different pruning strategies, as shown in Figure 6, details of analyses from different pruning strategies as shown in Table S3. According to the SSIM evaluation in Figure 6a, FCC obtained better SSIM scores than DHLC. Both FCC and DHLC obtained the best enhancement result when pruning two stages. Moreover, FCC(p 2 ) achieved SSIM gray = 0.9409 and SSIM RGB = 0.9119, which is a breakthrough in full-reference underwater image enhancement. According to the MSE evaluation in Figure 6b, the best MSE scores for FCC and DHLC were also obtained when pruning two stages, with MSE scores for FCC being slightly lower than those for DHLC. According to Figure 6c, the best PSNR scores for DHLC were obtained in p 2 . On the contrary, the best PSNR scores for FCC were obtained in p 3 and p 4, respectively. According to a runtime evaluation in Figure 6d, the runtime of FCC had an inverse relationship with pruning stages, with runtime decreasing smoothly with an increase in pruning stages. The runtime of FCC remained in the range of 1.4 × 10 −3~2 .6 × 10 −3 s. The runtime of DHLC from p 0 to p 3 was excessively long, and there was a sudden decline in runtime between p 3 and p 4 . In conclusion, FCC(p 2 ) obtained the best underwater image enhancement results, and DHLC(p 2 )-enhanced results obtained the best PSNR scores. These architectures were applied in subsequent studies.
in p3 and p4, respectively. According to a runtime evaluation in Figure 6d, the runtime of FCC had an inverse relationship with pruning stages, with runtime decreasing smoothly with an increase in pruning stages. The runtime of FCC remained in the range of 1.4 × 10 −3~2 .6 × 10 −3 s. The runtime of DHLC from p0 to p3 was excessively long, and there was a sudden decline in runtime between p3 and p4. In conclusion, FCC(p2) obtained the best underwater image enhancement results, and DHLC(p2)-enhanced results obtained the best PSNR scores. These architectures were applied in subsequent studies.

Auxiliary Loss Analysis
Considering the special architecture of AHLC and DHLC, output information of the lead backbone's last stage was directly transmitted to Output Conv Block. As a result, the input information of Output Conv Block had the same channel number as the output information of the auxiliary backbones. Thus, auxiliary losses could be introduced in the above variants, and information from the auxiliary backbones was sent to the Output Conv Block, generating an auxiliary image. The auxiliary losses were computed as shown in Equation (14): where k aux I represents the auxiliary image from the kth backbone, k = 1, 2, ..., k − 1. The total loss was constructed as a weighted sum of the main loss as well as auxiliary losses, as shown in Equation (15): where λk is weight for auxiliary loss from I k aux. We introduced auxiliary losses in AHLC, DHLC as well as DHLC(p2) according to Sections 4.3 and 4.4. The architecture is shown in Figure 7. Moreover, we put two auxiliary strategies into the experiment; one is the full-auxiliary strategy (written down as f-a), where loss weight is selected as a constant that λ1 = 0.3, and the other is the semi-auxiliary strategy (written down as s-a) where during first 50 epochs, λ1 = 0.3 and λ1 = 0 later on.

Auxiliary Loss Analysis
Considering the special architecture of AHLC and DHLC, output information of the lead backbone's last stage was directly transmitted to Output Conv Block. As a result, the input information of Output Conv Block had the same channel number as the output information of the auxiliary backbones. Thus, auxiliary losses could be introduced in the above variants, and information from the auxiliary backbones was sent to the Output Conv Block, generating an auxiliary image. The auxiliary losses were computed as shown in Equation (14): where I k aux represents the auxiliary image from the kth backbone, k = 1, 2, . . . , k − 1. The total loss was constructed as a weighted sum of the main loss as well as auxiliary losses, as shown in Equation (15): where λ k is weight for auxiliary loss from I k aux . We introduced auxiliary losses in AHLC, DHLC as well as DHLC(p 2 ) according to Sections 4.3 and 4.4. The architecture is shown in Figure 7. Moreover, we put two auxiliary strategies into the experiment; one is the full-auxiliary strategy (written down as f-a), where loss weight is selected as a constant that λ 1 = 0.3, and the other is the semi-auxiliary strategy (written down as s-a) where during first 50 epochs, λ 1 = 0.3 and λ 1 = 0 later on.
where λk is weight for auxiliary loss from I k aux.
We introduced auxiliary losses in AHLC, DHLC as well as DHLC(p2) according to Sections 4.3 and 4.4. The architecture is shown in Figure 7. Moreover, we put two auxiliary strategies into the experiment; one is the full-auxiliary strategy (written down as f-a), where loss weight is selected as a constant that λ1 = 0.3, and the other is the semi-auxiliary strategy (written down as s-a) where during first 50 epochs, λ1 = 0.3 and λ1 = 0 later on.  We evaluated semi-auxiliary and full-auxiliary strategies underwater image enhancement results, then compared them with enhancement results without auxiliary losses. Evaluations of underwater image enhancement results are shown in Figures 8-10, details as shown in Table S4. REVIEW 13 We evaluated semi-auxiliary and full-auxiliary strategies underwater image hancement results, then compared them with enhancement results without auxi losses. Evaluations of underwater image enhancement results are shown in Figure 8 details as shown in Table S4.   Table S4.    According to SSIM evaluation ( Figure 8) and MSE evaluation (Figure 9), AHLC tained the best scores in the semi-auxiliary strategy (MSERGB = 161.22), which is a breakthrough in MSE evaluation. For DHLC, the full-auxiliary strategy and the me According to SSIM evaluation ( Figure 8) and MSE evaluation (Figure 9), AHLC obtained the best scores in the semi-auxiliary strategy (MSE RGB = 161.22), which is also a breakthrough in MSE evaluation. For DHLC, the full-auxiliary strategy and the method without auxiliary losses obtained better results than the semi-auxiliary strategy. DHLC(p 2 ) obtained the best scores without auxiliary losses. However, according to Figure 10, auxiliary strategies did not perform well in RSNR. As auxiliary losses are abandoned during network testing, the introduction of auxiliary losses does not affect the network runtime.

Summary
According to the comprehensive analyses above, various network architectures and optimization strategies of CBNet have achieved significant effects in underwater image enhancement. Among them, FCC with two pruned stages (FCC(p 2 )), DHLC with two pruned stages (DHLC(p 2 )), and AHLC with the semi-auxiliary strategy (AHLC(s-a)) obtained excellent image enhancement results. We conducted a comparison among the three networks, as shown in Table 3, where FCC(p 2 ) obtained the best scores in SSIM evaluations, DHLC(p 2 ) obtained the best scores in PSNR evaluations, and AHLC(s-a) obtained the best scores in MSE RGB . Considering FCC(p 2 ) can obtain effective underwater image enhancement in the shortest time, hence meeting the demand of real-time performance, FCC(p 2 ) was selected as the optimal underwater image-enhancing composite backbone network (OECBNet).

Cross-Sectional Research
To verify the performance of our proposed OECBNet, we compared OECBNet with several state-of-the-art CNN-based methods, including UWCNN [9], Water-Net [19], UIEC2-Net [21], IW-Net [23], and MC-CBNet [22]. Besides the 90 pairs of images from the UIEB reference subset, 60 images from the UIEB challenge subset were applied to the analysis. Both full-reference comparison and non-reference comparison were applied in the cross-sectional study. We applied two indicators in the non-reference comparison, one being underwater color image quality evaluation (UCIQE) [40], which consists of color density σ c , saturation µ s, and contrast con l, as shown in Equation (16), and the other being underwater image quality measure (UIQM) [41], which is a combination of three sub-indicators: underwater image colorfulness measurement (UICM), underwater image sharpness measurement (UISM), and underwater image contrast measurement (UIConM). The expression is shown in Equation (17):

Reference Subset Evaluation
First, we tested our network as well as state-of-the-art CNN-based methods in the reference subset, and some underwater image enhancement results are shown in Figure 11. As they are affected by natural illumination and differences in water quality, underwater images bear different features. In shallow water, the color features of images are reserved greatly due to the relatively high illumination. As the light gradually spreads into the deep underwater, the red, green, and blue spectra disappear in succession; hence, many underwater images tend to have cool tones, such as green and blue. On the other hand, images from the water-sediment interphase tend to be yellowish-brown, while images from low-illumination areas appear dim. According to the subjective comparison shown in Figure 11, it is visible that our proposed OECBNet obtained images the closest to reference images from high illumination, low illumination, and greenish or blueish underwater images and obtained clearer images from yellowish underwater images even than reference images. This indicated that OECBNet effectively eliminated the blue and green color bias in underwater images, reduced the effect of sediment on underwater images, and accurately expressed the color of objects and the environment. In addition, our proposed method retained detailed features of the underwater images and enhanced them to appear brighter and clearer.
A full-reference comparison of OECBNet as well as state-of-the-art CNN-based methods is shown in Table 4. OECBNet obtained the best scores in SSIM and MSE evaluations but did not achieve the best PSNR scores, indicating that images obtained from OECBNet still bear great deviations in some pixels. A non-reference comparison of OECBNet and the state-of-the-art CNN-based methods is shown in Table 5. OECBNet obtained the best scores in UCIQE, UIQM, UICM, and UISM and had better scores than reference images in UCIQE, UIQM, UISM, and UIconM, which verified the excellent performance of our proposed network. As they are affected by natural illumination and differences in water quality, underwater images bear different features. In shallow water, the color features of images are reserved greatly due to the relatively high illumination. As the light gradually spreads into the deep underwater, the red, green, and blue spectra disappear in succession; hence, many underwater images tend to have cool tones, such as green and blue. On the other hand, images from the water-sediment interphase tend to be yellowish-brown, while images from low-illumination areas appear dim. According to the subjective comparison shown in Figure 11, it is visible that our proposed OECBNet obtained images the closest to reference images from high illumination, low illumination, and greenish or blueish underwater images and obtained clearer images from yellowish underwater images even than reference images. This indicated that OECBNet effectively eliminated the blue and green color bias in underwater images, reduced the effect of sediment on underwater images, and accurately expressed the color of objects and the environment. In addition, our proposed method retained detailed features of the underwater images and enhanced them to appear brighter and clearer.
A full-reference comparison of OECBNet as well as state-of-the-art CNN-based methods is shown in Table 4. OECBNet obtained the best scores in SSIM and MSE evaluations but did not achieve the best PSNR scores, indicating that images obtained from OECBNet still bear great deviations in some pixels. A non-reference comparison of OECBNet and the state-of-the-art CNN-based methods is shown in Table 5. OECBNet obtained the best scores in UCIQE, UIQM, UICM, and UISM and had better scores than reference images in UCIQE, UIQM, UISM, and UIConM, which verified the excellent performance of our proposed network. The runtime analysis is shown in Table 6. Our proposed network had an obvious real-time advantage compared with most underwater image enhancement networks and achieved underwater image enhancement much swifter than state-of-the-art CNN-based methods, except for UWCNN. It can be seen that the optimization of the architecture has achieved an obvious effect on time complexity and effectively solved the problem of the runaway growth of the time complexity caused by the complex network.

Challenging Subset Evaluation
To further verify the excellent image enhancement capability of our network, we tested our network against the state-of-the-art CNN-based methods using a challenging subset from the UIEB dataset. The results are shown in Figure 12.

Challenging Subset Evaluation
To further verify the excellent image enhancement capability of our network, w tested our network against the state-of-the-art CNN-based methods using a challengin subset from the UIEB dataset. The results are shown in Figure 12.  Our proposed method eliminated color cast and obtained more realistic images than other strategies. For underwater images from low-visibility environments, OECBNet clearly restored the underwater scene. According to the challenge subset, the texture structures of the images are relatively simple, and the visibility is relatively low. It is relatively harder to see differences among the enhanced images from different algorithms on the macro level. Although there were no significant differences among IW-Net (e), MC-CBNet (f), and OECBNet (g) enhanced results, our proposed OECBNet achieved better color control in detail, which can be clearly represented by quantitative evaluation.
Due to the lack of reference images in the challenging subset, we applied non-reference indicators to quantitatively evaluate the enhanced underwater images. The comparison of OECBNet with state-of-the-art CNN-based methods is shown in Table 7. OECBNetenhanced images obtained the best scores in UCIQE, UIQM, UICM, and UISM, hence achieving the best underwater image enhancement results.

Conclusions
In this paper, we applied a composite backbone network (CBNet) in underwater image enhancement by first constructing the global architecture of the proposed CBNet as well as backbone details, then conducting comprehensive research on the architecture of CBNet. Since the backbone is an important part of CBNet, we analyzed the impact of backbone number for CBNet-based underwater image enhancement, then compared the performances of different connection strategies in CBNet, which includes same-level composition (SLC), adjacent higher-level composition (AHLC), adjacent lower-level composition (ALLC), dense same-level composition (DSLC), dense higher-level composition (DHLC), and full-connected composition (FCC). Due to the suitable performance of DHLC and FCC, both of them were selected for pruning strategy analysis. In addition, the auxiliary loss was also applied in CBNet research. To summarize, we selected a network with the best enhancement results as the optimal underwater image-enhancing composite backbone network (OECBNet) for subsequent research. According to the cross-sectional research conducted, the images obtained by our proposed network accurately expressed the colors of objects as well as the colors of corresponding environments while retaining fine feature details. According to quantitative analyses, our proposed network obtained better scores than state-of-the-art CNN-based methods in both full-reference indications and non-reference indications. In addition, our proposed network also performed well in realtime image enhancement and achieved a record 2.3 × 10 −3 s per image. We postulate that CBNet, with more than three backbones and multiple connection strategies, might achieve better performance in underwater image enhancement and recommend further research into these architectures. We expect that our proposed underwater image enhancement network will be applied in ROV for taking underwater videos and underwater detection.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/biomimetics8030275/s1, Table S1: Analyses from different backbone number; Table S2: Analyses from different connection strategies; Table S3: Analyses from different pruning strategies; Table S4: Auxiliary loss analysis. LGF20F020009)); and (3) the key R&D Program of Zhejiang Province focusing on the research and application development of medical nursing care robot and elderly care service system of internet hospitals. (Research on intelligent service technology and equipment of health and elderly care support the research and application development of medical nursing care robot and elderly care service system of internet hospitals (no. 2020C03107).
Institutional Review Board Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to it being only available to teams interested in collaboration.

Conflicts of Interest:
The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.