To comprehensively validate the effectiveness of the proposed model in depth estimation and underwater image restoration, this section designs a detailed experimental plan and conducts benchmark tests against current mainstream methods. The plan includes an overview of datasets, model implementation details, presentation of experimental results, subjective and objective evaluations, and a series of ablation experiments to verify the effectiveness of each module.
4.1. Datasets and Implementation Details
This subsection details the datasets used to validate the effectiveness of the proposed method, as well as specific implementation details, including the experimental environment and model parameters.
- (1) Datasets
We use the FLSea dataset [43] for model training. This dataset provides approximately 20,000 degraded images from 12 different underwater locations across two distinct sea areas, along with corresponding depth maps and clear reference images. We use 15,000 of these images as the training set, and the remaining images serve as the test set.
The Seathru dataset [27] is divided into 5 scenes based on different underwater imaging characteristics. It contains 1100 underwater images with corresponding depth maps, primarily consisting of low-light, dim images with a greenish hue.
The UIEB dataset [44] contains 890 original underwater images and their high-quality reference images. UIEB-S is a subset of 200 images we curated from UIEB, featuring extremely severe color casts that place high demands on the model’s color correction capability.
HURLA is an open-source deep-sea image dataset containing a large number of images of marine relics and underwater species captured under artificial lighting. We select 200 images for evaluation based on different scenes and object types.
Together, these datasets cover a broad range of challenges in underwater image restoration, including different color casts, lighting conditions, objects, and water bodies. The FLSea dataset is randomly split into a training set (15,000 images) and a test set (the remaining images); the FLSea test set measures in-distribution performance, which serves as an upper bound for the model. Seathru, UIEB, UIEB-S, and HURLA serve as fully independent cross-domain test sets for evaluating generalization and are not used in any training or validation step. For hyperparameter tuning, we hold out 20% of the training set as a validation set (an 80/20 split). The final reported results are based on the model trained on the complete training set and evaluated on the independent test sets. The objective metrics for all compared methods are computed on exactly the same data splits to ensure a fair comparison.
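The split protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the image identifiers are placeholders, the total of 20,000 images is the approximate figure quoted above, and using seed 42 for the shuffle is an assumption.

```python
import random


def split_dataset(image_ids, n_train=15_000, val_fraction=0.2, seed=42):
    """Randomly split ids into train / validation / test subsets.

    n_train images are drawn for training and the rest form the test set;
    val_fraction of the training images is then held out for tuning.
    """
    rng = random.Random(seed)  # local RNG, leaves the global state untouched
    shuffled = list(image_ids)
    rng.shuffle(shuffled)

    train_full = shuffled[:n_train]  # used to train the final model
    test = shuffled[n_train:]        # never seen during training or tuning

    n_val = int(len(train_full) * val_fraction)
    val = train_full[:n_val]         # held out for hyperparameter tuning
    train = train_full[n_val:]
    return train, val, test, train_full


# Placeholder ids standing in for the FLSea frames (hypothetical naming).
ids = [f"flsea_{i:05d}" for i in range(20_000)]
train, val, test, train_full = split_dataset(ids)
print(len(train), len(val), len(test))
```

With these assumed sizes the call yields 12,000 tuning-time training images, 3,000 validation images, and 5,000 test images; `train_full` is the set on which the final model would be retrained.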
- (2) Implementation details
Our network is implemented in PyTorch (version 2.4.1) and trained and tested on a machine with an Intel Core i5-13600KF CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4070 GPU. Model parameters are initialized using Kaiming initialization; the total parameter count is approximately 11.9 M, which makes it a lightweight model. All images undergo standardized preprocessing and are resized to a resolution of 640 × 480. No data augmentation is used during training. We employ the AdamW optimizer with betas = (0.9, 0.999) and weight_decay = 1 × 10⁻⁴, and set the initial learning rate to 1 × 10⁻⁴. The model is optimized with a batch size of 6 for a total of 50 epochs. After each epoch, PSNR is computed on the validation set, and the best model is saved (early stopping is not used). The random seed is fixed to 42 to ensure reproducibility. During testing, image preprocessing is consistent with training, and the output images are used directly as the final results.
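For reference, the reported settings can be collected in a small configuration object, together with the rule of keeping the checkpoint with the best validation PSNR. This is a sketch under stated assumptions: the class name `TrainConfig`, its field names, and the helper `select_best_epoch` are illustrative, and the validation PSNR values in the usage lines are made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values are taken from the text; the field names are illustrative.
    image_size: tuple = (640, 480)  # as written in the text; axis order not stated
    batch_size: int = 6
    epochs: int = 50
    lr: float = 1e-4
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 1e-4
    seed: int = 42


def select_best_epoch(val_psnr_per_epoch):
    """Return (epoch_index, psnr) of the checkpoint that would be kept."""
    best_epoch, best_psnr = -1, float("-inf")
    for epoch, psnr in enumerate(val_psnr_per_epoch):
        if psnr > best_psnr:  # strict '>' keeps the earliest epoch on ties
            best_epoch, best_psnr = epoch, psnr
    return best_epoch, best_psnr


cfg = TrainConfig()
# Optimizer steps per epoch if all 15,000 training images are used.
steps_per_epoch = math.ceil(15_000 / cfg.batch_size)
made_up_val_psnr = [21.0, 22.4, 22.1, 22.4]  # hypothetical values
print(steps_per_epoch, select_best_epoch(made_up_val_psnr))
```

The usage lines print `2500 (1, 22.4)`: 2,500 steps per epoch, and the second epoch (index 1) retained because a later tie does not replace it. Whether ties are broken this way in the original training code is not stated.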
4.2. Depth Estimation Comparative Experiments
This subsection provides a comprehensive evaluation of the effectiveness of the proposed depth estimation method.
The depth estimation experiments are conducted on the FLSea and Seathru datasets. These two datasets provide reliable reference depth maps, enabling the calculation of objective metrics. They also cover multiple challenges such as different lighting conditions, depth ranges, and water bodies. We compare our method with several established models: Monodepth2 [18], Manydepth [20], LapDepth [21], and UDepth [22]. The experimental results are shown in Figure 6.
Samples A–C in Figure 6 are from the FLSea dataset, and samples D–F are from the Seathru dataset. These images cover various typical underwater visual environments, such as different lighting, different color casts, and different distances. Sample A represents a seabed environment under natural light near the surface. Samples B and D test depth estimation consistency for differently colored objects at the same position under bright and low-light conditions. Sample C contains objects at various distances with noticeable haze. Samples E and F show spherical coral reefs from different angles under low-light conditions.
- (1) Subjective evaluation
To comprehensively assess the effectiveness of the proposed depth estimation model, we conduct a subjective evaluation by comparing the visual effects of our method with other mainstream methods from the perspectives of lighting conditions, target distance, and color consistency.
From the perspective of lighting conditions, Monodepth2 exhibits a layered structure in continuous depth regions under low light (e.g., Figure 6D–F). Manydepth shows large-area blurring in highlight regions (e.g., Figure 6A,C) and loses depth details under low light, making it difficult to distinguish the distance levels of different objects. In contrast, our method maintains clear object contours and reasonable depth level distributions under different lighting conditions.
From the perspective of different target distances, in multi-plane scenes (e.g., Figure 6C) and curved surface scenes (e.g., Figure 6E,F), LapDepth incorrectly estimates foreground objects with colors similar to the background as background, leading to confusion in depth levels. UDepth suffers from depth value overestimation, with depth values in most areas being larger than the ground truth. In contrast, our method provides results closer to the ground truth depth map both in multi-plane scenes and from different directions of the spherical coral reefs, accurately reflecting the hierarchical structure and surface geometry of the targets.
From the perspective of depth consistency for differently colored objects at the same distance, Figure 6B,D shows white lines and black-and-white squares lying at the same depth as the surrounding seabed. Other methods estimate depth values for these objects that differ markedly from those of the surrounding area, which is inconsistent with reality. Our method exhibits better depth consistency in these areas, assigning similar depth estimates to differently colored objects at the same distance and aligning more closely with the true 3D structure of the scene.
In summary, in the subjective evaluation our method performs better than the compared mainstream methods with respect to lighting conditions, target distance, and color consistency, and its overall visual quality is clearly superior on the samples shown.
- (2) Objective evaluation
For objective evaluation, we conduct a comprehensive comparison using the Absolute Relative Error (AbsRel), Squared Relative Error (SqRel), Root Mean Squared Error (RMSE), and δ-threshold accuracy metrics [45,46], which provide quantitative support for our proposed method. Lower values of AbsRel, SqRel, and RMSE indicate smaller errors between the model-predicted depth values and the ground truth, signifying better model performance. The accuracy metrics represent the proportion of pixels for which the ratio between predicted depth and ground truth depth (taking the larger of the two possible ratios) falls below a given threshold. Higher values indicate more accurate predictions. In this paper, the thresholds for δ1, δ2, and δ3 are set to 1.5, 1.5², and 1.5³, respectively.
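The metrics above can be written out as a short reference sketch. It follows the standard definitions of AbsRel, SqRel, RMSE, and threshold accuracy with the base 1.5 used in this paper; skipping pixels whose reference depth is NaN or non-positive is an assumption about invalid-pixel handling, and the function name `depth_metrics` is illustrative.

```python
import math


def depth_metrics(pred, gt, base=1.5):
    """Compute AbsRel, SqRel, RMSE and delta accuracies over valid pixels.

    pred and gt are flat sequences of depths. Pixels with a NaN or
    non-positive reference depth (or non-positive prediction) are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt)
             if not math.isnan(g) and g > 0 and p > 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid pixels")

    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Symmetric ratio: a pixel counts as accurate if max(p/g, g/p) < base**k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < base ** k for r in ratios) / n for k in (1, 2, 3)]

    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "delta1": deltas[0], "delta2": deltas[1], "delta3": deltas[2]}


# Toy example: the third pixel has no reference depth and is ignored.
print(depth_metrics([1.0, 2.0, 4.0], [1.0, 1.0, float("nan")]))
```

In the toy example only two pixels are valid, so AbsRel and SqRel are both 0.5 and half of the pixels satisfy the δ1 threshold.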
Table 1 presents the quantitative comparison results of different models on the FLSea dataset. We use bold font to indicate the best results and underline to indicate the second-best results.
From the comparison results on the FLSea dataset, our model achieves the best values on all error metrics. Specifically, AbsRel (0.747) is about 6.4% lower than the second-best Monodepth2 (0.798), SqRel (0.272) is about 19.8% lower than the second-best Monodepth2 (0.339), and RMSE (0.228) is about 7.7% lower than the second-best Monodepth2 (0.247). This indicates that our model’s predicted depth is closer to the ground truth with smaller errors. Among other models, Monodepth2 performs relatively well on error metrics, while Manydepth and UDepth have larger errors, especially UDepth with an AbsRel as high as 1.355. Our model also achieves the highest scores on all threshold accuracy metrics. Specifically, δ1 (0.543) is about 13.6% higher than the second-best LapDepth (0.478), δ2 (0.756) is slightly higher than the second-best LapDepth (0.749), and δ3 (0.884) is slightly higher than the second-best LapDepth (0.875). This indicates that our model performs better in terms of the accuracy and consistency of depth map prediction.
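The percentages quoted above are relative changes with respect to the second-best method. The following sketch reproduces them from the values stated in the text; the helper name `relative_change` is illustrative.

```python
def relative_change(ours, reference):
    """Signed relative change of `ours` with respect to `reference`, in percent."""
    return (ours - reference) / reference * 100.0


# (ours, second-best) pairs on FLSea, as quoted in the text.
flsea = {
    "AbsRel": (0.747, 0.798),
    "SqRel": (0.272, 0.339),
    "RMSE": (0.228, 0.247),
    "delta1": (0.543, 0.478),
    "delta2": (0.756, 0.749),
    "delta3": (0.884, 0.875),
}

for name, (ours, second) in flsea.items():
    print(f"{name}: {relative_change(ours, second):+.1f}%")
```

Negative values are reductions in error and positive values are gains in accuracy. The "slightly higher" δ2 and δ3 margins work out to roughly 0.9% and 1.0%.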
Table 2 presents the quantitative comparison results of different models on the Seathru dataset.
Our model still achieves the best values on all error metrics on the Seathru dataset. Specifically, AbsRel (0.784) is about 19.9% lower than the second-best Monodepth2 (0.979), SqRel (0.277) is about 29.9% lower than the second-best Monodepth2 (0.395), and RMSE (0.183) is about 38.6% lower than the second-best LapDepth (0.298). This indicates that our model achieves more significant error reduction on the Seathru dataset, especially on RMSE, where the deviation between predicted depth and ground truth is greatly reduced. Monodepth2 performs relatively well on AbsRel and SqRel but has a higher RMSE; LapDepth performs relatively well on RMSE but has higher AbsRel and SqRel. Our model also achieves the highest scores on all accuracy metrics. Specifically, δ1 (0.602) is about 23.9% higher than the second-best LapDepth (0.486), δ2 (0.829) is about 16.6% higher than the second-best LapDepth (0.711), and δ3 (0.918) is about 9.2% higher than the second-best LapDepth (0.841). This indicates a substantial improvement in the accuracy of depth map prediction by our model, bringing it closer to the true depth values. Although LapDepth performs relatively well on accuracy metrics, it is still far below our model.
Our depth estimation model consistently outperforms the compared models across all metrics on both datasets. This indicates that our method adapts well to complex underwater environments and mitigates the interference of underwater image degradation with depth cue extraction. The stable performance across two significantly different datasets, together with the improvements in both error and accuracy metrics, suggests good adaptability to different water quality conditions, lighting environments, and shooting distances, where earlier methods generalize poorly across water bodies. The larger error reductions on Seathru, which consists mainly of low-light images, further suggest that our method improves the extraction of depth features under low-light conditions.
4.3. Image Restoration Comparative Experiments
This subsection provides a comprehensive evaluation of the image restoration results, validating the adaptability of the proposed method to different degradation types and underwater environments.
The image restoration evaluation experiments are conducted on several challenging underwater image datasets. The proposed method is compared with several representative methods, including a model-free method (CLAHE [11]), physics-based model methods (IBLA [23] and ULAP [24]), and deep learning-based methods (UWCNN [13], FUnIE-GAN [14], and U-Transformer [9]). Experimental results with clear reference images are shown in Figure 7, and results without reference images are shown in Figure 8.
Samples A and B in Figure 7 are from the FLSea dataset, C and D are from the UIEB dataset, and E–G are from the UIEB-S dataset. Samples A–C in Figure 8 are from the Seathru dataset, and D–F are from the HURLA dataset. These samples involve different lighting conditions, seabed planes, close-ups of marine life, coral reef structures, and cases of severe color cast. Furthermore, we reuse the depth estimation samples from the previous subsection (Figure 7A,B and Figure 8B,C) to demonstrate how the depth estimates carry over into the subsequent image restoration step.
- (1) Subjective Evaluation
To comprehensively evaluate the performance of the overall image restoration model, this paper systematically analyzes the restoration effects from three aspects: brightness compensation, color correction, and detail recovery.
In terms of brightness compensation, for low-light images (Figure 8A–C), the restoration effects of the comparison methods are poor and can even lead to more severe degradation and distortion. For instance, ULAP darkens the dim areas in the image, UWCNN exhibits abnormal color distortion, and while CLAHE can effectively increase brightness, it does not alleviate the greenish cast caused by the environment. U-Transformer can effectively enhance brightness, but its restoration results introduce unnatural yellow tones that do not exist in the original image. For example, the green coral reefs in the figure turn yellow after restoration. The CLAHE method improves image brightness to some extent (Figure 8D–F), but excessive enhancement often leads to an overall overly bright and oversaturated image. In contrast, our method provides brightness compensation for both targets and the surrounding environment, presenting clearer image details, and achieves color cast correction while enhancing image brightness.
In terms of color correction, existing typical methods have limited ability to handle color distortion. For images with severe color cast (Figure 7E–G) and greenish cast under low light (Figure 8A–C), U-Transformer can alleviate color cast to some extent, but its correction effect is limited for severe color cast (as shown in Figure 7F). The restoration results of other methods still exhibit a strong greenish cast, especially UWCNN, which also shows abnormal color distortion. Our method effectively eliminates the severe color cast in the images. The restoration effects of other methods on yellowish-tone images (Figure 8D–F) are also limited. In particular, the results from the IBLA method show abnormal purple hues, and the results from ULAP appear reddish. In contrast, our method can effectively address severe color cast problems, maintaining stable color restoration effects.
In terms of detail recovery, the differences among the methods are particularly significant in the textures of close-range objects and the fine structures of organisms. As shown in Figure 7A,B, in the restoration of seabed rock areas, methods like ULAP and FUnIE-GAN blur the geometric details of the rock surfaces, weakening their original texture features and edge definition. Our method more clearly restores the uneven textures and crack details of the rock blocks, with higher edge sharpness. Our method also shows advantages in the restoration of marine organisms: as shown in Figure 7C and Figure 8F, the textures of marine fish and the surrounding environmental details of coral and sand grains are more clearly recovered; in Figure 7D and Figure 8D, the complex tentacle structures of anemone-like organisms are also more significantly restored. In contrast, the restoration results of other methods in these areas appear smoother or with blurred contours.
Overall, our method outperforms existing representative algorithms in terms of color correction, brightness compensation, detail preservation, and robustness in complex scenes. Its restoration results are visually more natural and realistic, closest to the reference ground truth (GT) images, demonstrating the effectiveness of the joint optimization of the depth estimation model and the physical imaging model.
- (2) Objective Evaluation
To evaluate the effectiveness of the proposed method more comprehensively, we conducted full-reference and no-reference evaluations for the different methods. Table 3 presents the results for the full-reference metrics Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [47], which measure the pixel-level and structural similarity between the restored image and the ground-truth reference image. Table 4 presents the results for the no-reference metrics UIQM and UCIQE [48], which quantify the visual perceptual quality of the restored image in terms of color, clarity, and contrast. As before, bold font in the tables indicates the best result for each metric and underlining indicates the second-best result.
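As a reminder of what the full-reference pixel-level metric measures, PSNR can be sketched in a few lines. This is an illustrative implementation for flat pixel sequences, assuming 8-bit images (peak value 255); it is not the evaluation code used for Table 3, and SSIM is omitted because it requires a windowed computation.

```python
import math


def psnr(restored, reference, max_value=255.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    if len(restored) != len(reference) or len(restored) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(restored)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 2 grey levels: MSE = 4.
print(round(psnr([10, 20], [12, 18]), 2))
```

Higher values mean a smaller mean squared error relative to the peak intensity, which is why PSNR is read as a measure of pixel-level fidelity.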
As Table 3 shows, our method achieves the highest scores on the SSIM metric, indicating a clear advantage in restoring the structural information of images, with its output being structurally closest to the clear reference image. On the most challenging dataset, the severe-color-cast subset UIEB-S, it reaches 0.8954. This is a 2.8% improvement over the second-best CLAHE (0.8712), a 27.2% improvement over the worst-performing IBLA, a 15.4% improvement over ULAP, a 21.0% improvement over UWCNN, a 16.7% improvement over FUnIE-GAN, and a 16.7% improvement over U-Transformer, indicating a strong capability for restoring images with extreme color cast.
On the PSNR metric, our method performs best on both the UIEB and UIEB-S datasets, with a particularly large improvement on UIEB-S. There, our method’s PSNR (23.35) is nearly 3 dB higher than the second-best FUnIE-GAN (20.50). It shows improvements of 4.29 dB over CLAHE, 4.87 dB over IBLA, 3.25 dB over ULAP, 2.86 dB over U-Transformer, and the largest improvement of 5.61 dB over UWCNN. This improvement indicates that our method introduces fewer artifacts during restoration and reduces pixel-level errors. On the FLSea dataset, our method (24.09) is lower than IBLA (25.24) but still much higher than the other comparison methods, with improvements of 7.31 dB and 6.31 dB over ULAP and UWCNN, respectively. This reflects a good balance between pixel accuracy and structural fidelity and a stable performance across datasets. The margin is largest on UIEB-S, which supports the generalization ability and robustness of our method on the most challenging underwater image degradations, especially severe color distortion.
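To put the decibel gaps in perspective, a PSNR difference can be translated into a ratio of mean squared errors. The relation below follows directly from the definition of PSNR and holds exactly for a single image pair; because reported PSNR values are typically averaged over a test set, the resulting ratio should be read as indicative only.

```latex
\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\quad\Longrightarrow\quad
\Delta = \mathrm{PSNR}_a - \mathrm{PSNR}_b
       = 10 \log_{10} \frac{\mathrm{MSE}_b}{\mathrm{MSE}_a},
\qquad
\frac{\mathrm{MSE}_b}{\mathrm{MSE}_a} = 10^{\Delta / 10}.

% Example with the UIEB-S gap between our method and FUnIE-GAN:
\Delta = 23.35 - 20.50 = 2.85~\mathrm{dB}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_b}{\mathrm{MSE}_a} = 10^{0.285} \approx 1.93 .
```

Under this reading, a gap of 2.85 dB corresponds to a mean squared error that is roughly 48% lower.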
As Table 4 shows, the no-reference evaluation results indicate that our method performs well on the UCIQE metric. On the most challenging dataset, UIEB-S, it shows a 3.8% improvement over the second-best ULAP and a 16.9% improvement over UWCNN. On FLSea, it also improves over the compared methods overall, with a particularly large 9.3% improvement over FUnIE-GAN. On UIEB, it is only 0.3% lower than ULAP but shows substantial improvements over the other methods. UCIQE mainly evaluates the colorfulness, saturation, and contrast of the image, so these results indicate that our method effectively restores the color information lost in underwater images and produces results that are rich in color with appropriate contrast.
However, our method is not the best on the UIQM metric, generally performing at a medium-to-high level. The reason for this is that UIQM is very sensitive to changes in color saturation. Traditional enhancement methods like CLAHE and generative models like FUnIE-GAN often significantly boost the global contrast and saturation of images, leading to higher UIQM scores, but sometimes introducing unnatural colors or over-enhancement (as seen in the results of Figure 7 and Figure 8). In contrast, our method focuses more on accurate color correction and detail recovery rather than blindly increasing saturation. This may result in relatively lower UIQM scores but brings higher fidelity (SSIM/PSNR) and more natural colors (UCIQE).
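For context, both no-reference metrics are weighted sums of simple image statistics. The forms below use the default coefficients commonly quoted from the original metric definitions; the text does not state which implementation or coefficients were used for Table 4, so these values are an assumption.

```latex
% UIQM: colorfulness (UICM), sharpness (UISM), and contrast (UIConM) measures
\mathrm{UIQM} = c_1\,\mathrm{UICM} + c_2\,\mathrm{UISM} + c_3\,\mathrm{UIConM},
\qquad c_1 = 0.0282,\; c_2 = 0.2953,\; c_3 = 3.5753

% UCIQE: chroma standard deviation, luminance contrast, mean saturation
\mathrm{UCIQE} = k_1\,\sigma_c + k_2\,\mathrm{con}_l + k_3\,\mu_s,
\qquad k_1 = 0.4680,\; k_2 = 0.2745,\; k_3 = 0.2576
```

Because each score is a fixed linear combination of global statistics, a method that raises global contrast or colorfulness can raise the score without the result being closer to the true scene.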
In recent years, significant progress has been made in the field of underwater image quality assessment (IQA). For instance, PUIQA [49] incorporates physics-informed guidance and multi-scale perception, explicitly considering physical priors such as non-uniform illumination and backscatter gradient. Our method performs restoration through the depth-driven Akkaynak-Treibitz physical model, and its output images possess inherent advantages in terms of physical consistency. It is anticipated that these images will achieve high scores in the physics-informed dimensions of metrics like PUIQA. New metrics such as PUIQA inherit the focus of traditional metrics on structural information. The excellent performance of our method in SSIM (0.8954 on UIEB-S) has already demonstrated its structural preservation capability, and this advantage is expected to extend to the evaluation of these new metrics.
In summary, the core advantage of our method lies in its balance and accuracy. It performs well on the no-reference metric UCIQE and, more importantly, leads on the fidelity metrics that measure restoration accuracy: it achieves the best SSIM, and the best PSNR on UIEB and UIEB-S, with PSNR on FLSea second only to IBLA. This means our method can produce visually pleasing results while ensuring that the restored image is closer to the real, clear scene in terms of both structure and pixel values.
4.5. Ablation Study
This subsection systematically validates the specific contributions of the key modules proposed in this paper to the depth estimation performance through ablation experiments. Since an accurate depth map is a prerequisite for image restoration based on a physical model, the performance of depth estimation directly affects the final restoration result. Therefore, this section focuses on ablation analysis of the depth estimation network.
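The dependence of restoration on depth can be illustrated with a simplified, single-channel form of a depth-driven imaging model, in which the observed intensity is an attenuated direct signal plus backscatter. This sketch assumes constant attenuation and backscatter coefficients and made-up values; it is not the restoration pipeline of this paper, only a demonstration of how a depth error propagates through the model inversion.

```python
import math


def degrade(J, z, beta_d, beta_b, b_inf):
    """Forward model: attenuated direct signal plus backscatter at range z."""
    return J * math.exp(-beta_d * z) + b_inf * (1.0 - math.exp(-beta_b * z))


def restore(I, z, beta_d, beta_b, b_inf):
    """Invert the forward model for scene radiance, given a depth estimate z."""
    backscatter = b_inf * (1.0 - math.exp(-beta_b * z))
    return (I - backscatter) * math.exp(beta_d * z)


# Hypothetical single-channel values, chosen only for illustration.
J_true, z_true = 0.6, 4.0
beta_d, beta_b, b_inf = 0.25, 0.35, 0.3

I_obs = degrade(J_true, z_true, beta_d, beta_b, b_inf)

# Restoring with the true depth recovers the radiance (up to rounding);
# a 20% depth overestimate yields a visibly different value.
J_exact = restore(I_obs, z_true, beta_d, beta_b, b_inf)
J_biased = restore(I_obs, 1.2 * z_true, beta_d, beta_b, b_inf)
print(J_exact, J_biased)
```

Because depth enters through exponentials, the recovered value departs from the true radiance as soon as the depth estimate is biased, which is the motivation for focusing the ablation on the depth network.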
We conducted systematic ablation experiments with several configurations on the Seathru dataset: a Baseline based on AdaBins and UDepth, a variant with the encoder replaced by MobileNetV3-small (V3), a variant with a CBAM attention module added, a variant with the Channel–Spatial Hybrid Attention Module (CSHAM) added, and the complete model (Ours). This allowed for a quantitative assessment of each module’s impact on depth estimation accuracy and model efficiency. All configurations were tested under identical training settings, with the results presented in Table 6.
Table 6 reveals several key points. Firstly, compared to the Baseline, replacing the encoder with MobileNetV3-small significantly reduces the model’s parameter count (from 15.6 M to 11.8 M, a decrease of approximately 24.4%) while improving all depth estimation metrics. For example, AbsRel decreases from 0.898 to 0.852, SqRel from 0.345 to 0.315, and RMSE from 0.211 to 0.201. This lightweight design does not compromise performance; instead, it enhances accuracy through more efficient feature extraction. Secondly, introducing the CSHAM module further reduces AbsRel, SqRel, and RMSE to 0.804, 0.289, and 0.198, respectively, with only a minimal increase in parameters. Compared to introducing CBAM, our CSHAM achieves better evaluation metrics with fewer parameters, indicating that CSHAM effectively guides the model to focus on features more critical for depth estimation in underwater scenes, thereby improving depth estimation accuracy. Finally, our complete model achieves the best performance on all error and accuracy metrics. It reduces AbsRel by 12.7%, SqRel by 19.7%, and RMSE by 13.3% compared to the Baseline, while keeping the parameter count at 11.9 M, achieving a favorable balance between accuracy and efficiency.
In summary, the ablation experiments confirm the effectiveness of each proposed module in enhancing underwater monocular depth estimation performance. The effective combination of these modules enables the complete model to achieve a superior balance between parameter count and accuracy, laying a solid foundation for subsequent high-quality image restoration.
4.6. Experimental Summary
Through systematic experimental design and analysis, the effectiveness and advancement of the proposed method have been comprehensively validated across three dimensions: depth estimation, image restoration, and ablation studies. In terms of depth estimation, both subjective visual comparisons and objective metric evaluations demonstrate that our method produces more accurate and consistent depth maps across various underwater environments, significantly outperforming existing mainstream methods. Regarding image restoration, our method exhibits outstanding performance from both subjective visual perspectives (such as brightness, detail, and color) and objective metrics (such as PSNR, SSIM, and UCIQE), with particularly notable advantages in scenes with extreme color cast. The ablation study further confirms the necessity and contribution of each innovative module. Overall, the experimental results demonstrate that the technical approach of deeply integrating monocular depth estimation with a physical imaging model can effectively address the degradation problems caused by complex underwater environments, providing reliable technical support for enhancing the performance of underwater vision tasks.
However, our work also has certain limitations. During experiments, we observed that restoration performance is sometimes suboptimal for background regions with large depth values, such as distant water bodies or dim far-range targets. Our analysis suggests two main factors contributing to this: First, in these regions, light signals undergo extreme attenuation and scattering, resulting in severe loss of image information and providing very weak or even contradictory visual cues for depth estimation. This leads to inaccuracies in depth estimation for distant areas. Second, this fundamental problem is amplified by inherent issues in our training datasets: the depth values in reference depth maps for these distant or information-lost regions are often marked as “NaN,” indicating unavailable or unreliable ground truth depth. This data absence prevents our model from effectively learning how to correctly regress depth in these regions during training, leading to depth estimation biases during testing, which further propagate through the physical imaging model and result in restoration failures in those areas.
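The effect of missing reference depth can be made concrete with a masked loss. Excluding NaN pixels from the loss is a common way to handle such data, but whether the training code of this paper does so is not stated; the function name `masked_l1` and the toy values are illustrative.

```python
import math


def masked_l1(pred, target):
    """Mean absolute error over pixels with a finite reference depth.

    Returns (loss, valid_fraction); loss is None when no pixel is valid.
    """
    valid = [(p, t) for p, t in zip(pred, target) if math.isfinite(t)]
    if not valid:
        return None, 0.0
    loss = sum(abs(p - t) for p, t in valid) / len(valid)
    return loss, len(valid) / len(target)


nan = float("nan")
target = [1.5, nan, 3.0, nan]  # distant pixels have no reference depth

# Two predictions that differ only where the reference is NaN.
loss_a, frac = masked_l1([1.0, 2.0, 3.0, 4.0], target)
loss_b, _ = masked_l1([1.0, 9.0, 3.0, 0.1], target)
print(loss_a, loss_b, frac)
```

Both predictions receive the same loss even though they disagree strongly at the masked pixels, so no gradient signal would reach those regions. This is consistent with the depth bias observed for distant areas.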