Article

Underwater Image Restoration Integrating Monocular Depth Estimation with a Physical Imaging Model

1 School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400074, China
2 School of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(6), 563; https://doi.org/10.3390/jmse14060563
Submission received: 9 February 2026 / Revised: 7 March 2026 / Accepted: 17 March 2026 / Published: 18 March 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater images suffer from quality degradation such as haze, detail blurring, color distortion, and low contrast, caused by light scattering and wavelength-dependent attenuation in water. This degradation severely hinders target detection by Autonomous Underwater Vehicles (AUVs) that rely on image information. Although deep learning-based methods have gained widespread attention, existing approaches still face challenges such as insufficient feature extraction and limited generalization in complex real-world scenes. Methods based on physical models, on the other hand, rely heavily on depth information, which is difficult to obtain accurately. To address these issues, this paper proposes a novel underwater image restoration method that integrates depth estimation with the Akkaynak-Treibitz physical imaging model. In the depth estimation stage, efficient and robust feature extraction is achieved through a lightweight encoder–decoder architecture combined with a channel–spatial hybrid attention mechanism. To overcome the inherent scale ambiguity problem in monocular depth estimation, which prevents direct output of absolute depth consistent with the real scene, sparse depth priors are introduced. Subsequently, adaptive depth binning and depth map optimization are realized via m-Vision Transformer and convolutional regression. In the image restoration stage, the acquired high-quality depth map is combined with the Akkaynak-Treibitz physical imaging model for inverse solving, achieving high-quality restoration from degraded to clear images. Experimental results demonstrate that the proposed method outperforms mainstream depth estimation methods (LapDepth, UDepth, etc.) and mainstream image restoration methods (CLAHE, FUnIE-GAN, etc.) in terms of evaluation metrics and visual perceptual quality.
When processing the extremely degraded UIEB-S dataset, the proposed method achieves evaluation metrics of SSIM = 0.8954, UCIQE = 0.6107, and PSNR = 23.35 dB. Compared to the CLAHE and FUnIE-GAN methods, SSIM improved by 2.8% and 16.7%, UCIQE improved by 9.6% and 14.3%, and PSNR improved by 22.5% and 13.9%, respectively. Comprehensive subjective and objective evaluation results validate the effectiveness of the proposed method in addressing image quality degradation, particularly demonstrating outstanding capability in severe color cast correction and detail recovery.

1. Introduction

With the increasing demands for marine resource exploration and underwater investigation, underwater vision is playing an increasingly crucial role. However, unlike terrestrial environments, underwater optical imaging faces more severe challenges. As light propagates through water, it undergoes intense absorption and scattering effects, accompanied by significant wavelength-dependent attenuation, leading to prevalent image degradations such as haze, detail blurring, color distortion, and low contrast in captured underwater images [1].
For deep-sea target detection by AUVs [2], sunlight is absent, and only artificial light sources are available. Moreover, underwater images captured under near-surface sunlight and those captured under deep-sea artificial illumination exhibit different lighting conditions and image characteristics, and consequently require different image enhancement approaches [3]. The modeling, detection, and compensation of artificial light sources represent a relatively new research topic compared to general underwater image processing [4]. Furthermore, the limited intensity of artificial light, coupled with significant underwater attenuation, often results in weak illumination for distant scenes. The degradation of underwater image quality under various lighting conditions severely impacts the accuracy and reliability of subsequent visual tasks (such as target recognition [5] and 3D reconstruction [6]). Therefore, developing effective underwater image clarification techniques capable of handling complex underwater environments is a key issue in the field of underwater vision [7,8,9].
To address the challenges posed by degraded images, existing research on underwater image clarification primarily follows three paths: physics-based model methods, non-model methods, and deep learning methods. Non-model methods focus on image enhancement by adjusting features like brightness and contrast to improve image quality, such as Histogram Equalization (HE) [10] and Contrast Limited Adaptive Histogram Equalization (CLAHE) [11]. These methods are computationally efficient but prone to over-enhancement and artifacts, leading to loss of detail in bright areas and amplified noise in dark areas. Deep learning-based restoration methods adopt a data-driven approach [12], leveraging large amounts of training data to learn the degradation process and restoration mapping, such as the convolutional neural network-based UWCNN [13] and the generative adversarial network-based FUnIE-GAN [14]. However, their generalization capability faces severe challenges in complex scenes not covered by the training data. Most methods struggle to effectively restore images under low-light conditions, and their correction effectiveness for images with severe color casts is limited.
The core of physics-based model restoration methods lies in accurately estimating physical parameters related to the imaging process, such as water attenuation coefficients, background light, and scene depth maps. Here, the depth map represents the distance between each pixel in the scene and the camera, capturing the spatial layout and proximity of objects. However, physics-based methods typically rely heavily on the accuracy of the depth map, which is extremely challenging to obtain in practical underwater applications [15]. For example, Real-time Data Acquisition Systems (RTDAQS) in bathymetric Light Detection and Ranging (LiDAR) instruments rely on external bathymetric LiDAR [16]. Scene depth estimation is generally categorized into binocular and monocular depth estimation [17]. The accuracy of binocular depth estimation depends on the baseline distance, limiting its practical application. Monocular depth estimation offers flexible configuration and holds significant potential for practical applications.
Existing methods exhibit clear shortcomings when addressing the challenges of monocular depth estimation:
First, some depth estimation networks directly employ large networks designed for terrestrial scenes. For instance, the Monodepth2 [18] method uses a large ResNet [19] network, resulting in high computational complexity, making it difficult to deploy on resource-constrained AUVs. Furthermore, these networks lack targeted optimization for underwater features. The complex degradation characteristics of underwater images interfere with the extraction of depth-related features (like texture and edges), leading to inaccurate information extraction.
Second, most existing methods treat depth estimation and image restoration as independent tasks, neglecting the deep integration of high-accuracy estimated depth maps with interpretable physical models. For example, restoration methods like UWCNN [13] and FUnIE-GAN [14] only construct a restoration system mapping from degraded to clear images, lacking physical interpretability.
Third, as training data often come from specific water bodies, existing models suffer significant performance degradation when facing different water types and lighting conditions. For instance, methods like Manydepth [20] exhibit loss of depth details when processing images under artificial lighting, leading to inaccurate depth estimation. Methods like LapDepth [21] and UDepth [22] produce inconsistent depth value estimates for objects of different colors located at the same position.
Finally, due to the lack of sufficient geometric information in a single image, models can typically only predict relative depth rather than directly outputting absolute depth consistent with the real scene. This scale ambiguity limits the accuracy and reliability of depth estimation in practical applications. Methods like IBLA [23] and ULAP [24] convert the obtained relative depth to absolute depth using a conversion coefficient (typically 8 or 9) determined through extensive experiments. However, in complex dynamic environments, it is difficult to accurately determine this coefficient, making it challenging for such methods to achieve accurate results.
In summary, constructing a monocular depth estimation model that can adapt to complex underwater environments and is robust to degraded images from various scenes, and then using its output depth map to drive a physical model for high-fidelity image restoration, is a promising yet challenging technical route.
To address the aforementioned challenges, this paper proposes an underwater image restoration method that integrates monocular depth estimation with a physical imaging model. The core contributions include:
  • To achieve deep integration of depth estimation and image restoration tasks, this paper constructs a depth estimation framework serving physical restoration. Unlike typical methods that treat the two as independent stages, this framework directly embeds the depth map predicted by the depth estimation network (UMdepth) as a core physical parameter to drive the inverse solution of the Akkaynak-Treibitz physical model. This design imposes indirect constraints and optimization on depth prediction through the physical consistency of the restoration task, achieving an organic unity of data-driven perception and physics-based restoration.
  • To address the bottleneck of model computational complexity in restoring images under weak lighting and extreme color casts, we propose a monocular depth estimation model named UMdepth. This model employs an efficient lightweight encoder–decoder, significantly reducing computational costs while maintaining performance comparable to existing methods. Through a selective skip-connection mechanism, effective alignment of multi-scale features between the encoder and decoder is achieved.
  • To enhance the model’s ability to extract depth features and obtain more accurate depth maps, a Channel–Spatial Hybrid Attention Mechanism (CSHAM) is proposed. Different from previous methods using only a single type of attention, this module employs a serial channel and spatial attention mechanism, enabling the model to adaptively focus on regions and channel features critical for depth estimation in underwater scenes.
  • To overcome the inherent scale ambiguity problem in monocular depth estimation, sparse depth priors are introduced. Unlike early fusion methods that connect depth priors with the degraded input image, we adopt a late fusion strategy that concatenates depth priors with the feature maps output by the encoder–decoder. This avoids mutual interference between original image texture information and depth cues at the early stage.
  • To achieve refined depth regression, an adaptive binning mechanism based on an improved Vision Transformer (m-ViT) is proposed. Unlike typical methods that treat depth estimation as absolute regression or use fixed bins, this mechanism dynamically predicts the depth range and bin widths for each input image based on its content via the m-ViT module. This approach allows the model to flexibly adapt to different scenes and varying underwater shooting distances and depth ranges.
The remainder of this paper is organized as follows: Section 2 briefly introduces the underwater imaging model and reviews related work on depth estimation and image restoration. Section 3 details the specific implementation of the proposed method. Comprehensive experimental validation is conducted in Section 4. Finally, conclusions and future research directions are given in Section 5.

2. Related Work

This section systematically reviews the fundamental theories and technical methods related to the core issue of underwater image restoration. First, the underwater physical imaging model is introduced, providing the theoretical basis for the proposed method. Subsequently, related research on typical underwater monocular depth estimation and image restoration is analyzed separately, aiming to clarify the technical positioning and innovation space of this study.

2.1. Underwater Imaging Model

Establishing an accurate underwater imaging model is the theoretical foundation for understanding image degradation mechanisms and designing restoration algorithms [25]. This subsection focuses on the physical degradation mechanism model of light propagation in water, which provides theoretical support for the subsequent physical restoration method [26].
The Akkaynak-Treibitz underwater imaging model [27] can be expressed as:
$$I_c = D_c + B_c = J_c\, e^{-\beta_c^D(z)\, z} + B_c^{\infty}\left(1 - e^{-\beta_c^B(z)\, z}\right)$$
where $c \in \{R, G, B\}$ denotes the RGB color channel, $I_c$ is the degraded image, $D_c$ is the direct attenuation component from the scene, and $B_c$ is the backscatter component; the two components are controlled by the attenuation coefficients $\beta_c^D$ and $\beta_c^B$, respectively. $J_c$ is the undegraded image, $B_c^{\infty}$ is the background light, and $z$ is the depth, i.e., the distance between the scene and the camera at each pixel.
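As a minimal sketch of how this model can be used in both directions, the following Python code synthesizes a degraded image from a clear one and then inverts the model, assuming the depth map and all coefficients are already known. The function names `degrade` and `restore`, and all numeric coefficient values, are hypothetical and chosen only for illustration. The sketch also treats $\beta_c^D$ and $\beta_c^B$ as per-channel constants, which is a simplification of the depth-dependent coefficients in the model.

```python
import numpy as np


def degrade(J, z, beta_D, beta_B, B_inf):
    """Forward model: I_c = J_c * exp(-beta_D_c * z) + B_inf_c * (1 - exp(-beta_B_c * z))."""
    z = z[..., None]  # (H, W, 1) so depth broadcasts over the 3 channels
    direct = J * np.exp(-beta_D * z)
    backscatter = B_inf * (1.0 - np.exp(-beta_B * z))
    return direct + backscatter


def restore(I, z, beta_D, beta_B, B_inf):
    """Inverse solution: remove backscatter, then undo the direct attenuation."""
    z = z[..., None]
    backscatter = B_inf * (1.0 - np.exp(-beta_B * z))
    J = (I - backscatter) * np.exp(beta_D * z)
    return np.clip(J, 0.0, 1.0)  # keep intensities in the valid [0, 1] range


rng = np.random.default_rng(0)
J = rng.uniform(0.1, 0.9, size=(4, 5, 3))   # clear image, values in [0, 1]
z = rng.uniform(0.5, 3.0, size=(4, 5))      # depth map in metres

# Hypothetical per-channel (R, G, B) coefficients, not taken from the paper.
beta_D = np.array([0.60, 0.20, 0.10])
beta_B = np.array([0.50, 0.25, 0.15])
B_inf = np.array([0.05, 0.30, 0.40])

I = degrade(J, z, beta_D, beta_B, B_inf)
J_hat = restore(I, z, beta_D, beta_B, B_inf)
```

With exact depth and coefficients, `J_hat` matches `J` up to floating-point error; in practice the quality of the inversion depends on how accurately $z$ and the coefficients are estimated.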

2.2. Underwater Monocular Depth Estimation and Image Restoration

This subsection systematically analyzes the research status and technical challenges of underwater monocular depth estimation and image restoration methods, aiming to reveal the limitations of existing methods in coping with complex underwater environments and provide a basis for subsequently proposing a solution that integrates depth estimation with physical models.
Existing methods for underwater image restoration are primarily based on the imaging characteristics under natural lighting conditions, making it difficult to adapt to complex underwater environments such as artificial illumination. In past research on underwater image restoration involving depth information, early methods performed approximate depth map estimation and image restoration through the Dark Channel Prior (DCP) [28], later giving rise to derivatives such as the Underwater Dark Channel Prior (UDCP) [29] and methods based on color channel differences. Additionally, there are typical methods such as IBLA [23] based on image blurriness and light absorption, ULAP [24] based on underwater light attenuation prior, methods based on the difference between bright and dark channels [30], and methods combining the red channel prior with the Maximum Intensity Prior (MIP) [31] to obtain depth maps. These methods show significant effectiveness under near-surface natural lighting conditions. However, they exhibit problems such as detail loss and poor color correction performance in complex environments dominated by artificial illumination, such as the deep sea.
With the development of deep learning technologies, some researchers have adopted a transfer learning approach, first training monocular depth estimation models on large-scale terrestrial image datasets [32] and then fine-tuning them on underwater image datasets. Under this research path, architectures based on self-supervised learning have shown significant potential. Among them, Monodepth2 [18] employs an encoder–decoder structure using a ResNet network for image feature extraction; Manydepth [20] achieves depth estimation by constructing reprojection and multi-view cost volumes. Another technical path focuses on structural innovation within the supervised learning framework. For example, AdaBins [33] improves depth estimation accuracy by transforming depth regression into a binning problem; LapDepth [21] enhances detail preservation capability by introducing a Laplacian pyramid on this basis. Furthermore, UDepth [22] enhances robustness to underwater degradation features by optimizing the feature extraction module of AdaBins; UWdepth [34] extends the frameworks of Monodepth2 and Manydepth by introducing a depth consistency loss to constrain the training process for more accurate depth estimation. Moreover, other innovative network architectures are under exploration, such as URSDEN [35] which designs positional attention and multi-dilated convolution depth-aware units; Scene-cGAN [36] builds a generator based on U-Net for depth estimation; Osmosis [37] utilizes the relatively new diffusion model. However, we find that they still have obvious limitations: most of these methods still fail to effectively solve the interference of severe underwater image degradation on depth feature extraction, ultimately resulting in limited generalization ability in complex real-world scenes.

3. Method

To address the challenges of restoring images under weak lighting and extreme degradation in complex underwater environments, this paper proposes a restoration method that deeply integrates high-precision depth estimation with an interpretable physical imaging model. An end-to-end underwater image restoration system is constructed to achieve deep integration of the depth estimation and image restoration tasks. Different from traditional methods, the restoration-oriented depth estimation optimization mechanism in this work ensures both the accuracy of depth estimation and the physical rationality of the restoration process. A lightweight encoder–decoder module is designed to significantly reduce computational costs, while a hybrid attention mechanism is introduced to enhance the model’s feature extraction capability. Furthermore, depth priors are incorporated to overcome scale ambiguity. This section details the overall design philosophy of the proposed method, its differences from previous approaches, and the specifics of its core modules.

3.1. Model Architecture and Overall Framework

This subsection elaborates on the proposed model architecture and the overall framework, including its working principles and specific workflow.
(1) Model architecture
As shown in Figure 1, the proposed method consists of a feature extraction module based on an encoder–decoder with an attention mechanism, a depth estimation optimization module based on m-Vision Transformer and convolutional regression, and a physical restoration module.
(2) General idea of the proposed method
The starting point of this method is to establish a deep integration between data-driven perception and physical model constraints. The overall framework is as follows: First, through an improved encoder–decoder architecture and a channel–spatial hybrid attention mechanism, the expression of key features insensitive to degradation is adaptively enhanced. Then, an m-Vision Transformer module combining sparse depth priors and adaptive binning is designed to alleviate scale ambiguity and improve cross-domain generalization. Finally, the predicted high-quality depth map is embedded as a reliable parameter into the Akkaynak-Treibitz imaging model to achieve physics-driven, high-fidelity image restoration. The proposed method differs from AdaBins and UDepth in three respects. First, we use a more lightweight encoder–decoder module, significantly reducing the parameter count, and whereas they employ basic symmetric skip connections, we achieve effective multi-scale feature alignment through selective connections. Second, they lack targeted attention mechanisms and depth priors, whereas our Channel–Spatial Hybrid Attention Module (CSHAM) optimizes features along both the channel and spatial dimensions. Third, their binning methods use fixed ranges, while our adaptive range prediction adjusts dynamically based on image content.

3.2. Encoder–Decoder Module

The encoder–decoder module performs efficient feature extraction from underwater images, enhancing the extraction of key underwater features while maintaining model efficiency, and focuses on solving the problems of high computational complexity and insufficient extraction of underwater degradation features.
The encoder gradually transforms the input image into a low-dimensional feature representation. We replace the complex encoder module in AdaBins with a more lightweight and efficient encoder module based on MobileNetV3-small [38], which is better suited to scenarios with limited storage space and power consumption. After downsampling, the encoder outputs high-level semantic features with a spatial size of 20 × 15 and 576 channels.
In the decoder design, a decoding path consisting of 6 consecutive upsampling modules is constructed, which receives the feature representation from the encoder. Each module contains a 3 × 3 transposed convolution, a ReLU activation function, and a CSHAM attention module. Through a carefully designed skip connection mechanism, the feature maps are channel-wise concatenated with the corresponding scale feature maps from the encoder (where each decoder layer connects to the 10th, 8th, 5th, 3rd, 2nd, and 1st layers of the encoder, respectively), forming a multi-level feature pyramid structure. This ensures the effective fusion of spatial details and feature information to extract richer underwater image degradation features. Compared to previous symmetric structures, this selective connection helps utilize features from different levels in the encoder more effectively, avoiding potential information redundancy or interference from simple connections.
Unlike methods such as UDepth that employ large networks like ResNet as the encoder, this paper uses MobileNetV3-small optimized for mobile deployment, significantly reducing the number of encoder parameters while maintaining feature extraction capability. Moreover, compared to the simple skip connections in U-Net, the carefully designed feature pyramid connection strategy achieves effective alignment of multi-scale features between the encoder and decoder.

3.3. Channel–Spatial Hybrid Attention

This module (CSHAM) adaptively calibrates feature responses through a dual attention mechanism, enhancing the model’s ability to extract key underwater features.
As shown in Figure 2, CSHAM integrates channel attention and spatial attention, improving the model’s feature extraction capability by separately modeling attention in the channel and spatial dimensions. The CSHAM module retains the serial architecture of CBAM [39] while making significant improvements to its sub-modules. To address the issue of increased parameter count caused by using a shared MLP in the channel attention module, we replace its shared MLP structure with two convolutional operations to reduce computational complexity. Specifically, after performing global average pooling and global max pooling on the input features separately, each is processed through a sub-network containing two 1 × 1 convolutional layers (with ReLU activation in between). The outputs from the two branches are then summed and passed through a Sigmoid function to generate the channel weights. Its structure is shown in Figure 3.
For the spatial attention module, we adjust the convolutional kernel size from 7 × 7 to 5 × 5. This adjustment is based on the following considerations: First, in deeper networks where feature map resolution has been significantly reduced through multiple downsampling steps, a 5 × 5 convolution can already cover a sufficient contextual area to capture spatial dependencies. Second, reducing the kernel size from 7 × 7 to 5 × 5 in this scenario results in a limited reduction in the receptive field but a significant decrease in computational complexity. Finally, a smaller kernel helps alleviate overfitting and enhances the module’s generalization capability. This sub-module also concatenates the results of average pooling and max pooling along the channel dimension, then generates spatial weights through 5 × 5 convolution and Sigmoid activation. Its structure is shown in Figure 4.
Unlike attention mechanisms that employ only a single dimension, such as URSDEN [35], CSHAM’s dual-path attention mechanism can optimize features simultaneously from both channel and spatial dimensions. Furthermore, we embed CSHAM into each upsampling layer of the decoder, differing from methods that use it only in the deepest layers, achieving multi-level, full-process attention guidance and enhancing the perception capability for underwater features at different scales.
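The serial channel-then-spatial structure described above can be sketched as a plain NumPy forward pass. This is an illustrative sketch, not the authors' implementation: the function names, the hidden width `C_mid`, the random weights, and the choice to share the two 1 × 1 convolutions between the average-pooled and max-pooled branches are all assumptions, since the text does not specify them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def channel_attention(x, w1, w2):
    """x: (C, H, W) features; w1: (C_mid, C) and w2: (C, C_mid) are the two
    1x1 convolutions. On a globally pooled (C,) vector, a 1x1 convolution
    reduces to a matrix-vector product."""
    def subnet(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # conv1x1 -> ReLU -> conv1x1

    avg_pooled = x.mean(axis=(1, 2))
    max_pooled = x.max(axis=(1, 2))
    weights = sigmoid(subnet(avg_pooled) + subnet(max_pooled))  # (C,)
    return x * weights[:, None, None]


def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) weights of the k x k convolution
    (k = 5 in the text)."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    # windows has shape (2, H, W, k, k): one k x k patch per output pixel.
    windows = sliding_window_view(padded, kernel.shape[1:], axis=(1, 2))
    weights = sigmoid(np.einsum("chwij,cij->hw", windows, kernel))  # (H, W)
    return x * weights[None, :, :]


def csham(x, w1, w2, kernel):
    """Serial arrangement: channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)


rng = np.random.default_rng(0)
C, H, W, C_mid = 8, 6, 7, 4           # hypothetical sizes for the demo
x = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C_mid, C))      # random stand-ins for learned weights
w2 = rng.normal(size=(C, C_mid))
kernel = rng.normal(size=(2, 5, 5))
out = csham(x, w1, w2, kernel)
```

Because both attention maps pass through a Sigmoid, each output value is the input value scaled by a factor between 0 and 1, so the module reweights features without changing their shape.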

3.4. Depth Prior Parameterization

To overcome the inherent scale ambiguity in monocular depth estimation, we extract sparse depth priors to provide geometric consistency constraints for the network.
The specific implementation process is as follows: First, SIFT keypoint extraction [40] is performed independently within equally sized image blocks to ensure a uniform distribution of feature points across the image. Considering that underwater low-texture or turbid scenes may lead to an insufficient number of SIFT feature points, we introduce a feature point quantity threshold judgment and random supplementation mechanism. A target number of at least 200 feature points is set. When the number of actual SIFT-extracted feature points is less than 200, a sufficient number of pixel positions are randomly selected from the image’s depth map, and their depth values are used as supplemented feature points. These randomly sampled points are uniformly distributed across the entire image plane and merged with the original SIFT feature points to form a complete set of 200 feature points. This design ensures that the depth prior information is always input to the network in a fixed-dimensional and spatially uniform form under any circumstances. However, the number of feature points varies from image to image, so they need to be transformed into a fixed and appropriate parameterized form for network input. Referring to the method in [41], we convert them into two mappings, $M_1(x, y)$ and $M_2(x, y)$, matching the model resolution (320 × 240). $M_1(x, y)$ is a full-size dense depth map obtained by nearest-neighbor interpolation of the sparse depth feature points. For each pixel $(x, y)$, its depth value $d_p$ is taken from the nearest feature point $p(x_i, y_j)$. The calculation formula for $M_1$ is:
$$M_1(x, y) = d_p, \quad p = \arg\min_{(i, j)} \left\| (x, y) - (x_i, y_j) \right\|_2$$
$M_2(x, y)$ is a continuous probability map that converts the distance from each pixel to its nearest feature point into a probability value. Since this probability should decrease as the distance increases, it is modeled with a Gaussian function of the distance. The calculation formula for $M_2(x, y)$ is:
$$M_2(x, y) = \frac{1}{\delta \sqrt{2\pi}} \exp\left( -\frac{\min_{(i, j)} \left\| (x, y) - (x_i, y_j) \right\|_2^2}{2\delta^2} \right)$$
where $\delta$ is the standard deviation of the Gaussian.
Unlike early fusion methods that concatenate the sparse depth map and the RGB image as network input, we concatenate the depth prior with the feature maps output by the encoder–decoder as input to m-ViT. This late fusion strategy avoids mutual interference between original image texture information and depth cues at the early stage.
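The supplementation step and the two prior maps can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the SIFT keypoints are replaced by a hand-written array, the depth of every point is looked up in a stand-in depth map, the value `delta=5.0` is hypothetical, and a small demo resolution is used because the brute-force distance matrix grows with the number of pixels times the number of points.

```python
import numpy as np


def supplement_points(points, shape, target=200, rng=None):
    """Pad a (K, 2) array of integer (x, y) keypoints with uniformly random
    pixel positions until it holds `target` points."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    missing = target - len(points)
    if missing <= 0:
        return points
    extra = np.column_stack([rng.integers(0, w, missing),
                             rng.integers(0, h, missing)])
    return np.vstack([points, extra])


def prior_maps(points, depths, shape, delta=5.0):
    """Return M1 (nearest-neighbour depth) and M2 (Gaussian of the distance
    to the nearest feature point), both of size (H, W)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.column_stack([xs.ravel(), ys.ravel()])          # (H*W, 2)
    # Squared Euclidean distance from every pixel to every feature point.
    sq_dist = ((pixels[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    nearest = sq_dist.argmin(axis=1)
    m1 = depths[nearest].reshape(h, w)
    min_sq = sq_dist.min(axis=1).reshape(h, w)
    m2 = np.exp(-min_sq / (2.0 * delta**2)) / (delta * np.sqrt(2.0 * np.pi))
    return m1, m2


rng = np.random.default_rng(0)
shape = (24, 32)                                       # small demo size (H, W)
depth_map = rng.uniform(0.5, 5.0, size=shape)          # stand-in for known depths
sift_points = np.array([[3, 4], [20, 10], [30, 22]])   # pretend SIFT output, (x, y)

points = supplement_points(sift_points, shape, target=200, rng=rng)
depths = depth_map[points[:, 1], points[:, 0]]         # index as [row=y, col=x]
m1, m2 = prior_maps(points, depths, shape)
```

At each feature point the distance is zero, so `m1` reproduces the point's depth and `m2` reaches its maximum value of $1/(\delta\sqrt{2\pi})$. For the full 320 × 240 resolution, a spatial index such as a k-d tree would likely be preferable to the dense distance matrix used here.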

3.5. m-Vision Transformer

m-ViT aims to utilize global contextual information to perform depth optimization on the multi-resolution features provided by the encoder–decoder. It transforms depth estimation from a regression problem of absolute values into a structured probability distribution learning problem through an adaptive depth binning mechanism.
The structure of m-ViT is shown in Figure 5. AdaBins introduced a lightweight version of the ViT [42], which was originally designed for image recognition, and we modify it further. First, the feature maps from the encoder–decoder and the depth prior parameterization information are processed through a convolutional block to form a sequence of patch embeddings, which serve as input to the transformer encoder. The output of the encoder passes through a 1 × 1 convolutional kernel and performs a dot product operation with the feature map processed by a 3 × 3 convolution, generating a range attention map. One output of the encoder is fed into a multi-layer perceptron (MLP) head, which outputs $n$ bin widths and the estimated depth range $r$ for the current image, rather than having a fixed depth range determined by dataset specifications or manually set to a reasonable range as in AdaBins. To ensure all bin widths are positive, we add a very small positive number $\tau$ to all bin widths. The bin widths are then normalized and multiplied by the estimated depth range to obtain the final bin width $b_i$, allowing the bin division to dynamically adjust according to the actual depth range of each image:
$$b_i = r \cdot \frac{\tilde{b}_i + \tau}{\sum_{j=1}^{n} \left( \tilde{b}_j + \tau \right)}$$
where $\tilde{b}_i$ is the bin width before correction, $i = 1, 2, \ldots, n$, and the value of $\tau$ is set to $10^{-3}$.
Compared to AdaBins, predicting the unique depth range r for each image enables the model to adaptively adjust the depth bin division based on the input image. This improvement not only retains the advantage of global context modeling but also enhances the model’s robustness and prediction accuracy when facing inputs with distributions different from the training data, especially in complex real-world scenarios with unknown depth scales.
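The bin-width normalization can be illustrated with a short NumPy sketch. The function name `adaptive_bins`, the minimum depth `d_min`, and the use of bin midpoints as centers are assumptions made for this example; the raw widths are taken to be non-negative, as they would be after a ReLU.

```python
import numpy as np


def adaptive_bins(raw_widths, depth_range, tau=1e-3, d_min=0.0):
    """Normalize raw (non-negative) bin widths and scale them by the
    predicted depth range r, then derive bin centers as midpoints."""
    shifted = raw_widths + tau                      # guarantees positive widths
    widths = depth_range * shifted / shifted.sum()  # b_i, summing to r
    edges = d_min + np.concatenate([[0.0], np.cumsum(widths)])
    centers = 0.5 * (edges[:-1] + edges[1:])        # assumed: midpoint of each bin
    return widths, centers


raw = np.array([0.0, 1.0, 3.0])   # hypothetical MLP-head outputs
widths, centers = adaptive_bins(raw, depth_range=10.0)
```

Even the bin whose raw width is zero receives a small positive width thanks to $\tau$, and the widths always sum to the predicted range, so a different $r$ rescales the whole partition without changing its proportions.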

3.6. Convolutional Regression

The core objective of this module is to transform the high-level features output by m-ViT into a high-precision dense depth map. The proposed method abandons the traditional paradigm of directly regressing depth values, adopting a more effective “discrete-to-continuous” strategy. Depth prediction is framed as a probability distribution estimation problem based on adaptive depth bins.
The convolutional regression module in this paper employs a lightweight architecture to achieve refined depth value regression. First, a simple 1 × 1 convolutional layer is used to transform the channel dimension of the attention map, converting its channel count to the preset number of depth bins $n$, generating an $n$-dimensional score vector for each pixel, where each score corresponds to the confidence of belonging to a depth bin. This score vector is then normalized via the Softmax activation function, outputting a probability distribution for each pixel:
$$P(x, y) = (p_1, p_2, \ldots, p_n)$$
where $p_i(x, y)$ represents the probability that the depth value of the pixel falls into the $i$-th depth bin. Finally, the final depth value is obtained by calculating the weighted sum of the center value $c_i$ of each depth bin and the corresponding probability of the pixel:
$$\hat{d}(x, y) = \sum_{i=1}^{n} p_i(x, y)\, c_i$$
Compared to methods that directly regress depth values, the advantage of this convolutional regression mechanism lies in transforming the continuous regression problem into a prediction task with structural constraints. The network not only needs to determine which depth range a pixel most likely belongs to but also needs to achieve fine-grained depth value determination through probability weighting. This design encourages the network to learn more discriminative features.
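The Softmax normalization and probability-weighted sum can be sketched in NumPy as below. The function name `depth_from_bins` and the example scores and centers are hypothetical; the scores stand in for the output of the 1 × 1 convolution.

```python
import numpy as np


def depth_from_bins(scores, centers):
    """scores: (n, H, W) per-pixel bin scores; centers: (n,) bin centers.
    Returns the expected depth per pixel and the bin probabilities."""
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)            # Softmax over bins
    depth = np.einsum("n,nhw->hw", centers, probs)       # sum_i p_i * c_i
    return depth, probs


centers = np.array([1.0, 2.0, 4.0])
scores = np.zeros((3, 2, 2))
scores[2, 0, 0] = 50.0   # pixel (0, 0) is very confident about the last bin
depth, probs = depth_from_bins(scores, centers)
```

A pixel with one dominant score gets a depth close to that bin's center, while a pixel with equal scores gets the mean of all centers, which shows how the weighted sum yields continuous depth values from a discrete set of bins.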

3.7. Loss Functions

During training, we adopt a strategy combining multiple loss functions. Specifically, we incorporate the Root Mean Square Error ($L_{RMSE}$), BerHu loss ($L_{BerHu}$), Scale-Invariant Logarithmic loss ($L_{SILog}$), and Chamfer distance loss ($L_{CD}$), imposing constraints on the predicted depth map from multiple perspectives such as pixel-level accuracy, scale invariance, and spatial consistency.
RMSE serves as the fundamental regression loss, penalizing large errors more heavily through the squared term, ensuring the global numerical accuracy of depth prediction. Its calculation formula is:
$$L_{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{d}_i - d_i)^2}$$
where $\hat{d}_i$ and $d_i$ represent the predicted depth and ground truth depth of the $i$-th pixel, respectively, and $n$ is the number of valid pixels.
The BerHu loss maintains the stability of the L1 loss when the error is small and inherits the strong gradient property of the L2 loss when the error is large. Its formula is as follows:
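$$L_{BerHu} = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} |\hat{d}_i - d_i|, & \text{if } |\hat{d}_i - d_i| \le c \\ \dfrac{(\hat{d}_i - d_i)^2 + c^2}{2c}, & \text{otherwise} \end{cases}$$
where the threshold is set to $c = 0.2 \times \max_i |\hat{d}_i - d_i|$.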
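A minimal sketch of the BerHu loss on flat lists of depth values follows. It assumes the standard reverse-Huber form (L1 below the threshold, a scaled L2 above it) with the threshold set to 0.2 times the largest absolute error; the function name `berhu_loss` is illustrative.

```python
def berhu_loss(pred, target):
    """Reverse Huber (BerHu) loss averaged over all valid pixels."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    c = 0.2 * max(errors)          # adaptive threshold from the current batch
    if c == 0.0:                   # perfect prediction: avoid dividing by zero
        return 0.0
    total = 0.0
    for e in errors:
        if e <= c:
            total += e                            # L1 branch for small errors
        else:
            total += (e * e + c * c) / (2.0 * c)  # L2-like branch for large errors
    return total / len(errors)

# Errors are [0, 0.1, 1.0], so c = 0.2 and the last pixel uses the L2 branch.
print(berhu_loss([1.0, 2.0, 3.0], [1.0, 2.1, 4.0]))
```

The two branches meet at an error of exactly c, so the loss is continuous at the threshold.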
The Scale-Invariant Logarithmic loss addresses the scale ambiguity issue in depth estimation by transforming the optimization target from absolute depth values to the relative scale relationship of depths through logarithmic transformation:
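$$L_{SILog} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \tilde{d}_i^{\,2} - \frac{\lambda}{n^2} \left( \sum_{i=1}^{n} \tilde{d}_i \right)^2}$$
where $\tilde{d}_i = \log d_i - \log \hat{d}_i$, and $\lambda$ is set to 0.85.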
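The sketch below illustrates the scale-invariant behaviour of this loss. It assumes the commonly used square-root form of the scale-invariant logarithmic loss (as in AdaBins-style training); whether the paper's formula includes the square root or an extra scaling constant cannot be confirmed from the text, so this should be read as an illustration only.

```python
import math

def silog_loss(pred, target, lam=0.85):
    """Scale-invariant log loss (assumed square-root form) over valid pixels."""
    g = [math.log(t) - math.log(p) for p, t in zip(pred, target)]
    n = len(g)
    mean_sq = sum(x * x for x in g) / n   # (1/n) * sum of squared log differences
    sq_mean = (sum(g) / n) ** 2           # (1/n^2) * (sum of log differences)^2
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * sq_mean, 0.0))

# A prediction that is exactly twice the ground truth everywhere.
pred, target = [2.0, 4.0], [1.0, 2.0]
print(silog_loss(pred, target, lam=1.0))   # global scale error fully ignored
print(silog_loss(pred, target, lam=0.85))  # global scale error partly penalized
```

With λ = 1 a uniform scale error costs nothing, while λ = 0.85 retains a small penalty on absolute scale.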
The Chamfer distance loss is considered a regularization term, used to measure the matching degree between the distribution of predicted depth bin centers and the true depth values. Its calculation formula is:
$$L_{CD} = \sum_{b \in B} \min_{d \in D} \lVert b - d \rVert_2^2 + \sum_{d \in D} \min_{b \in B} \lVert b - d \rVert_2^2$$
where $B$ represents the set of predicted depth bin centers, and $D$ represents the set of true depth values.
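For scalar depths the bidirectional Chamfer term reduces to nearest-neighbour squared differences, as in the brute-force sketch below. The function name and the toy inputs are illustrative; a practical implementation would vectorize the pairwise distances.

```python
def chamfer_distance(bin_centers, depths):
    """Bidirectional Chamfer distance between bin centers and depth values."""
    # Each bin center is matched to its nearest ground-truth depth ...
    term_b = sum(min((b - d) ** 2 for d in depths) for b in bin_centers)
    # ... and each ground-truth depth to its nearest bin center.
    term_d = sum(min((b - d) ** 2 for b in bin_centers) for d in depths)
    return term_b + term_d

# The depth 2.0 has no nearby bin center, so it contributes (2 - 1)^2 = 1.
print(chamfer_distance([1.0, 3.0], [1.0, 2.0, 3.0]))
```

The second term is what pulls bin centers toward depth ranges that actually occur in the image.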
Finally, the total loss is defined as a linear combination of the above losses:
$$L = (1 - \alpha) \, \frac{L_{RMSE} + L_{BerHu} + L_{CD}}{3} + \alpha \, L_{SILog}$$
To determine the value of the balancing coefficient $\alpha$, we conducted a sensitivity analysis by comparing model performance with different $\alpha$ values. It was ultimately determined that setting $\alpha = 0.5$ achieves the best balance on the evaluation metrics. When $\alpha$ is too small, the contribution of the scale-invariant logarithmic loss is insufficient, leading to significant scale drift during cross-scene generalization; when $\alpha$ is too large, the influence of the pixel-level accuracy losses is weakened, resulting in decreased depth estimation accuracy for local details.
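The weighting can be made concrete with a short sketch that combines four already-computed loss values. The numeric inputs are hypothetical and the function name `total_loss` is illustrative.

```python
def total_loss(l_rmse, l_berhu, l_cd, l_silog, alpha=0.5):
    """Blend the mean of the three pixel-level losses with the SILog loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pixel_term = (l_rmse + l_berhu + l_cd) / 3.0
    return (1.0 - alpha) * pixel_term + alpha * l_silog

# Hypothetical loss values for one training step.
print(total_loss(0.3, 0.6, 0.9, 0.2))             # alpha = 0.5
print(total_loss(0.3, 0.6, 0.9, 0.2, alpha=1.0))  # only the SILog term remains
```

At α = 0.5 the averaged pixel-level term and the scale-invariant term contribute equally.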

3.8. Image Restoration

The estimated depth map is a key input to the physical imaging model. Once it is available, we use it to drive the model for underwater image restoration.
The Akkaynak-Treibitz physical imaging model is shown in Equation (1) earlier. Following the research of Akkaynak and Treibitz, we acquire its key parameters. Based on this, the depth information becomes the only unknown variable for restoring the clear image, thus transforming the complex underwater image restoration task into an accurate depth estimation task. Akkaynak and Treibitz used a structure-from-motion (SfM) 3D reconstruction method to obtain the scene depth map from multiple images. In contrast, through the depth estimation network described in this paper, we can obtain a more accurate scene depth map from a single image, thereby enabling efficient and reliable underwater image restoration based on the physical model.
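The inverse solution can be sketched per pixel and per color channel as follows. The sketch assumes Equation (1) takes the commonly cited form $I_c = J_c e^{-\beta_c^D z} + B_c^{\infty}(1 - e^{-\beta_c^B z})$ and treats the coefficients as constants, which simplifies the full model where they vary with range and other factors. All coefficient values are hypothetical.

```python
import math

def degrade_pixel(j, depth, beta_d, beta_b, b_inf):
    """Forward model: attenuated direct signal plus backscatter."""
    direct = j * math.exp(-beta_d * depth)
    backscatter = b_inf * (1.0 - math.exp(-beta_b * depth))
    return direct + backscatter

def restore_pixel(intensity, depth, beta_d, beta_b, b_inf):
    """Inverse model: remove backscatter, then undo attenuation."""
    backscatter = b_inf * (1.0 - math.exp(-beta_b * depth))
    restored = (intensity - backscatter) * math.exp(beta_d * depth)
    return min(max(restored, 0.0), 1.0)  # keep the result in the valid range

# Hypothetical single-channel coefficients and a pixel 3 m from the camera.
params = dict(beta_d=0.2, beta_b=0.3, b_inf=0.25)
observed = degrade_pixel(0.6, 3.0, **params)
print(round(restore_pixel(observed, 3.0, **params), 6))
```

The exponential factor grows with depth, so any error in the estimated depth is amplified in distant regions, which is why depth accuracy dominates restoration quality.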

4. Experiments

To comprehensively validate the effectiveness of the proposed model in depth estimation and underwater image restoration, this section designs a detailed experimental plan and conducts benchmark tests against current mainstream methods. The plan includes an overview of datasets, model implementation details, presentation of experimental results, subjective and objective evaluations, and a series of ablation experiments to verify the effectiveness of each module.

4.1. Datasets and Implementation Details

This subsection details the datasets used to validate the effectiveness of the proposed method, as well as specific implementation details, including the experimental environment and model parameters.
(1) Datasets
We utilize the FLSea dataset [43] for model training. This dataset provides approximately 20,000 degraded images from 12 different underwater locations across two distinct sea areas, along with corresponding depth maps and clear reference images. We use 15,000 of these images as the training set, and the remaining data not used for training serves as the test set.
The Seathru [27] dataset is divided into 5 scenes based on different underwater imaging characteristics. It contains 1100 underwater images with corresponding depth maps, primarily consisting of low-light, dim images with a greenish hue.
The UIEB [44] dataset contains 890 original underwater images and their high-quality reference images. UIEB-S is a subset of 200 images we curated from UIEB, featuring extremely severe color casts, posing a high demand on the model’s color correction capability.
HURLA is an open-source deep-sea image dataset containing a large number of images of marine relics and underwater species captured under artificial lighting. We select 200 images for evaluation based on different scenes and object types.
Together, these datasets cover a wide range of challenges in underwater image restoration, including scenarios with different color casts, lighting conditions, objects, and water bodies. We use the FLSea test set to gauge the model's in-distribution performance; the FLSea training/test split described above is drawn at random. Seathru, UIEB, UIEB-S, and HURLA serve as completely independent cross-domain test sets to evaluate the model's generalization ability, and they take no part in training or validation. Given the large dataset size and the primary focus on cross-domain evaluation, we hold out 20% of the FLSea training set as a validation set for hyperparameter tuning (an 80/20 split). The final reported results are based on the model trained on the complete training set and evaluated on the independent test sets. The objective metrics for all compared methods are computed on the exact same data splits to ensure a fair comparison.
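The splitting protocol can be sketched with index lists as below. The total of 20,000 images is a round-number assumption (the dataset is described only as "approximately 20,000" images), and reusing the seed 42 for the split is also an assumption made here for illustration.

```python
import random

def split_indices(n_total, n_train, val_fraction=0.2, seed=42):
    """Random train/validation/test split over image indices."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    train_pool, test = indices[:n_train], indices[n_train:]
    n_val = int(round(len(train_pool) * val_fraction))
    val, train = train_pool[:n_val], train_pool[n_val:]
    return train, val, test

# Assumed dataset size of 20,000 with 15,000 images in the training pool.
train, val, test = split_indices(20000, 15000)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across runs, which is what allows all compared methods to be evaluated on the same images.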
(2) Implementation details
Our network model is implemented using the PyTorch framework (version 2.4.1) and trained and tested on an Intel Core i5-13600KF CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4070 GPU. Model parameters are initialized using Kaiming initialization, with a total parameter count of approximately 11.9 M, classifying it as a lightweight model. All images undergo standardized preprocessing and are resized to a resolution of 640 × 480. No data augmentation is used during training. We employ the AdamW optimizer with parameters set to betas = (0.9, 0.999) and weight_decay = $1 \times 10^{-4}$, and set the initial learning rate to $1 \times 10^{-4}$. The model is optimized with a batch size of 6 for a total of 50 epochs. After each epoch, PSNR is computed on the validation set, and the best model is saved (early stopping is not used). The random seed is fixed to 42 to ensure reproducibility. During testing, image preprocessing is consistent with training, and the output images are directly used as the final results.

4.2. Depth Estimation Comparative Experiments

This subsection provides a comprehensive evaluation of the effectiveness of the proposed depth estimation method.
The depth estimation evaluation experiments are conducted on the FLSea and Seathru datasets. These two datasets provide reliable reference depth maps, enabling the calculation of objective metrics. They also encompass multiple challenges such as different lighting conditions, depth scenes, and water body environments. We compare our method with several currently recognized strong models: Monodepth2 [18], Manydepth [20], LapDepth [21], and UDepth [22]. The experimental results are shown in Figure 6.
Samples A–C in Figure 6 are from the FLSea dataset, and samples D–F are from the Seathru dataset. These images cover various typical underwater visual environments, such as different lighting, different color casts, and different distances. Sample A represents a seabed environment under natural light near the surface. Samples B and D can verify depth estimation consistency for differently colored objects at the same position under bright and low-light conditions. Sample C contains environments at various distances with noticeable haze. Samples E and F show spherical coral reefs from different angles under low-light conditions.
(1) Subjective evaluation
To comprehensively assess the effectiveness of the proposed depth estimation model, we conduct a subjective evaluation by comparing the visual effects of our method with other mainstream methods from the perspectives of lighting conditions, target distance, and color consistency.
From the perspective of lighting conditions, Monodepth2 exhibits a layered structure in continuous depth regions under low light (e.g., Figure 6D–F). Manydepth shows large-area blurring in highlight regions (e.g., Figure 6A,C) and loses depth details under low light, making it difficult to distinguish the distance levels of different objects. In contrast, our method maintains clear object contours and reasonable depth level distributions under different lighting conditions.
From the perspective of different target distances, in multi-plane scenes (e.g., Figure 6C) and curved surface scenes (e.g., Figure 6E,F), LapDepth incorrectly estimates foreground objects with colors similar to the background as background, leading to confusion in depth levels. UDepth suffers from depth value overestimation, with depth values in most areas being larger than the ground truth. In contrast, our method provides results closer to the ground truth depth map both in multi-plane scenes and from different directions of the spherical coral reefs, accurately reflecting the hierarchical structure and surface geometry of the targets.
From the perspective of depth consistency for differently colored objects at the same distance, as shown in Figure 6B,D, for white lines and black-and-white squares located at the same position as the seabed, other methods estimate significantly different depth values for these areas compared to their surrounding areas at the same depth, which is inconsistent with reality. Our method exhibits better depth consistency in these areas, assigning similar depth estimates to differently colored objects at the same distance, aligning more closely with the true 3D structure of the scene.
In summary, regarding subjective evaluation, compared to mainstream methods, our method performs better in terms of lighting conditions, target distance, and color consistency. The overall visual effect is significantly superior to the compared algorithms, fully demonstrating the comprehensive performance of our method.
(2) Objective evaluation
For objective evaluation, we conduct a comprehensive comparison using the Absolute Relative Error (AbsRel), Squared Relative Error (SqRel), Root Mean Squared Error (RMSE), and δ-threshold accuracy metrics [45,46], providing more convincing data support for our proposed method. Lower values of AbsRel, SqRel, and RMSE indicate smaller errors between the model-predicted depth values and the ground truth, signifying better model performance. The accuracy metrics represent the proportion of pixels whose ratio of predicted depth to ground truth depth falls within a certain threshold. Higher values indicate more accurate prediction results. In this paper, the thresholds for δ1, δ2, and δ3 are set to 1.5, $1.5^2$, and $1.5^3$, respectively.
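These metrics can be computed as in the sketch below. It follows the standard definitions from the monocular depth literature, uses the symmetric ratio max(pred/gt, gt/pred) for the accuracy metrics, and assumes the thresholds are the first three powers of 1.5; these choices are assumptions rather than details confirmed by the text.

```python
import math

def depth_metrics(pred, target, base=1.5):
    """AbsRel, SqRel, RMSE and threshold accuracies over valid pixels."""
    n = len(pred)
    pairs = list(zip(pred, target))
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    sq_rel = sum((p - t) ** 2 / t for p, t in pairs) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    # Symmetric ratio, so over- and under-estimation are treated alike.
    ratios = [max(p / t, t / p) for p, t in pairs]
    d1, d2, d3 = (sum(r < base ** k for r in ratios) / n for k in (1, 2, 3))
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "delta1": d1, "delta2": d2, "delta3": d3}

# Toy example: ratios are 1, 2 and 4 for the three pixels.
m = depth_metrics([1.0, 2.0, 8.0], [1.0, 1.0, 2.0])
print(m)
```

In the toy example only the first pixel passes the δ1 threshold, the second additionally passes δ2, and the third fails all three.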
Table 1 presents the quantitative comparison results of different models on the FLSea dataset. We use bold font to indicate the best results and underline to indicate the second-best results.
From the comparison results on the FLSea dataset, our model achieves the best values on all error metrics. Specifically, AbsRel (0.747) is about 6.4% lower than the second-best Monodepth2 (0.798), SqRel (0.272) is about 19.8% lower than the second-best Monodepth2 (0.339), and RMSE (0.228) is about 7.7% lower than the second-best Monodepth2 (0.247). This indicates that our model’s predicted depth is closer to the ground truth with smaller errors. Among other models, Monodepth2 performs relatively well on error metrics, while Manydepth and UDepth have larger errors, especially UDepth with an AbsRel as high as 1.355. Our model also achieves the highest scores on all threshold accuracy metrics. Specifically, δ1 (0.543) is about 13.6% higher than the second-best LapDepth (0.478), δ2 (0.756) is slightly higher than the second-best LapDepth (0.749), and δ3 (0.884) is slightly higher than the second-best LapDepth (0.875). This indicates that our model performs better in terms of the accuracy and consistency of depth map prediction.
Table 2 presents the quantitative comparison results of different models on the Seathru dataset.
Our model still achieves the best values on all error metrics on the Seathru dataset. Specifically, AbsRel (0.784) is about 19.9% lower than the second-best Monodepth2 (0.979), SqRel (0.277) is about 29.9% lower than the second-best Monodepth2 (0.395), and RMSE (0.183) is about 38.6% lower than the second-best LapDepth (0.298). This indicates that our model achieves more significant error reduction on the Seathru dataset, especially on RMSE, where the deviation between predicted depth and ground truth is greatly reduced. Monodepth2 performs relatively well on AbsRel and SqRel but has a higher RMSE; LapDepth performs relatively well on RMSE but has higher AbsRel and SqRel. Our model also achieves the highest scores on all accuracy metrics. Specifically, δ1 (0.602) is about 23.9% higher than the second-best LapDepth (0.486), δ2 (0.829) is about 16.6% higher than the second-best LapDepth (0.711), and δ3 (0.918) is about 9.2% higher than the second-best LapDepth (0.841). This indicates a substantial improvement in the accuracy of depth map prediction by our model, bringing it closer to the true depth values. Although LapDepth performs relatively well on accuracy metrics, it is still far below our model.
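The percentages quoted for Tables 1 and 2 are relative changes with respect to the second-best method. The short check below reproduces several of them from the values stated in the text; the helper names are illustrative.

```python
def relative_reduction(ours, other):
    """Percentage by which an error metric is lower than the competitor's."""
    return (other - ours) / other * 100.0

def relative_gain(ours, other):
    """Percentage by which an accuracy metric is higher than the competitor's."""
    return (ours - other) / other * 100.0

# FLSea: AbsRel and SqRel against Monodepth2.
print(round(relative_reduction(0.747, 0.798), 1))
print(round(relative_reduction(0.272, 0.339), 1))
# Seathru: RMSE and delta1 against LapDepth.
print(round(relative_reduction(0.183, 0.298), 1))
print(round(relative_gain(0.602, 0.486), 1))
```

Note that these are relative differences, not differences in percentage points, so a small absolute gap on a small baseline value can yield a large percentage.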
Our depth estimation model consistently outperforms the other comparative models across all reported metrics on both datasets. This indicates that our method adapts well to complex underwater environments and mitigates the interference of underwater image degradation with depth cue extraction. The stable performance across two significantly different datasets, along with the notable improvements in both error metrics and accuracy metrics, suggests good adaptability to different water quality conditions, lighting environments, and shooting distances, an area where traditional methods generalize poorly. Particularly noteworthy are the error metrics on the predominantly low-light Seathru dataset, which suggest that our method enhances the extraction of depth features under low-light conditions.

4.3. Image Restoration Comparative Experiments

This subsection provides a comprehensive evaluation of the image restoration results, validating the adaptability of the proposed method to different degradation types and underwater environments.
The image restoration evaluation experiments are conducted on several challenging underwater image datasets. The proposed method is compared with several representative methods, including a model-free method (CLAHE [11]), physics-based model methods (IBLA [23] and ULAP [24]), and deep learning-based methods (UWCNN [13], FUnIE-GAN [14] and U-Transformer [9]). Experimental results with clear reference images are shown in Figure 7, and results without reference images are shown in Figure 8.
Samples A,B in Figure 7 are from the FLSea dataset, C,D are from the UIEB dataset, and E–G are from the UIEB-S dataset. Samples A–C in Figure 8 are from the Seathru dataset, and D–F are from the HURLA dataset. These samples involve different lighting conditions, seabed planes, close-ups of marine life, coral reef structures, and cases of severe color cast. Furthermore, we reused the depth estimation samples from the previous step (Figure 7A,B and Figure 8B,C) to demonstrate the effectiveness of the continuous task of depth estimation to image restoration.
(1) Subjective Evaluation
To comprehensively evaluate the performance of the overall image restoration model, this paper systematically analyzes the restoration effects from three aspects: brightness compensation, color correction, and detail recovery.
In terms of brightness compensation, for low-light images (Figure 8A–C), the restoration effects of the comparison methods are poor and can even lead to more severe degradation and distortion. For instance, ULAP darkens the dim areas in the image, UWCNN exhibits abnormal color distortion, and while CLAHE can effectively increase brightness, it does not alleviate the greenish cast caused by the environment. U-Transformer can effectively enhance brightness, but its restoration results introduce unnatural yellow tones that do not exist in the original image. For example, the green coral reefs in the figure turn yellow after restoration. The CLAHE method improves image brightness to some extent (Figure 8D–F), but excessive enhancement often leads to an overall overly bright and oversaturated image. In contrast, our method provides brightness compensation for both targets and the surrounding environment, presenting clearer image details, and achieves color cast correction while enhancing image brightness.
In terms of color correction, existing typical methods have limited ability to handle color distortion. For images with severe color cast (Figure 7E–G) and greenish cast under low light (Figure 8A–C), U-Transformer can alleviate color cast to some extent, but its correction effect is limited for severe color cast (as shown in Figure 7F). The restoration results of other methods still exhibit a strong greenish cast, especially UWCNN, which also shows abnormal color distortion. Our method effectively eliminates the severe color cast in the images. The restoration effects of other methods on yellowish-tone images (Figure 8D–F) are also limited. In particular, the results from the IBLA method show abnormal purple hues, and the results from ULAP appear reddish. In contrast, our method can effectively address severe color cast problems, maintaining stable color restoration effects.
In terms of detail recovery, the differences among the methods are particularly significant in the textures of close-range objects and the fine structures of organisms. As shown in Figure 7A,B, in the restoration of seabed rock areas, methods like ULAP and FUnIE-GAN blur the geometric details of the rock surfaces, weakening their original texture features and edge definition. Our method more clearly restores the uneven textures and crack details of the rock blocks, with higher edge sharpness. Our method also shows advantages in the restoration of marine organisms: as shown in Figure 7C and Figure 8F, the textures of marine fish and the surrounding environmental details of coral and sand grains are more clearly recovered; in Figure 7D and Figure 8D, the complex tentacle structures of anemone-like organisms are also more significantly restored. In contrast, the restoration results of other methods in these areas appear smoother or with blurred contours.
Overall, our method outperforms existing representative algorithms in terms of color correction, brightness compensation, detail preservation, and robustness in complex scenes. Its restoration results are visually more natural and realistic, closest to the reference ground truth (GT) images, demonstrating the effectiveness of the joint optimization of the depth estimation model and the physical imaging model.
(2) Objective Evaluation
To more comprehensively evaluate the effectiveness of the proposed method, we conducted full-reference and no-reference evaluations for different methods. Table 3 presents the evaluation results of the full-reference metrics Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [47], used to measure the pixel-level or structural similarity between the restored image and the ground-truth reference image. Table 4 presents the evaluation results of the no-reference metrics UIQM and UCIQE [48], used to quantify the visual perceptual quality of the restored image in terms of color, clarity, and contrast. Similarly, we use bold font in the tables to indicate the best result for each metric and underline to indicate the second-best result.
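As a reminder of what the full-reference PSNR measures, the sketch below computes it from the mean squared error between a restored image and its reference, both given as flat lists of pixel values. The 8-bit peak value of 255 is an assumption; SSIM is omitted because it requires windowed local statistics.

```python
import math

def psnr(restored, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(reference)
    if mse == 0.0:
        return math.inf            # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by exactly one grey level, so MSE = 1.
print(round(psnr([11.0, 21.0, 31.0], [10.0, 20.0, 30.0]), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving that error.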
Analyzing Table 3, our method achieves the highest scores on the SSIM metric, indicating a clear advantage in restoring the structural information of images, with its output being structurally closest to the clear reference image. On the most challenging severe color cast dataset, UIEB-S, it reaches 0.8954. This is a 2.8% improvement over the second-best CLAHE (0.8712), a 27.2% improvement over the worst-performing IBLA, a 15.4% improvement over ULAP, a 21.0% improvement over UWCNN, a 16.7% improvement over FUnIE-GAN, and a 16.7% improvement over U-Transformer, indicating a strong capability for restoring images with extreme color cast.
On the PSNR metric, our method performs best on both the UIEB and UIEB-S datasets, with a particularly large improvement on UIEB-S, where its PSNR (23.35) is 2.85 dB higher than the second-best FUnIE-GAN (20.50). It shows improvements of 4.29 dB over CLAHE, 4.87 dB over IBLA, 3.25 dB over ULAP, 2.86 dB over U-Transformer, and the largest improvement of 5.61 dB over UWCNN. These gains reflect lower pixel-level errors and are consistent with better suppression of noise and artifacts introduced during restoration. On the FLSea dataset, our method (24.09) is 1.15 dB lower than IBLA (25.24), but it remains well above the other comparison methods, with improvements of 7.31 dB and 6.31 dB over ULAP and UWCNN, respectively. Taken together with the SSIM results, this shows a good balance between pixel accuracy and structural fidelity. The margin is largest on UIEB-S, which supports the generalization ability and robustness of our method on the most challenging underwater image degradation problems, especially severe color distortion.
Analyzing Table 4, the no-reference evaluation results indicate that our method performs excellently on the UCIQE metric. Especially on the most challenging dataset UIEB-S, it shows a 3.8% improvement over the second-best ULAP and a significant 16.9% improvement over UWCNN. On FLSea, it also shows overall improvement, with a particularly large 9.3% improvement over FUnIE-GAN. On UIEB, it is only 0.3% lower than ULAP but shows substantial improvements compared to other methods. UCIQE mainly evaluates the colorfulness, saturation, and contrast of the image, indicating that our method can very effectively restore the color information lost in underwater images and produce results that are visually rich in color with appropriate contrast.
However, our method is not the best on the UIQM metric, generally performing at a medium-to-high level. The reason for this is that UIQM is very sensitive to changes in color saturation. Traditional enhancement methods like CLAHE and generative models like FUnIE-GAN often significantly boost the global contrast and saturation of images, leading to higher UIQM scores, but sometimes introducing unnatural colors or over-enhancement (as seen in the results of Figure 7 and Figure 8). In contrast, our method focuses more on accurate color correction and detail recovery rather than blindly increasing saturation. This may result in relatively lower UIQM scores but brings higher fidelity (SSIM/PSNR) and more natural colors (UCIQE).
In recent years, significant progress has been made in the field of underwater image quality assessment (IQA). For instance, PUIQA [49] incorporates physics-informed guidance and multi-scale perception, explicitly considering physical priors such as non-uniform illumination and backscatter gradient. Our method performs restoration through the depth-driven Akkaynak-Treibitz physical model, and its output images possess inherent advantages in terms of physical consistency. It is anticipated that these images will achieve high scores in the physics-informed dimensions of metrics like PUIQA. New metrics such as PUIQA inherit the focus of traditional metrics on structural information. The excellent performance of our method in SSIM (0.8954 on UIEB-S) has already demonstrated its structural preservation capability, and this advantage is expected to extend to the evaluation of these new metrics.
In summary, the core advantage of our method lies in its balance and accuracy. It performs strongly on the no-reference metric UCIQE and, more importantly, leads on the fidelity metrics that measure restoration accuracy in most comparisons: SSIM is the highest in Table 3, and PSNR is the highest on UIEB and UIEB-S and second only to IBLA on FLSea. This means our method can produce visually pleasing results while ensuring that the restored image is closer to the real, clear scene in terms of structure and pixel level.

4.4. Computational Efficiency Analysis

Underwater image restoration techniques are often deployed on platforms with limited computational resources (e.g., AUVs), making real-time performance and lightweight design crucial. This section evaluates the computational efficiency of different methods, focusing on their feasibility in practical deployment scenarios.
To better simulate the practical impact of deploying restoration technology on devices such as AUVs, we adopted a lower-performance NVIDIA GeForce GTX 1050Ti GPU as the experimental platform to emulate resource-constrained scenarios. The evaluation metrics include the number of parameters (Params), floating-point operations (FLOPs), and frames per second (FPS). All methods were tested under the same hardware environment with a unified input image resolution of 640 × 480. It is important to note that some comparison methods (e.g., FUnIE-GAN and U-Transformer) in their original implementations resize input images to 256 × 256, which significantly reduces their computational overhead. For these methods, the FLOPs reported in the table therefore understate, and the FPS overstates, what processing the full 640 × 480 resolution would require. Table 5 lists the computational efficiency comparison of various methods.
As shown in Table 5, traditional methods (CLAHE, UDCP) apply fixed rules or mathematical formulas directly and require no learned parameters, resulting in higher FPS. Among deep learning methods, the proposed UMdepth model has moderate parameters (11.89 M) and FLOPs (17.87 G) but achieves 92 FPS, significantly higher than UWCNN and U-Transformer and comparable to CycleGAN. Notably, although FUnIE-GAN achieves a high FPS of 293, its actual processing resolution is 256 × 256. If it were to directly process 640 × 480 images, its computational cost would increase substantially, and FPS would drop significantly. U-Transformer has a high parameter count of 65.60 M and FLOPs of 66.20 G, with only 11 FPS, making it difficult to meet real-time requirements. Overall, while maintaining excellent restoration performance, our method achieves a real-time processing speed of 92 FPS with a relatively low parameter count, striking a good balance among parameters, FLOPs, and FPS, making it well-suited for deployment on resource-constrained underwater platforms [50].
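A throughput figure of this kind can be obtained with a simple timing loop, sketched below with a stand-in workload. The helper names are illustrative; for a GPU model the device must be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch). The second helper gives the pixel-count ratio between two resolutions, under the rough assumption that the cost of a fully convolutional network grows linearly with the number of pixels.

```python
import time

def measure_fps(process_frame, frame, warmup=5, runs=50):
    """Average frames per second of a callable after a few warm-up calls."""
    for _ in range(warmup):
        process_frame(frame)       # warm-up calls are excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return runs / elapsed

def pixel_ratio(full_size, reduced_size):
    """How many times more pixels the full resolution has."""
    return (full_size[0] * full_size[1]) / (reduced_size[0] * reduced_size[1])

# Stand-in workload instead of a real restoration network.
fps = measure_fps(lambda f: sum(f), list(range(1000)), warmup=1, runs=5)
print(fps > 0)
print(pixel_ratio((640, 480), (256, 256)))
```

The ratio shows that a 640 × 480 frame contains about 4.7 times as many pixels as a 256 × 256 frame, which gives a sense of how much a method benefits from internal downsizing.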

4.5. Ablation Study

This subsection systematically validates the specific contributions of the key modules proposed in this paper to the depth estimation performance through ablation experiments. Since an accurate depth map is a prerequisite for image restoration based on a physical model, the performance of depth estimation directly affects the final restoration result. Therefore, this section focuses on ablation analysis of the depth estimation network.
We conducted systematic ablation experiments with various configurations on the Seathru dataset. These configurations included a Baseline based on AdaBins and UDepth, one with the encoder replaced by MobileNetV3-small (V3), one with the CBAM attention module added, one with the Channel–Spatial Hybrid Attention Module (CSHAM) added, and the complete model (Ours). This allowed for a quantitative assessment of each module's impact on depth estimation accuracy and model efficiency. All configurations were tested under identical training settings, with the results presented in Table 6.
Analyzing Table 6, the experimental data reveals several key points. Firstly, compared to the Baseline, replacing the encoder with MobileNetV3-small significantly reduces the model’s parameter count (from 15.6 M to 11.8 M, a decrease of approximately 24.4%) while improving all depth estimation metrics. For example, AbsRel decreases from 0.898 to 0.852, SqRel from 0.345 to 0.315, and RMSE from 0.211 to 0.201. This lightweight design does not compromise performance; instead, it enhances accuracy through more efficient feature extraction. Secondly, introducing the CSHAM module further reduces AbsRel, SqRel, and RMSE to 0.804, 0.289, and 0.198, respectively, with only a minimal increase in parameters. Compared to introducing CBAM, our designed CSHAM achieves better evaluation metrics with fewer parameters, indicating that CSHAM effectively guides the model to focus on features more critical for depth estimation in underwater scenes, thereby significantly improving depth estimation accuracy. Finally, our complete model achieves the best performance on all error and accuracy metrics. It reduces AbsRel by 12.7%, SqRel by 19.7%, and RMSE by 13.3% compared to the Baseline, while maintaining the parameter count at 11.9 M, achieving an optimal balance between accuracy and efficiency.
In summary, the ablation experiments confirm the effectiveness of each proposed module in enhancing underwater monocular depth estimation performance. The effective combination of these modules enables the complete model to achieve a superior balance between parameter count and accuracy, laying a solid foundation for subsequent high-quality image restoration.

4.6. Experimental Summary

Through systematic experimental design and analysis, the effectiveness and advancement of the proposed method have been comprehensively validated across three dimensions: depth estimation, image restoration, and ablation studies. In terms of depth estimation, both subjective visual comparisons and objective metric evaluations demonstrate that our method produces more accurate and consistent depth maps across various underwater environments, significantly outperforming existing mainstream methods. Regarding image restoration, our method exhibits outstanding performance from both subjective visual perspectives (such as brightness, detail, and color) and objective metrics (such as PSNR, SSIM, and UCIQE), with particularly notable advantages in scenes with extreme color cast. The ablation study further confirms the necessity and contribution of each innovative module. Overall, the experimental results demonstrate that the technical approach of deeply integrating monocular depth estimation with a physical imaging model can effectively address the degradation problems caused by complex underwater environments, providing reliable technical support for enhancing the performance of underwater vision tasks.
However, our work also has certain limitations. During experiments, we observed that restoration performance is sometimes suboptimal for background regions with large depth values, such as distant water bodies or dim far-range targets. Our analysis suggests two main factors contributing to this: First, in these regions, light signals undergo extreme attenuation and scattering, resulting in severe loss of image information and providing very weak or even contradictory visual cues for depth estimation. This leads to inaccuracies in depth estimation for distant areas. Second, this fundamental problem is amplified by inherent issues in our training datasets: the depth values in reference depth maps for these distant or information-lost regions are often marked as “NaN,” indicating unavailable or unreliable ground truth depth. This data absence prevents our model from effectively learning how to correctly regress depth in these regions during training, leading to depth estimation biases during testing, which further propagate through the physical imaging model and result in restoration failures in those areas.
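The effect of missing ground truth can be seen in how a masked loss is typically computed. The sketch below is an assumption about how NaN pixels might be excluded, not the paper's training code: pixels without valid depth are dropped before averaging, so they contribute no supervision.

```python
import math

def masked_rmse(pred, target):
    """RMSE over pixels whose ground-truth depth is not NaN."""
    valid = [(p, t) for p, t in zip(pred, target) if not math.isnan(t)]
    if not valid:
        return 0.0                 # no supervised pixels in this sample
    return math.sqrt(sum((p - t) ** 2 for p, t in valid) / len(valid))

# The third pixel has no ground truth, so its prediction is never penalized.
nan = float("nan")
print(masked_rmse([1.0, 2.0, 5.0], [1.0, 4.0, nan]))
print(masked_rmse([1.0, 2.0, 500.0], [1.0, 4.0, nan]))
```

Both calls return the same value even though the third prediction differs by two orders of magnitude, which illustrates why depth in far-range regions remains poorly constrained.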

5. Conclusions

This paper addresses the core challenges faced by existing methods in underwater image restoration tasks, namely the difficulty in obtaining depth information in complex environments and insufficient generalization capability. It proposes an end-to-end restoration framework that integrates monocular depth estimation with a physical imaging model. The starting point of this method is to establish a synergistic optimization mechanism between data-driven perception and physical model constraints. In the depth estimation stage, we designed an efficient encoder–decoder architecture and integrated an improved channel–spatial hybrid attention module into the decoder path, enabling the network to adaptively focus on key features less affected by degradation. To overcome the inherent scale ambiguity problem in monocular depth estimation, we introduced a depth prior parameterization strategy based on sparse feature points and designed an m-ViT module to achieve adaptive depth binning. This allows the model to dynamically adjust the depth range based on the input image content, thereby enhancing the model’s cross-domain generalization capability. In the image restoration stage, we embed the predicted high-quality depth map as a key physical parameter into the Akkaynak-Treibitz imaging model for inverse solution, achieving high-quality restoration from degraded images to clear images.
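The inverse solution step can be illustrated with a minimal per-pixel sketch of a simplified Akkaynak-Treibitz model, in which the observed intensity is the attenuated scene radiance plus range-dependent backscatter. The function names, the use of constant per-channel coefficients (the original model lets the attenuation coefficient vary with range), and all numeric values are assumptions for illustration only, not the parameter estimation or implementation used in this work.

```python
import math

def simulate_pixel(clear, depth, beta_d, beta_b, b_inf):
    """Forward model: attenuated direct signal plus backscatter, per channel."""
    return tuple(
        j * math.exp(-bd * depth) + binf * (1.0 - math.exp(-bb * depth))
        for j, bd, bb, binf in zip(clear, beta_d, beta_b, b_inf)
    )

def restore_pixel(observed, depth, beta_d, beta_b, b_inf):
    """Invert the simplified model for one RGB pixel.

    observed: (R, G, B) intensities in [0, 1]
    depth:    range to the scene point (same length unit as 1/beta)
    beta_d:   per-channel attenuation coefficients (assumed constant here)
    beta_b:   per-channel backscatter coefficients
    b_inf:    per-channel veiling light (backscatter at infinite range)
    """
    restored = []
    for i_c, bd, bb, binf in zip(observed, beta_d, beta_b, b_inf):
        backscatter = binf * (1.0 - math.exp(-bb * depth))
        direct = max(i_c - backscatter, 0.0)   # remove backscatter
        j_c = direct * math.exp(bd * depth)    # undo range-dependent attenuation
        restored.append(min(j_c, 1.0))         # keep the result in [0, 1]
    return tuple(restored)
```

Because the attenuation is undone by multiplying with exp(beta_d * depth), any error in the estimated depth is amplified exponentially at long range, which is why the quality of the depth map matters so much for the restored image.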
Extensive experiments on multiple datasets demonstrate that the proposed method performs well across various environments and image types. In the depth estimation task, our method achieves the best values on all reported metrics on both the FLSea and Seathru datasets while remaining robust (SqRel and RMSE stay around 0.27 and 0.2, respectively). In the image restoration task, our method achieves the highest SSIM on all three full-reference datasets and the highest PSNR on UIEB and UIEB-S, with SSIM and PSNR no lower than 0.87 and 21.67 dB, respectively; the no-reference metrics UIQM and UCIQE remain stable within 2.56–2.86 and around 0.6, respectively. The gains are most pronounced on the severely color-cast UIEB-S dataset, where the method attains the best SSIM, PSNR, and UCIQE among the compared methods. This indicates that the restored images have natural color and good perceptual quality, and that the method has clear potential for real-world underwater image restoration, especially under extreme color cast.
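For readers who want to relate the PSNR figures to pixel-level error, the following minimal sketch uses the standard definition, PSNR = 10 log10(MAX² / MSE). The function name `psnr`, the flat-list input, and the [0, 1] intensity range are assumptions for illustration; this is not the evaluation script used for the reported results.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, restored: flat sequences of pixel intensities.
    max_value: largest possible intensity (1.0 for normalized images).
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a uniform error of 0.1 on normalized intensities corresponds to 20 dB, so values slightly above 21 dB correspond to a root-mean-square error a little under 0.1.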

Author Contributions

H.Q. proposed the original idea and wrote the manuscript. Q.L. and X.L. collected materials and wrote the manuscript. T.Z. supervised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grant numbers 52001039 and 51839004.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the corresponding author on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUV: Autonomous Underwater Vehicles
CSHAM: Channel–Spatial Hybrid Attention Mechanism
m-ViT: m-Vision Transformer
UIQM: Underwater Image Quality Measure
UCIQE: Underwater Color Image Quality Evaluation

References

  1. Moghimi, M.K.; Mohanna, F. Real-time underwater image enhancement: A systematic review. J. Real-Time Image Process. 2021, 18, 1509–1525.
  2. Li, S.; Wu, Y.; Li, C.; Zhao, H.; Li, Y. Application and prospect of unmanned underwater vehicle. Bull. Chin. Acad. Sci. (Chin. Version) 2022, 37, 910–920.
  3. González-Sabbagh, S.P.; Robles-Kelly, A. A survey on underwater computer vision. ACM Comput. Surv. 2023, 55, 1–39.
  4. Chiang, J.Y.; Chen, Y.C. Underwater image enhancement by wavelength compensation and dehazing. IEEE Trans. Image Process. 2011, 21, 1756–1769.
  5. Yuan, X.; Guo, L.; Luo, C.; Zhou, X.; Yu, C. A survey of target detection and recognition methods in underwater turbid areas. Appl. Sci. 2022, 12, 4898.
  6. Hu, K.; Wang, T.; Shen, C.; Weng, C.; Zhou, F.; Xia, M.; Weng, L. Overview of underwater 3D reconstruction technology based on optical images. J. Mar. Sci. Eng. 2023, 11, 949.
  7. Qi, Y.; Yang, Z.; Sun, W.; Lou, M.; Lian, J.; Zhao, W.; Deng, X.; Ma, Y. A comprehensive overview of image enhancement techniques. Arch. Comput. Methods Eng. 2022, 29, 583–607.
  8. Muniraj, M.; Dhandapani, V. Underwater image enhancement by color correction and color constancy via Retinex for detail preserving. Comput. Electr. Eng. 2022, 100, 107909.
  9. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079.
  10. Jha, K.; Sakhare, A.; Chavhan, N.; Lokulwar, P.P. A Review on Image Enhancement Techniques Using Histogram Equalization. In Proceedings of the AIDE-2023 and PCES-2023 Program Schedule, Hinweis Research, Trivandrum, India, 28 October 2023.
  11. Sharma, R.; Kamra, A. A Review on CLAHE Based Enhancement Techniques. In Proceedings of the 2023 6th International Conference on Contemporary Computing and Informatics (IC3I), Lucknow, India, 14–16 September 2023.
  12. Singh, N.; Bhat, A. A systematic review of the methodologies for the processing and enhancement of the underwater images. Multimed. Tools Appl. 2023, 82, 38371–38396.
  13. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038.
  14. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234.
  15. Raveendran, S.; Patil, M.D.; Birajdar, G.K. Underwater image enhancement: A comprehensive review, recent trends, challenges and applications. Artif. Intell. Rev. 2021, 54, 5413–5467.
  16. Zhou, G.; Jia, G.; Zhou, X.; Song, N.; Wu, J.; Gao, K.; Huang, J.; Xu, J.; Zhu, Q. Adaptive high-speed echo data acquisition method for bathymetric LiDAR. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5703017.
  17. Rajapaksha, U.; Sohel, F.; Laga, H.; Diepeveen, D.; Bennamoun, M. Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey. ACM Comput. Surv. 2024, 56, 1–51.
  18. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2019.
  19. Xu, W.; Fu, Y.L.; Zhu, D. ResNet and its application to medical image processing: Research progress and challenges. Comput. Methods Programs Biomed. 2023, 240, 107660.
  20. Watson, J.; Mac Aodha, O.; Prisacariu, V.; Brostow, G.; Firman, M. The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
  21. Song, M.; Lim, S.; Kim, W. Monocular depth estimation using laplacian pyramid-based depth residuals. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4381–4393.
  22. Yu, B.; Wu, J.; Islam, M.J. Udepth: Fast monocular depth estimation for visually-guided underwater robots. arXiv 2022, arXiv:2209.12358.
  23. Peng, Y.-T.; Cosman, P.C. Underwater image restoration based on image blurriness and light absorption. IEEE Trans. Image Process. 2017, 26, 1579–1594.
  24. Song, W.; Wang, Y.; Huang, D.; Tjondronegoro, D. A Rapid Scene Depth Estimation Model Based on Underwater Light Attenuation Prior for Underwater Image Restoration. In Proceedings of the 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018.
  25. Zhao, D.; Mao, W.; Chen, P.; Hu, Y.; Liang, H.; Dang, Y.; Liang, R.; Guo, X. A distributed and parallel accelerator design for 3-D acoustic imaging on FPGA-based systems. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 43, 1401–1414.
  26. Alsakar, Y.M.; Sakr, N.A.; El-Sappagh, S.; Abuhmed, T.; Elmogy, M. Underwater image restoration and enhancement: A comprehensive review of recent trends, challenges, and applications. Vis. Comput. 2025, 41, 3735–3783.
  27. Akkaynak, D.; Treibitz, T. Sea-thru: A Method for Removing Water from Underwater Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
  28. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353.
  29. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission Estimation in Underwater Single Images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 1–8 December 2013.
  30. Liu, D.; Zhou, J.; Xie, X.; Lin, Z.; Lin, Y. Underwater image restoration via background light estimation and depth map optimization. Opt. Express 2022, 30, 29099–29116.
  31. Zhou, J.; Liu, Q.; Jiang, Q.; Ren, W.; Lam, K.-M.; Zhang, W. Underwater camera: Improving visual perception via adaptive dark pixel prior and color correction. Int. J. Comput. Vis. 2023, 133, 8215–8233.
  32. Zhang, F.; You, S.; Li, Y.; Fu, Y. Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.
  33. Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
  34. Wang, J.; Ye, X.; Liu, Y.; Mei, X.; Hou, J. Underwater self-supervised monocular depth estimation and its application in image enhancement. Eng. Appl. Artif. Intell. 2023, 120, 105846.
  35. Chen, T.; Wang, N.; Chen, Y.; Kong, X.; Lin, Y.; Zhao, H.; Karimi, H.R. Semantic attention and relative scene depth-guided network for underwater image enhancement. Eng. Appl. Artif. Intell. 2023, 123, 106532.
  36. González-Sabbagh, S.; Robles-Kelly, A.; Gao, S. Scene-cGAN: A GAN for underwater restoration and scene depth estimation. Comput. Vis. Image Underst. 2025, 250, 104225.
  37. Nathan, O.B.; Levy, D.; Treibitz, T.; Rosenbaum, D. Osmosis: Rgbd Diffusion Prior for Underwater Image Restoration. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024.
  38. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
  39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  40. Burger, W.; Burge, M.J. Scale-Invariant Feature Transform (SIFT). In Digital Image Processing: An Algorithmic Introduction; Springer Nature: Cham, Switzerland, 2022; pp. 709–763.
  41. Chen, Z.; Badrinarayanan, V.; Drozdov, G.; Rabinovich, A. Estimating Depth from rgb and sparse Sensing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  42. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  43. Randall, Y. Flsea: Underwater Visual-Inertial and Stereo-Vision Forward-Looking Datasets. Master’s Thesis, University of Haifa, Haifa, Israel, 2023.
  44. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389.
  45. Gui, M.; Schusterbauer, J.; Prestel, U. DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
  46. Wang, K.; Guo, J.; Chen, K. An in-depth examination of SLAM methods: Challenges, advancements, and applications in complex scenes for autonomous driving. IEEE Trans. Intell. Transp. Syst. 2025, 26, 11066–11087.
  47. Setiadi, D.R.I.M. PSNR vs. SSIM: Imperceptibility quality assessment for image steganography. Multimed. Tools Appl. 2021, 80, 8423–8444.
  48. Hou, G.; Zhang, S.; Lu, T.; Li, Y.; Pan, Z.; Huang, B. No-reference quality assessment for underwater images. Comput. Electr. Eng. 2024, 118, 109293.
  49. Rehman, M.U.; Abbas, Z.; Nasir, M.F.; Hussain, I. A multiscale physics-informed framework for robust no-reference underwater image quality evaluation. Alex. Eng. J. 2026, 135, 114–125.
  50. Wang, Z.; Yang, M.; Wang, G.; Lian, Y.; Wang, Y. Digital twin-driven shape-performance-control-application integrated design for unmanned underwater vehicles. Sci. China Technol. Sci. 2026, 69, 1380301.
Figure 1. Overall model architecture diagram. In the figure, ‘bneck’ is the basic structure of the MobileNetV3-small [38] network, BN is Batch Normalization, ‘h-swish’ is the activation function, and ‘CSHAM’ denotes the Channel–Spatial Hybrid Attention Module.
Figure 2. CSHAM Module.
Figure 3. Channel Attention Module. AvgPool denotes Average Pooling, MaxPool denotes Max Pooling, Sigmoid is the activation function, and Conv denotes convolutional layer.
Figure 4. Spatial Attention Module. AvgPool denotes Average Pooling, MaxPool denotes Max Pooling, Sigmoid is the activation function, and Conv denotes convolutional layer.
Figure 5. m-Vision Transformer structure based on AdaBins.
Figure 6. Depth estimation results of different methods under various environments. Samples (A–C) are from the FLSea dataset, and samples (D–F) are from the Seathru dataset.
Figure 7. Qualitative comparison with ground truth. Samples (A,B) are from the FLSea dataset, (C,D) are from the UIEB dataset, and (E–G) are from the UIEB-S dataset.
Figure 8. Qualitative comparison without ground truth. Samples (A–C) are from the Seathru dataset, and (D–F) are from the HURLA dataset.
Table 1. Quantitative comparison of different models on the FLSea dataset.

| Model | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | δ1 ↑ | δ2 ↑ | δ3 ↑ |
|---|---|---|---|---|---|---|
| Monodepth2 | 0.798 | 0.339 | 0.247 | 0.293 | 0.549 | 0.744 |
| Manydepth | 1.197 | 0.530 | 0.308 | 0.328 | 0.639 | 0.861 |
| LapDepth | 0.977 | 0.393 | 0.253 | 0.478 | 0.749 | 0.875 |
| UDepth | 1.355 | 0.725 | 0.363 | 0.356 | 0.632 | 0.846 |
| Ours | 0.747 | 0.272 | 0.228 | 0.543 | 0.756 | 0.884 |
Table 2. Quantitative comparison of different models on the Seathru dataset.

| Model | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | δ1 ↑ | δ2 ↑ | δ3 ↑ |
|---|---|---|---|---|---|---|
| Monodepth2 | 0.979 | 0.395 | 0.320 | 0.338 | 0.591 | 0.776 |
| Manydepth | 1.445 | 0.673 | 0.315 | 0.437 | 0.667 | 0.814 |
| LapDepth | 1.488 | 0.740 | 0.298 | 0.486 | 0.711 | 0.841 |
| UDepth | 1.517 | 0.760 | 0.319 | 0.485 | 0.683 | 0.816 |
| Ours | 0.784 | 0.277 | 0.183 | 0.602 | 0.829 | 0.918 |
Table 3. Full-reference evaluation results.

| Model | FLSea SSIM | FLSea PSNR (dB) | UIEB SSIM | UIEB PSNR (dB) | UIEB-S SSIM | UIEB-S PSNR (dB) |
|---|---|---|---|---|---|---|
| CLAHE | 0.8592 | 20.85 | 0.8794 | 20.66 | 0.8712 | 19.06 |
| IBLA | 0.8704 | 25.24 | 0.6002 | 15.91 | 0.7039 | 18.48 |
| ULAP | 0.6103 | 16.78 | 0.7576 | 19.69 | 0.7756 | 20.10 |
| UWCNN | 0.6984 | 17.78 | 0.7246 | 18.11 | 0.7399 | 17.74 |
| FUnIE-GAN | 0.8237 | 22.16 | 0.7615 | 20.35 | 0.7673 | 20.50 |
| U-Transformer | 0.7804 | 22.65 | 0.7399 | 20.10 | 0.8395 | 20.49 |
| Ours | 0.8730 | 24.09 | 0.8868 | 21.67 | 0.8954 | 23.35 |
Table 4. No-reference evaluation results.

| Model | FLSea UIQM | FLSea UCIQE | UIEB UIQM | UIEB UCIQE | UIEB-S UIQM | UIEB-S UCIQE | HURLA UIQM | HURLA UCIQE |
|---|---|---|---|---|---|---|---|---|
| CLAHE | 3.2167 | 0.5819 | 3.4847 | 0.5839 | 3.3480 | 0.5573 | 3.3129 | 0.5593 |
| IBLA | 2.8269 | 0.5792 | 1.6897 | 0.6043 | 2.3640 | 0.5784 | 1.1400 | 0.6104 |
| ULAP | 1.6646 | 0.5795 | 2.4742 | 0.6214 | 2.4096 | 0.5884 | 2.3413 | 0.6237 |
| UWCNN | 2.8258 | 0.6052 | 3.2739 | 0.5411 | 3.0983 | 0.5224 | 3.1271 | 0.5375 |
| FUnIE-GAN | 3.3755 | 0.5542 | 3.3812 | 0.5546 | 3.2787 | 0.5343 | 2.9432 | 0.5430 |
| U-Transformer | 3.3081 | 0.5591 | 3.4632 | 0.5700 | 3.3232 | 0.5688 | 3.2763 | 0.5487 |
| Ours | 2.5617 | 0.6055 | 2.8637 | 0.6183 | 2.7399 | 0.6107 | 2.7083 | 0.5983 |
Table 5. Computational efficiency comparison of different methods.

| Methods | FLOPs (G) ↓ | Params (M) ↓ | FPS ↑ |
|---|---|---|---|
| CLAHE | - | - | 154 |
| UDCP | - | - | 142 |
| UWCNN | 3.18 | 1.21 | 42 |
| CycleGAN | 63.31 | 13.62 | 89 |
| FUnIE-GAN | 10.68 | 7.69 | 293 |
| U-Transformer | 66.20 | 65.60 | 11 |
| Ours | 17.87 | 11.89 | 92 |
Table 6. Results of the ablation study.

| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | δ1 ↑ | δ2 ↑ | δ3 ↑ | Param |
|---|---|---|---|---|---|---|---|
| Baseline | 0.898 | 0.345 | 0.211 | 0.551 | 0.805 | 0.914 | 15.6 M |
| Baseline + V3 | 0.852 | 0.315 | 0.201 | 0.554 | 0.818 | 0.915 | 11.8 M |
| Baseline + CBAM | 0.836 | 0.302 | 0.207 | 0.559 | 0.811 | 0.915 | 15.8 M |
| Baseline + CSHAM | 0.804 | 0.289 | 0.198 | 0.572 | 0.819 | 0.916 | 15.6 M |
| Ours | 0.784 | 0.277 | 0.183 | 0.602 | 0.829 | 0.918 | 11.9 M |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, T.; Qin, H.; Liu, Q.; Liu, X. Underwater Image Restoration Integrating Monocular Depth Estimation with a Physical Imaging Model. J. Mar. Sci. Eng. 2026, 14, 563. https://doi.org/10.3390/jmse14060563

AMA Style

Zhang T, Qin H, Liu Q, Liu X. Underwater Image Restoration Integrating Monocular Depth Estimation with a Physical Imaging Model. Journal of Marine Science and Engineering. 2026; 14(6):563. https://doi.org/10.3390/jmse14060563

Chicago/Turabian Style

Zhang, Tianchi, Hongwei Qin, Qiang Liu, and Xing Liu. 2026. "Underwater Image Restoration Integrating Monocular Depth Estimation with a Physical Imaging Model" Journal of Marine Science and Engineering 14, no. 6: 563. https://doi.org/10.3390/jmse14060563

APA Style

Zhang, T., Qin, H., Liu, Q., & Liu, X. (2026). Underwater Image Restoration Integrating Monocular Depth Estimation with a Physical Imaging Model. Journal of Marine Science and Engineering, 14(6), 563. https://doi.org/10.3390/jmse14060563
