Article

NeRF-Enhanced Visual–Inertial SLAM for Low-Light Underwater Sensing

1 Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China
2 Key Laboratory of Ocean Observation and Information of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572024, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(1), 46; https://doi.org/10.3390/jmse14010046
Submission received: 22 November 2025 / Revised: 24 December 2025 / Accepted: 24 December 2025 / Published: 26 December 2025
(This article belongs to the Special Issue Intelligent Measurement and Control System of Marine Robots)

Abstract

Marine robots operating in low illumination and turbid waters require reliable measurement and control for surveying, inspection, and monitoring. This paper presents a sensor-centric visual–inertial simultaneous localization and mapping (SLAM) pipeline that combines low-light enhancement, learned feature matching, and NeRF-based dense reconstruction to provide stable navigation states. A lightweight encoder–decoder with global attention improves signal-to-noise ratio and contrast while preserving feature geometry. SuperPoint and LightGlue deliver robust correspondences under severe visual degradation. Visual and inertial data are tightly fused through IMU pre-integration and nonlinear optimization, producing steady pose estimates that sustain downstream guidance and trajectory planning. An accelerated NeRF converts monocular sequences into dense, photorealistic reconstructions that complement sparse SLAM maps and support survey-grade measurement products. Experiments on AQUALOC sequences demonstrate improved localization stability and higher-fidelity reconstructions at competitive runtime, showing robustness to low illumination and turbidity. The results indicate an effective engineering pathway that integrates underwater image enhancement, multi-sensor fusion, and neural scene representations to improve navigation reliability and mission effectiveness in realistic marine environments.

1. Introduction

Underwater simultaneous localization and mapping (SLAM) is essential for autonomous underwater vehicles and remotely operated vehicles performing exploration, mapping, and navigation tasks. Robust state estimation from visual and inertial sensors enables coverage-controlled surveying, waypoint tracking, and autonomous inspection in low-illumination and turbid conditions. However, it remains extremely challenging due to the unique constraints of the underwater domain, including light absorption, scattering, low texture, and limited visibility [1]. These constraints motivate integrated pipelines that enhance visual cues, learn robust correspondences, and leverage multi-sensor fusion to deliver stable navigation and measurement outputs for marine robots.
Early underwater SLAM approaches addressed some challenges by incorporating acoustic sensors and probabilistic estimators. For example, sonar-based SLAM solutions used acoustic range or imaging sonar to supplement visual data, often within particle filter or extended Kalman filter frameworks [1]. Eustice et al. demonstrated vision-based SLAM on the RMS Titanic wreck by carefully handling low-overlap imagery and fusing inertial measurements [2]. Modern underwater robotic systems such as SVIn2 combine sonar, camera, inertial, and depth sensors in a tightly coupled SLAM system to improve robustness and accuracy [3]. Despite these advances, purely vision-based SLAM underwater still struggles under poor illumination and turbidity, where cameras capture dark, low-contrast images that defeat conventional feature extraction.
Meanwhile, on land, visual SLAM techniques have matured with systems like ORB-SLAM2 and ORB-SLAM3, achieving impressive accuracy in clear environments [3,4,5]. ORB-SLAM2/3 rely on hand-crafted features (ORB) and have been extended to use inertial data for improved robustness. However, directly applying such methods underwater yields suboptimal performance, as shown in Rahman et al. [3]. Keypoints may not be detected reliably in low-light or hazy underwater images, and false matches increase without image enhancement. As a result, trajectory estimates from standard SLAM diverge or suffer large drift in these conditions.
Recent works have turned to deep learning to enhance visual SLAM front-ends. Learned feature extractors like SuperPoint [6] and context-aware matchers like SuperGlue [7] leverage neural networks to improve keypoint repeatability and matching robustness beyond classical methods. LightGlue [7,8] further refines this concept for efficiency, making deep matching feasible in real time. Another line of research integrates semantic understanding to filter or compensate for challenging visual effects; for instance, DynaSLAM [9] uses deep segmentation to remove dynamic objects from images, stabilizing mapping in dynamic scenes. In parallel, underwater image enhancement techniques have advanced significantly. Traditional physics-based approaches (e.g., histogram equalization and dark channel prior) and learning-based methods such as WaterGAN [10] and Sea-Thru [11] aim to restore color and clarity to underwater images. However, many enhancement algorithms focus on aesthetic or object detection improvement, and they are not optimized for SLAM feature processing.
Specifically for low-light underwater environments, researchers have proposed specialized enhancement networks. Xie et al. developed a deep model to simultaneously address low illumination and scattering, and they introduced a dedicated low-light underwater image dataset [12]. The enhanced images facilitate downstream tasks like object recognition. More directly tied to SLAM, Xin et al. [13] proposed ULL-SLAM, an end-to-end network that improves underwater SLAM by jointly learning image enhancement and feature detection in a self-supervised manner. Their approach reported increased feature match counts and reduced reprojection error under low-light conditions. Similarly, Qiu et al. [14] integrated an image preprocessing module (combining adaptive histogram equalization and a dehazing algorithm) into a visual–inertial SLAM pipeline, resulting in better trajectory accuracy in turbid water. These efforts underscore the importance of preprocessing and robust features for underwater SLAM, yet there remains a gap in combining multiple advanced components into a unified system.
This work’s main novelty lies in the integration and adaptation of a specially designed underwater enhancement module with recent methods—NeRF for dense mapping and CBAM-enhanced learned matching—targeted at low-illumination and turbid conditions; this configuration improves tracking robustness and reconstruction fidelity under challenging visibility. The contribution demonstrates the following practical combination: (i) a real-time enhancement network with global attention that stabilizes feature detection and matching under adverse imaging conditions; (ii) a SuperPoint + CBAM-enhanced ULightGlue matcher that increases inlier ratios in texture-poor and noisy scenes; (iii) a tightly coupled visual–inertial front end using inertial measurement unit (IMU) pre-integration and nonlinear optimization to sustain robust pose estimation during visual degradation; and (iv) a practical NeRF-based mapping component enabling higher-fidelity underwater reconstructions without relying on explicit depth sensors, supporting measurement and control by providing steady navigation states to downstream guidance and trajectory planning. Experiments indicate that the front-end modules (enhancement + ULightGlue with IMU fusion) primarily drive ATE improvements, whereas NeRF mainly enhances map quality.

2. Related Work

2.1. Underwater SLAM Challenges and Classical Approaches

Underwater environments impose significant challenges on SLAM. Light attenuation and scattering limit the effective range of cameras, and water turbidity and color absorption degrade image contrast and color fidelity [12]. These factors result in fewer detectable features and more mismatches, causing visual SLAM algorithms to drift. Classical approaches to underwater SLAM often incorporate additional sensors to compensate for these issues. Acoustic sensors such as sonar can provide range measurements or 2D scans independent of lighting conditions [1]. For example, side-scan sonar and multi-beam sonar have been used to create maps of the seafloor and aid localization where vision fails. Sparse point-based SLAM using sonar returns has been demonstrated, though with lower resolution compared to optical sensing.
Several early works combined vision with inertial navigation and pressure sensors to constrain SLAM drift. Eustice et al. pioneered large-area underwater visual SLAM by incorporating an inertial navigation system (INS) and performing loop closure on low-overlap imagery from deep-sea shipwreck exploration [2]. Their information filter approach maintained conservative covariance estimates to handle uncertain data association. Similarly, Ribas et al. [15] and Bahr et al. [16] applied particle filters and EKF SLAM to autonomous underwater vehicle (AUV) navigation in structured underwater environments like harbors, fusing sonar and vision. More recently, Rahman et al. presented SVIn and SVIn2, tightly-coupled SLAM systems that fuse a stereo camera, IMU, Doppler velocity log, depth sensor, and forward-looking sonar [3]. These systems achieved robust positioning in murky water by leveraging each sensor’s strength (e.g., sonar for long-range). However, the reliance on heavy acoustic equipment and prior calibration can be a drawback in some scenarios.
Pure vision-based SLAM underwater has also been explored, focusing on robust local features and outlier rejection. Researchers found that approaches like ORB-SLAM can work in shallow, clear water where sunlight or artificial lighting provides sufficient illumination, but they degrade rapidly as lighting diminishes or turbidity increases [3]. Thus, enhancing image quality and augmenting visual sensing with learning-based techniques has become a promising direction.

2.2. Learning-Based Feature Extraction and Scene Understanding

Deep learning has been applied to various aspects of SLAM to improve robustness and accuracy. One such area is feature detection and description: SuperPoint [6] introduced a self-supervised neural network that learns reliable keypoints and descriptors from image pairs. Unlike hand-crafted features, SuperPoint features are more repeatable under lighting and viewpoint changes, which is advantageous for underwater imagery with non-uniform illumination. Building on this, SuperGlue [7] and the more recent LightGlue [8] leverage graph neural networks and attention to perform context-aware matching of features between image pairs. These learned matchers significantly outperform traditional nearest-neighbor descriptor matching, especially in scenes with repetitive textures or image noise. In our work, we employ SuperPoint and LightGlue as the front-end to handle the difficult visual conditions of underwater scenes, where standard features like ORB would fail to find enough correspondences.
Another line of deep SLAM enhancement is using semantic or learned information to filter measurements. DynaSLAM (2018) used a CNN-based segmentation model to detect moving objects (like people or fish) and remove those regions from the mapping process, preventing them from corrupting the camera pose estimation [9]. While DynaSLAM targeted dynamic urban scenes, the general idea of leveraging deep learning to preprocess or weight feature measurements can be extended to underwater cases (e.g., ignoring marine life motion or false features from suspended particles).
Deep networks have also been trained to estimate camera pose or depth directly. End-to-end learned visual odometry (VO) and SLAM frameworks have emerged (e.g., DeepVO, ORB-SLAM with learned depth). However, these often require large training datasets and may not generalize well to the underwater domain without retraining. Instead of replacing the SLAM pipeline, our approach enhances key components with deep learning while keeping the well-proven model-based optimization in the loop.
In terms of mapping, there is growing interest in neural implicit representations within SLAM. Neural Radiance Fields (NeRF) [17], originally developed for view synthesis, have been integrated into SLAM in works like iMAP (Implicit Mapping and Positioning), which jointly optimizes poses and a NeRF representation of the scene in real time [18]. These methods indicate a path toward SLAM that yields dense, photorealistic maps. However, current neural SLAM hybrids have mostly been demonstrated on small-scale or indoor scenes with RGB-D input. Applying such techniques underwater is largely unexplored and is one focus of our work.

2.3. Underwater Image Enhancement for SLAM

Enhancing underwater images to improve SLAM performance has been an active research topic. Many classical underwater image enhancement algorithms address issues of color cast, low contrast, and haze. Examples include histogram equalization variants, Retinex-based methods, and the dark channel prior (DCP) adapted for underwater scenes. While these can improve visual appearance, their effect on feature detection can vary; some methods may oversaturate or introduce artifacts that confuse feature extractors. Still, incorporating enhancement as a preprocessing step has shown benefits. For instance, Qiu et al. (2024) reported that applying adaptive histogram equalization combined with a dehazing algorithm in a visual–inertial SLAM pipeline reduced the brightness differences caused by uneven lighting, boosted the number of inlier feature matches, and reduced drift [14].
Learning-based enhancement offers more powerful restoration at the cost of requiring training data. WaterGAN [10] was an early attempt to synthesize and correct underwater images using GANs, effectively creating paired data for training. UWCNN [19] introduced a convolutional network that classifies water types and applies a corresponding color correction. More relevant to SLAM, the ULL-SLAM approach by Xin et al. (2023) trains a network with a low-light enhancement branch and a feature detection branch together [13]. This joint learning ensures that the enhanced images are specifically optimized to yield good feature points for SLAM, rather than just looking visually pleasing. Their results showed improved tracking on underwater videos where conventional ORB-SLAM failed due to darkness. However, ULL-SLAM requires a custom training procedure and still produces only a sparse map.
This work differs by decoupling the enhancement and feature extraction tasks: we employ a dedicated enhancement network but then use off-the-shelf SuperPoint for features. This modularity allows using state-of-the-art networks for each component. We design our enhancement module to be lightweight enough for real-time use and to preserve scene details critical for SLAM (edges and corners). Furthermore, we integrate the enhancement with an advanced mapping back-end (NeRF) to fully exploit the improved image quality for dense reconstruction. In summary, while prior works tackled either front-end robustness or modest image preprocessing, our method brings together enhancement, deep features, inertial sensing, and neural mapping into a unified underwater SLAM system.

3. Materials and Methods

Architecture overview. As shown in Figure 1, the system comprises the following four modules: (1) low-light enhancement, which enhances low-light images, corrects color distortions, suppresses noise, and preserves feature information under degraded visibility; (2) learned feature extraction and matching (SuperPoint + LightGlue), which yields robust keypoints and correspondences for the visual front-end; (3) tightly coupled visual–inertial state estimation, which performs IMU pre-integration and nonlinear optimization to produce stable navigation states; and (4) NeRF-based dense reconstruction, which generates radiance-field reconstructions aligned to the SLAM trajectory to complement sparse maps. Module boundaries are color-coded consistently in four colors (one per module).

3.1. Underwater Low-Light Image Enhancement Module

Due to severe visual degradation in underwater environments, particularly under low-light conditions, traditional visual SLAM algorithms often struggle to obtain stable feature points. To address this challenge, we introduce a deep learning-based image enhancement module as a preprocessing step for SLAM. The enhancement network as shown in Figure 2 utilizes an encoder–decoder convolutional neural network (CNN), designed to enhance low-light images, correct color distortions, suppress noise, and preserve feature information useful for SLAM.

3.1.1. Encoder

The encoder consists of a series of convolutional layers that progressively downsample the input image to extract high-level semantic features. The network uses four downsampling blocks for feature extraction. Each encoder block is composed of two 3 × 3 convolutional layers. To reduce the impact of lighting variation between different images, instance normalization (IN) is applied after each convolutional layer. The entire network employs the SiLU (Sigmoid Linear Unit) activation function, which offers smoother nonlinear characteristics than ReLU and improves gradient propagation, allowing the features of low-light inputs to be more effectively activated. The mathematical expression of the SiLU activation function is as follows:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},$$
where $\sigma(x)$ is the Sigmoid function. This activation function effectively mitigates the vanishing feature problem and demonstrates superior feature extraction capability when processing low-light images. The network then proceeds with max-pooling to further downsample the image.
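To make the block structure concrete, the following is a minimal PyTorch sketch of one downsampling block as described above (two 3 × 3 convolutions with instance normalization and SiLU, followed by max-pooling); the channel widths and the returned skip tensor are illustrative assumptions, not values taken from this paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One downsampling block: two 3x3 convs with IN + SiLU, then 2x2 max-pool."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.SiLU(),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)          # kept for the decoder's skip connection
        return self.pool(skip), skip

# Example: four blocks as in the described encoder (channel widths are assumed)
blocks = nn.ModuleList([EncoderBlock(c_in, c_out)
                        for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]])
```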

3.1.2. Bottleneck Layer

At the deepest bottleneck layer of the network, we introduce a Global Attention Mechanism (GAM). GAM extracts global information from the feature map through global average pooling (GAP) and calculates attention weights via two fully connected layers to adjust the importance of the channels in the feature map, thereby enhancing the representation of key features. The calculation formula is as follows:
$$w_c = \sigma\!\left(f_2\big(\delta(f_1(\bar{F}))\big)\right),$$
where $\bar{F}$ is the feature vector after global pooling, $f_1$ and $f_2$ are fully connected layers, $\delta$ is the ReLU nonlinearity, and $\sigma$ is the Sigmoid function. The resulting $w_c$ serves as the channel attention weight, which re-weights the feature map, enhancing the features in key areas while suppressing those in irrelevant regions.
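A minimal sketch of this channel attention computation, assuming a standard squeeze-and-excitation-style layout; the reduction ratio is an assumed hyperparameter, not one reported in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling -> two FC layers (ReLU, Sigmoid) -> channel weights."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, F):                        # F: (B, C, H, W)
        f_bar = F.mean(dim=(2, 3))               # global average pooling -> (B, C)
        w_c = torch.sigmoid(self.fc2(torch.relu(self.fc1(f_bar))))
        return F * w_c.view(*w_c.shape, 1, 1)    # re-weight channels of the feature map
```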

3.1.3. Decoder

The decoder uses a symmetric four-layer structure. Each layer includes a transposed convolution for upsampling, followed by the concatenation of the upsampled features with the corresponding feature map from the encoder. The concatenated features are then fused through two 3 × 3 convolutional layers, again utilizing instance normalization (IN) and SiLU (Sigmoid Linear Unit) activation functions. After passing through the four decoding blocks, the number of channels is reduced to 64. Finally, two 1 × 1 convolutional layers map the feature map back to the 3-channel RGB image space. The decoding process gradually restores the spatial details of the image while maintaining a natural color transition. The network outputs an enhanced image of the same size as the input.
This encoder–attention-enhanced-decoder network architecture effectively addresses the quality degradation problem in underwater images. Multi-scale feature extraction and skip connections ensure the preservation of fine details, the global attention mechanism enhances the representation of features, and the symmetric decoding structure ensures the naturalness of the enhancement results. Experimental results show that the network performs effectively in underwater image enhancement tasks.

3.2. Feature Extraction and Matching Network for Underwater Low-Light Environments

Underwater visual SLAM faces harsh environmental conditions. Firstly, the water medium's absorption and scattering of light reduce image contrast, blur details, and distort colors (typically with a blue-green cast). Low-light conditions and turbidity from suspended particles make images appear dim, hazy, and color-shifted. These factors directly degrade feature point detection and description: effective features become scarce, and matching becomes difficult. Secondly, underwater scenes inherently have limited feature-rich targets, mostly concentrated in artificial structures, marine life, or seabed terrain. This means traditional algorithms (e.g., SIFT and ORB) tend to cluster extracted features around these structures, while textureless areas like sandy or muddy seabeds yield insufficient features. When underwater robots lose sight of these structures due to occlusion or movement, SIFT/ORB-based tracking often fails due to feature loss, leading to SLAM front-end tracking failure. Additionally, dynamic lighting and shot noise in underwater environments increase mismatch rates, and traditional feature matching relying on strict distance ratio thresholds struggles to adequately filter incorrect matches. Therefore, feature point extraction and matching in underwater environments suffer from poor robustness and low stability, necessitating targeted improvements.
Our SLAM system’s front-end employs SuperPoint [6] for feature extraction and ULightGlue for feature matching to enhance visual stability in harsh underwater lighting conditions. SuperPoint utilizes a self-supervised learning strategy, pre-trained on large-scale datasets, enabling robust keypoint feature extraction despite lighting variations, noise, and blur. Even in enhanced underwater images, where noise or blur may persist, SuperPoint delivers stable feature point detection, outperforming traditional feature detection methods.

3.2.1. Robust Feature Extraction Module

SuperPoint is a deep feature extraction method based on self-supervised learning, capable of simultaneous feature point detection and descriptor extraction [6]. Its architecture includes a shared convolutional encoder, a feature point detection decoder, and a descriptor decoder. First, the input image undergoes dimensionality reduction via the convolutional encoder to produce abstract feature maps. The feature point detection decoder then predicts whether each pixel is a feature point, while the descriptor decoder extracts the corresponding feature descriptors.
SuperPoint employs a self-supervised training approach. During training, it is first pre-trained on synthetic datasets, and then homographic adaptation is applied to unlabeled images to generate pseudo-ground-truth labels, enhancing detection robustness. SuperPoint's loss function consists of a detection loss $L_p$ and a descriptor loss $L_d$ as follows:
$$L_p = -\frac{1}{N}\sum_{i,j}\sum_{c=1}^{65} Y_{ij,c}\,\log X_{ij,c},$$
where $X_{ij,c}$ is the predicted feature point probability distribution, and $Y_{ij,c}$ is the ground-truth label. The descriptor loss adopts contrastive learning as follows:
$$L_d = \sum_{p,p'} S\,\max(0,\, m_p - d\cdot d') + (1-S)\,\max(0,\, d\cdot d' - m_n),$$
where $d\cdot d'$ denotes the cosine similarity between descriptors, $S$ is an indicator variable, and $m_p$ and $m_n$ are similarity margins for positive and negative samples, respectively. The total loss function is as follows:
$$L = L_p + \lambda_d L_d.$$
For each frame of enhanced images, SuperPoint detects salient corner or blob features and computes their corresponding descriptors. Its advantage lies in its self-supervised training (via synthetic homographies) on large-scale image datasets, enabling adaptation to various imaging degradations and geometric transformations. This paper adopts SuperPoint as the front-end feature extraction method primarily because of its strong robustness in low-light, low-contrast, or noise-polluted images, whereas traditional methods like SIFT and ORB perform poorly in underwater environments.
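For illustration, a minimal sketch of the detection loss as a 65-way cross-entropy over 8 × 8 cells; the tensor layout and the value of $\lambda_d$ are assumptions, and the official SuperPoint training code is not reproduced here.

```python
import torch
import torch.nn.functional as F

def detection_loss(logits, labels):
    """
    logits: (B, 65, Hc, Wc) raw detector scores per 8x8 cell (64 positions + dustbin).
    labels: (B, Hc, Wc) integer index of the ground-truth corner position in each cell.
    Implements L_p = -(1/N) * sum_ij sum_c Y_ijc * log X_ijc as a cross-entropy.
    """
    return F.cross_entropy(logits, labels)

def total_loss(logits, labels, descriptor_loss, lambda_d=1e-4):  # lambda_d is an assumed weight
    """Combine detection and descriptor terms: L = L_p + lambda_d * L_d."""
    return detection_loss(logits, labels) + lambda_d * descriptor_loss
```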

3.2.2. ULightGlue: CBAM-Enhanced Feature Matching

Matching Pseudo-Code
1. Input: an image pair $(A, B)$.
2. Extract SuperPoint keypoints and descriptors from $A$ and $B$.
3. Linearly project descriptors to a common embedding and form $Q/K/V$ for both sides.
4. Enhance features with CBAM (channel then spatial) on $A$ and $B$ to obtain $F_A$ and $F_B$ (see equations below and Figure 3).
5. Refine within-image context using self-attention on $F_A$ and $F_B$.
6. Exchange information with cross-attention between $A$ and $B$ to compute a similarity matrix $S$.
7. Compute match probabilities with the matching head (dual-softmax); keep mutual top-1 pairs above a confidence threshold to form $M$ and report confidences $c_{ij}$ (see the sketch after this list).
8. Output: mutual top-1 matches $M = \{(i, j)\}$ and their confidences $c_{ij}$, obtained after the mutual check and confidence thresholding.
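A minimal sketch of steps 6–7 above (dual-softmax matching head with mutual top-1 selection and confidence thresholding); the threshold value is an assumption, and the actual LightGlue/ULightGlue implementation is more elaborate.

```python
import torch

def match_from_similarity(S, sigma_a, sigma_b, conf_thresh=0.2):  # threshold is an assumption
    """
    S: (M, N) similarity matrix between features of images A and B.
    sigma_a: (M,), sigma_b: (N,) per-point matchability scores in [0, 1].
    Returns mutual top-1 matches above a confidence threshold.
    """
    P = sigma_a[:, None] * sigma_b[None, :] \
        * torch.softmax(S, dim=0) * torch.softmax(S, dim=1)   # dual-softmax match probabilities
    best_b = P.argmax(dim=1)               # best match in B for each point of A
    best_a = P.argmax(dim=0)               # best match in A for each point of B
    idx_a = torch.arange(P.shape[0])
    mutual = best_a[best_b] == idx_a       # mutual nearest-neighbor check
    conf = P[idx_a, best_b]
    keep = mutual & (conf > conf_thresh)
    return torch.stack([idx_a[keep], best_b[keep]], dim=1), conf[keep]
```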
After feature extraction, this work employed ULightGlue for feature matching instead of traditional brute-force matching. LightGlue [8] is a state-of-the-art deep matching method optimized from SuperGlue, specifically designed for real-time applications. It uses neural networks and attention mechanisms to dynamically adjust computational resources based on matching difficulty, enabling fast matching for regular frames while allocating more resources for loop closure to improve accuracy. Unlike traditional L2-distance-based matching, LightGlue simultaneously processes feature point sets from both images within the same network, establishing associations through self-attention and cross-attention [7]. Its core computations include the following:
1. Self-attention layers for intra-image feature enhancement:
$$a_{ij}^{\mathrm{self}} = q_i^{\top}\, R(p_j - p_i)\, k_j,$$
where $R(p_j - p_i)$ encodes the rotational displacement between points.
2. Cross-attention layers for inter-image feature matching:
$$a_{ij}^{I \leftarrow S} = (k_i^{I})^{\top} k_j^{S}.$$
3. Matching head for computing match probabilities:
$$P_{ij} = \sigma_i^{A}\,\sigma_j^{B}\, \underset{k \in A}{\mathrm{Softmax}}\,(S_{kj})_i\; \underset{k \in B}{\mathrm{Softmax}}\,(S_{ik})_j.$$
This work integrates the CBAM [20] module into the LightGlue network, forming the ULightGlue network to further enhance local and global feature representation in matching tasks. LightGlue's transformer-based architecture effectively captures global context, but pure self-attention may overlook fine local structures and channel-wise feature importance. To address this, we introduce CBAM for localized and channel-adaptive feature enhancement.
The CBAM module comprises a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). The CAM learns channel importance weights to emphasize informative feature channels, while the SAM highlights spatially critical regions to capture finer details.
The CAM is computed as follows:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big),$$
where $F$ is the input feature map, $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ denote average and max pooling, $\mathrm{MLP}$ is a multi-layer perceptron, and $\sigma$ is the Sigmoid activation.
The SAM is computed as follows:
$$M_s(F') = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big),$$
where $F'$ is the channel-attention-weighted feature map, and $\mathrm{Conv}(\cdot)$ is a convolution operation applied to the concatenated pooled features.
The final CBAM output is as follows:
$$F'' = M_s(F') \otimes F' = M_s\big(M_c(F) \otimes F\big) \otimes M_c(F) \otimes F,$$
where $\otimes$ denotes element-wise multiplication.
By adding CBAM before LightGlue’s transformer encoder, our model more effectively emphasizes local key information and channel importance, further improving matching accuracy and robustness.
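A minimal sketch of a CBAM block under the equations above, using the commonly used reduction ratio and 7 × 7 spatial kernel as assumed defaults; how the block is attached to the point-feature tensors inside ULightGlue is simplified here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg/max pooling) followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, F):                                                   # F: (B, C, H, W)
        avg = self.mlp(F.mean(dim=(2, 3)))                                  # MLP(AvgPool(F))
        mx = self.mlp(F.amax(dim=(2, 3)))                                   # MLP(MaxPool(F))
        Fc = F * torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)        # M_c(F) x F
        pooled = torch.cat([Fc.mean(dim=1, keepdim=True),
                            Fc.amax(dim=1, keepdim=True)], dim=1)           # [AvgPool; MaxPool] over channels
        return Fc * torch.sigmoid(self.spatial_conv(pooled))                # M_s(F') x F'
```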

3.2.3. Post-Matching Processing

After ULightGlue, we obtain a set of matched feature pairs with confidence scores. Although LightGlue has built-in filtering, we further enhance match quality with geometric consistency constraints:
  • Robust estimation (RANSAC): We fit essential matrices and homographies via RANSAC to ensure spatial geometric consistency and remove outliers (see the sketch after this list).
  • Confidence thresholding: Only matches above a confidence threshold are retained to reduce false matches.
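A minimal sketch of this post-matching stage using OpenCV's RANSAC-based essential-matrix estimation; the confidence and reprojection thresholds are assumptions, and the homography check is omitted for brevity.

```python
import cv2
import numpy as np

def filter_matches(pts_a, pts_b, K, conf, conf_thresh=0.5, ransac_thresh=1.0):  # thresholds assumed
    """Keep matches that pass confidence thresholding and essential-matrix RANSAC."""
    keep = conf > conf_thresh
    pts_a, pts_b = pts_a[keep], pts_b[keep]
    # RANSAC on the essential matrix enforces epipolar (geometric) consistency
    E, inlier_mask = cv2.findEssentialMat(pts_a, pts_b, K,
                                          method=cv2.RANSAC, prob=0.999,
                                          threshold=ransac_thresh)
    inliers = inlier_mask.ravel().astype(bool)
    return pts_a[inliers], pts_b[inliers]
```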

3.3. IMU Fusion for Robust Pose Estimation

To further enhance pose estimation accuracy, we adopt a tightly-coupled IMU fusion method, integrating accelerometer and gyroscope data from the Inertial Measurement Unit (IMU) into our SLAM system. IMUs provide high-frequency motion data unaffected by visual conditions, compensating for short-term pose estimation errors when camera frames degrade or motion blur occurs. This IMU pre-integration method resembles standard visual–inertial odometry (VIO) systems [21].

3.3.1. IMU’s Role in SLAM

IMU measurements are pre-integrated between consecutive keyframes to construct relative motion constraints. Given high-frequency raw accelerometer data $a(t)$ and gyroscope data $\omega(t)$, we compute the following integrals between keyframe timestamps $t_i$ and $t_j$:
Rotation increment:
$$\Delta R_{ij} \approx \int_{t_i}^{t_j} \Omega\big(\omega(t) - b_g\big)\, dt$$
Velocity increment:
$$\Delta v_{ij} = \int_{t_i}^{t_j} \big(R_i^{T}(a(t) - b_a) - g\big)\, dt$$
Position increment:
$$\Delta p_{ij} = \int_{t_i}^{t_j} v_i\, dt + \frac{1}{2}\int_{t_i}^{t_j} \big(R_i^{T}(a(t) - b_a) - g\big)\, t^2\, dt$$
where $b_g$ and $b_a$ are gyroscope and accelerometer biases (estimated online), $R_i$ is the rotation matrix of keyframe $i$, $v_i$ is its velocity, $g$ is gravity, and $\Omega(\cdot)$ converts angular velocity to rotation increments (integrated on $SO(3)$). The pre-integrated results $\Delta R_{ij}$, $\Delta v_{ij}$, and $\Delta p_{ij}$ represent the IMU-predicted relative rotation, velocity change, and position change between keyframes $i$ and $j$.
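A minimal discrete-time sketch of this pre-integration; gravity is handled in the residual (as in the joint-optimization equations below), the sampling interval is assumed constant, and bias Jacobians and noise propagation are omitted.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(acc, gyro, dt, b_a, b_g):
    """
    Discrete IMU pre-integration between two keyframes, expressed in the frame of keyframe i.
    acc, gyro: (K, 3) raw measurements sampled at interval dt; b_a, b_g: current bias estimates.
    Returns Delta_R (3x3), Delta_v (3,), Delta_p (3,) as in the equations above.
    """
    dR = np.eye(3)
    dv = np.zeros(3)
    dp = np.zeros(3)
    for a_k, w_k in zip(acc, gyro):
        a = dR @ (a_k - b_a)                  # bias-corrected acceleration rotated into frame i
        dp += dv * dt + 0.5 * a * dt**2
        dv += a * dt
        dR = dR @ Rotation.from_rotvec((w_k - b_g) * dt).as_matrix()  # SO(3) increment
    return dR, dv, dp
```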

3.3.2. Visual–Inertial Joint Optimization

During SLAM optimization, we incorporate IMU measurements as constraints for joint optimization with visual SLAM pose estimates. Specifically, we maintain a state vector containing the following: keyframe positions $p$, orientations (as quaternions $q$), velocities $v$, and IMU bias parameters.
For each pair of consecutive keyframes, we add IMU factors to the optimization objective function to constrain the error between IMU-predicted relative motion and estimated visual poses as follows:
$$r_{imu}^{ij} = \begin{bmatrix} R_i^{T}\big(p_j - p_i - v_i\,\Delta t - \tfrac{1}{2} g\,\Delta t^2\big) - \Delta p_{ij} \\ R_i^{T}\big(v_j - v_i - g\,\Delta t\big) - \Delta v_{ij} \\ \mathrm{vec}\big(\Delta R_{ij}^{T}\,(R_i^{T} R_j)\big) \end{bmatrix},$$
where the first term is the position error (difference between IMU-predicted and SLAM-estimated displacement), the second is the velocity error, and the third is the rotation error. These IMU residuals $r_{imu}^{ij}$ and their covariance matrices are added to the overall optimization objective as follows:
$$J = \sum_{k} \big\| r_{vision}^{k} \big\|_{\Sigma_v}^{2} + \sum_{i,j} \big\| r_{imu}^{ij} \big\|_{\Sigma_{imu}}^{2},$$
where $r_{vision}^{k}$ represents visual observation reprojection errors, and $\Sigma_v$ and $\Sigma_{imu}$ are the visual and IMU error covariance matrices, respectively. We employ sliding window nonlinear optimization to minimize $J$, similar to VINS-Mono's backend but adapted for underwater environments.
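A minimal sketch of the IMU residual above; the rotation error is computed with the SO(3) log map, which is one common reading of the $\mathrm{vec}(\cdot)$ term, and the covariance weighting is left to the optimizer.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def imu_residual(R_i, p_i, v_i, R_j, p_j, v_j, dR, dv, dp, g, dt):
    """IMU factor residual between keyframes i and j, following the equation above."""
    r_p = R_i.T @ (p_j - p_i - v_i * dt - 0.5 * g * dt**2) - dp    # position error
    r_v = R_i.T @ (v_j - v_i - g * dt) - dv                        # velocity error
    r_R = Rotation.from_matrix(dR.T @ (R_i.T @ R_j)).as_rotvec()   # rotation error (log map)
    return np.concatenate([r_p, r_v, r_R])
```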
Through IMU fusion, our system achieves the following:
  • Adaptability to lighting changes: In complete darkness or high turbidity, IMU provides stable short-term pose prediction when visual SLAM loses features
  • Drift correction: IMU acceleration measurements constrain monocular SLAM’s scale drift
  • Improved optimization accuracy: Visual–inertial joint optimization corrects visual estimation errors
This paper implements factor graph optimization using the GTSAM library, with the final VIO outputs including the following: time-series camera poses (position, orientation, velocity), IMU bias parameters, and a sparse 3D point cloud (from SuperPoint feature triangulation). In summary, IMU pre-integration, together with sliding-window optimization and factor-graph constraints, imposes metric scale and curbs drift in the VIO front-end.

3.4. Underwater Dense Mapping Based on NeRF

For underwater dense 3D reconstruction, we employ Neural Radiance Fields (NeRF) optimized via InstantNGP. NeRF represents 3D scenes as continuous volumetric fields, outputting color and density for any 3D point and viewing direction [17]. While traditional NeRF optimization is computationally expensive, InstantNGP [22] accelerates training significantly through multi-resolution hash grid encoding, making NeRF feasible for SLAM applications.

NeRF Modeling Pipeline

Once VIO completes a trajectory estimate (e.g., after a mission or loop closure), we train a NeRF model using SLAM-estimated keyframe poses and enhanced images for dense underwater mapping. The NeRF training process includes the following:
  • Camera parameter initialization: Using SLAM-estimated intrinsic and extrinsic parameters to ensure geometric consistency
  • Training with enhanced underwater images: Since NeRF relies on photometric consistency, using low-light enhanced images improves adaptation to underwater lighting variations
In implementation, camera poses are fixed to the VIO estimates during NeRF training; we do not jointly optimize poses with NeRF. This design maintains a clear separation between front-end state estimation and back-end mapping, and it avoids feedback loops that could overfit radiance-field artifacts into the pose graph. Consequently, NeRF training in this work improves map fidelity (view synthesis quality and geometric coherence) but does not alter SLAM trajectory accuracy (ATE). Any potential joint refinement of poses with NeRF is left to future work. Accordingly, the NeRF back end inherits the global metric scale from VIO; limitations under severe degradation and sensor noise are discussed in the dedicated subsection.
NeRF employs volume rendering for image synthesis. For a ray r through the scene, its rendered color is as follows:
$$C_{\mathrm{render}}(r) = \sum_{k=1}^{N} T_k\,\big(1 - \exp(-\sigma_k \delta_k)\big)\, c_k,$$
where $N$ samples are taken along the ray, $c_k$ and $\sigma_k$ are the color and density predicted by the NeRF MLP at sample $k$, $\delta_k$ is the distance between samples, and $T_k = \exp\!\big(-\sum_{l=1}^{k-1} \sigma_l \delta_l\big)$ is the transmittance (the probability that the ray reaches sample $k$ unobstructed). InstantNGP's hash grid encoding efficiently maps coordinates to a compact MLP, predicting $c_k$ and $\sigma_k$ with significantly lower computation than traditional NeRF.
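A minimal NumPy sketch of this compositing step for a single ray; it mirrors the rendering equation directly and omits the hash-grid encoding and the MLP that produce the densities and colors.

```python
import numpy as np

def render_ray_color(sigma, color, delta):
    """
    Composite N samples along one ray, following the volume rendering equation above.
    sigma: (N,) predicted densities; color: (N, 3) predicted RGB; delta: (N,) sample spacings.
    """
    alpha = 1.0 - np.exp(-sigma * delta)                        # per-sample opacity
    # T_k = exp(-sum_{l<k} sigma_l * delta_l): transmittance up to (not including) sample k
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0)               # rendered ray color
```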

3.5. Loss Function Design and Optimization

To adapt our model to underwater environments, we design a hybrid loss function combining traditional NeRF reconstruction loss with underwater-specific enhancements including ColorSSIM loss and NimaLoss. These ensure high-quality reconstruction while accurately reflecting underwater characteristics, significantly improving SLAM accuracy and robustness.

3.5.1. NeRF Reconstruction Loss

The reconstruction loss minimizes pixel differences between predicted and ground truth images as follows:
$$L_{NeRF} = \sum_{r \in R} \big\| \hat{C}(r) - C(r) \big\|_2^2,$$
where $\hat{C}(r)$ is the model prediction and $C(r)$ is the ground truth [17]. This ensures pixel-accurate underwater scene reconstruction for high-quality view synthesis in SLAM.

3.5.2. ColorSSIM Loss

This combines L1 loss, structural similarity (SSIM), and color loss to measure color and structural differences as follows:
$$L_{ColorSSIM} = 0.8 \cdot L_{fidelity} + 0.2 \cdot \big(1 - \mathrm{SSIM}(\hat{I}, I)\big),$$
where $L_{fidelity}$ is the pixel-wise L1 loss, $\mathrm{SSIM}(\hat{I}, I)$ measures local structural similarity [23], and an additional color loss uses Gaussian blur for enhanced color consistency. This effectively suppresses NeRF artifacts while preserving realistic underwater colors.

3.5.3. NimaLoss

This ensures aesthetic quality using NIMA (Neural Image Assessment) as follows:
$$L_{Nima} = L_{fidelity} + \gamma \cdot \big(10 - \mathrm{NIMA}(\hat{I})\big),$$
where $\mathrm{NIMA}(\hat{I})$ is the aesthetic score (0–10) [24], and $\gamma$ weights its contribution. This ensures visually pleasing reconstructions beyond just structural/color accuracy.
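A minimal sketch of how the three terms could be combined; ssim_fn and nima_fn are hypothetical callables standing in for the SSIM metric and the NIMA scorer, and the equal weighting of the three losses and the value of gamma are assumptions, not values reported here.

```python
import torch.nn.functional as F

def hybrid_loss(pred, target, rays_pred, rays_gt,
                ssim_fn, nima_fn, gamma=0.1):    # ssim_fn, nima_fn, gamma are assumed/hypothetical
    """Combine the NeRF reconstruction loss, ColorSSIM loss, and NimaLoss described above."""
    l_nerf = F.mse_loss(rays_pred, rays_gt)                        # per-ray photometric term
    l_fidelity = F.l1_loss(pred, target)                           # pixel-wise L1 fidelity
    l_colorssim = 0.8 * l_fidelity + 0.2 * (1.0 - ssim_fn(pred, target))
    l_nima = l_fidelity + gamma * (10.0 - nima_fn(pred))           # NIMA gives a 0-10 aesthetic score
    return l_nerf + l_colorssim + l_nima                           # equal weighting is an assumption
```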
This hybrid loss achieves optimal balance between image quality and underwater adaptation, significantly enhancing reconstruction quality and visual SLAM performance.

4. Results

4.1. Datasets

4.1.1. AquaLoc Harbor Sequences

To validate the performance of our proposed NeRF-SLAM framework, we conducted experiments using the harbor sequences from the AquaLoc Dataset. Specifically designed for underwater localization and SLAM tasks, this dataset provides synchronized measurements from a monocular camera, MEMS-based IMU, and pressure sensor [25].
The experimental sequences were collected by a Remotely Operated Vehicle (ROV). The recorded video sequences exhibit typical underwater visual degradation characteristics, including backscattering caused by turbid water, shadow effects from artificial lighting, and sudden illumination changes [25].

4.1.2. AquaLoc Archaeological Sequences

Beyond the harbor subset, we further evaluate on archaeological sequences to assess robustness in low-texture, low-light, and highly degraded underwater scenes. These sequences include two deep-water archaeological sites and follow the same sensor setup of AquaLoc-monocular camera, MEMS IMU, and pressure sensor [25]. Scenes exhibit extreme darkness, suspended particles, and intermittent artificial illumination, resulting in severe backscatter and contrast loss.

4.2. Image Enhancement Performance Analysis

4.2.1. Image Enhancement Evaluation Metrics

Since ground truth reference images are unavailable for underwater image enhancement tasks, traditional full-reference quality metrics like PSNR and SSIM cannot be applied. Therefore, we employ no-reference image quality assessment metrics including UIQM, Entropy, BRISQUE, and NIQE. These metrics objectively evaluate enhancement effects based on statistical characteristics and perceptual attributes of the images themselves, demonstrating good adaptability and wide applicability.
The UIQM (Underwater Image Quality Measure) [26], specifically designed for underwater images, comprehensively evaluates perceptual quality through weighted integration of color fidelity (UICM), sharpness (UISM), and contrast (UIConM) sub-metrics. Higher values indicate better visual quality, defined as follows:
$$\mathrm{UIQM} = c_1 \cdot \mathrm{UICM} + c_2 \cdot \mathrm{UISM} + c_3 \cdot \mathrm{UIConM},$$
where $c_1$, $c_2$, and $c_3$ are empirically determined weights, typically set to 0.0282, 0.2953, and 3.5753, respectively.
Entropy measures information richness, reflecting the complexity of grayscale distribution. Higher values indicate more detailed images as follows:
$$\mathrm{Entropy} = -\sum_{i=0}^{255} p_i \log_2(p_i),$$
where $p_i$ represents the probability of pixels with grayscale value $i$.
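A minimal sketch of this entropy computation for an 8-bit grayscale image.

```python
import numpy as np

def image_entropy(gray):
    """Shannon entropy of an 8-bit grayscale image, following the formula above."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                           # skip empty bins so log2 is well defined
    return float(-(p * np.log2(p)).sum())
```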
Both BRISQUE and NIQE are no-reference metrics based on Natural Scene Statistics (NSS). BRISQUE quantifies image distortion by comparing spatial domain statistical features with natural images, while NIQE measures the distance between the target image and an ideal natural image statistical model. Lower scores indicate more natural, higher-quality images.

4.2.2. Image Enhancement Comparative Experiments

To validate our underwater image enhancement network’s effectiveness within the NeRF-SLAM framework, we compare it against the following four representative enhancement methods: Zero-DCE [27], Shallow-UWnet [28], U-shape Transformer [29], and EnlightenGAN [30]. These methods cover various enhancement strategies including reference-free learning, shallow convolution, Transformer encoders, and adversarial generation. As shown in Figure 4, the enhancement methods are visually compared.

4.3. The Role of Underwater Feature Extraction Matching Network in SLAM

To validate the effectiveness of our approach, we conducted comparative experiments using SIFT, ORB, SuperPoint + SuperGlue, and SuperPoint + ULightGlue for underwater feature extraction and matching. The results demonstrate that SIFT and ORB primarily detect features on artificial structures while showing limited performance in low-texture regions, leading to frequent tracking failures. In contrast, SuperPoint + SuperGlue improves feature distribution uniformity through its learned feature detection network, though it still exhibits matching errors in low-contrast areas.
Figure 5 and Figure 6 present visual comparisons of four feature extraction and matching methods (SIFT, ORB, SuperPoint + SuperGlue, and SuperPoint + ULightGlue) at different frame intervals. Figure 5 demonstrates short-term matching performance between the current frame and subsequent 10 frames, evaluating feature density and matching accuracy in adjacent frames.
Figure 6 examines long-term matching stability across 60-frame intervals, reflecting each method’s capability in feature preservation and tracking robustness.

4.4. SLAM-Focused Experiments

4.4.1. Evaluation Metrics

For evaluating Simultaneous Localization and Mapping (SLAM) performance, we primarily use Absolute Trajectory Error (ATE) as the key metric. ATE measures global trajectory deviation, reflecting the SLAM system’s global consistency, as follows:
$$\mathrm{ATE}_{\mathrm{RMSE}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big\| p_i - \hat{p}_i \big\|^2},$$
where $p_i$ represents the ground truth pose of frame $i$, $\hat{p}_i$ denotes the estimated SLAM pose, and $N$ is the number of keyframes.
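A minimal sketch of this ATE RMSE computation; it assumes the estimated trajectory has already been aligned to the ground truth (e.g., with a similarity or SE(3) alignment), which is common practice but not detailed here.

```python
import numpy as np

def ate_rmse(gt_positions, est_positions):
    """Root-mean-square absolute trajectory error over N keyframe positions, both (N, 3)."""
    err = np.linalg.norm(gt_positions - est_positions, axis=1)   # per-keyframe position error
    return float(np.sqrt(np.mean(err**2)))
```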

4.4.2. Comparative Experiments

Table 1 describes the mainstream SLAM methods included in our comparative experiments, detailing their key characteristics including IMU support and underwater applicability.

4.4.3. Harbor SLAM Performance

Figure 7a–f presents our method’s trajectory results across six video sequences in the Harbor dataset. The image enhancement module, together with the combined underwater feature extraction and matching network and the IMU fusion module, significantly improves underwater trajectory accuracy. Compared to pure visual odometry, our solution demonstrates notable improvements in trajectory consistency, effectively suppressing cumulative drift.
Table 2 compares Absolute Trajectory Error (ATE) across various SLAM methods on real underwater harbor datasets.

4.4.4. Archaeological Sequences Comparative Results

The comparative analysis on archaeological sequences (Table 3) highlights our method’s robustness and accuracy under extreme underwater degradation. Unlike ORB-SLAM3, DXSLAM, and VINS-Mono, which exhibit frequent failures or large error fluctuations across sequences, our system completes tracking on all ten sequences and maintains errors within a narrow range (0.057–0.893).

4.4.5. Ablation Study

Table 4 presents ATE ablation with progressive component toggles focused on enhancement and matching. Columns are ordered to reflect a stepwise progression from the baseline front-end to the full system. Compared to vision-only and simple visual–IMU fusion, the method improves positioning accuracy in most sequences (01–06), with clear benefits in challenging cases (e.g., Sequence 05 decreases from 0.231 to 0.114; Sequence 07 transitions from failure to stable tracking at ATE = 0.712). NeRF acts as a mapping back-end with fixed poses and does not affect ATE.
Across the ablation, each front-end component is necessary and complementary under different failure modes as follows: enhancement improves detectable features and match stability in low-light and color-shift scenes; learned matching (SuperPoint + ULightGlue) raises inlier ratios and reduces mismatches under repetitive or low-texture patterns and longer baselines; and IMU fusion maintains trajectory continuity and scale during low parallax, short occlusions, or degraded frames. Removing any component leads to accuracy drops or failures on at least one sequence—for example, in Sequence 02, adding IMU alone can increase ATE due to sensor noise or calibration drift; Sequence 07 is prone to failures without IMU or robust matching; and Sequence 05 benefits more noticeably with enhancement. Combined, the three provide broader coverage of degradations and turn failures (e.g., Sequence 07) into trackable trajectories. Note that NeRF, as a mapping back-end driven by fixed poses, does not affect ATE but is necessary for reconstruction fidelity and geometric consistency.

4.5. Novel View Reconstruction Results Analysis

As shown in Figure 8, our proposed method achieves superior reconstruction quality on the AquaLoc dataset compared to other SLAM approaches.

5. Discussion

5.1. Image Enhancement Results

Table 5 presents quantitative comparisons across seven underwater sequences using five metrics. The method demonstrates superior performance across most metrics, particularly achieving the highest scores in UIQM, UCIQE, and Entropy across all sequences. For instance, in Sequence 3, our method achieves UIQM and UCIQE scores of 0.9495 and 0.3449 respectively, significantly outperforming Zero-DCE’s 0.9317 and 0.2358, while also surpassing Shallow-UWnet and EnlightenGAN. The Entropy values consistently exceed 7.3, whereas other methods typically hover around 6.5, confirming our method’s advantage in detail preservation and information recovery.
Figure 4 visually compares enhancement results across seven typical underwater sequences. The original images exhibit common underwater degradation characteristics including insufficient illumination, dark colors, and blurred details. While Zero-DCE improves brightness, the results appear grayish with weak color recovery. Shallow-UWnet enhances brightness and saturation but introduces reddish tones in Sequences 3 and 5. U-shape Transformer better preserves structural details but suffers from insufficient brightness and greenish tints in Sequences 4 and 7. EnlightenGAN provides stable brightness improvement but produces desaturated colors and blurred edges.
In contrast, our method delivers more balanced enhancement across most sequences, significantly improving brightness and contrast while maintaining natural colors and structural integrity. For example, in Sequences 1 and 6, our enhanced images exhibit natural colors and sharp edges with superior visual quality compared to other methods.
Both quantitative and qualitative analyses confirm our method’s adaptability and stability across various underwater degradation scenarios, effectively balancing quality improvement with semantic information preservation to provide more reliable visual input for subsequent SLAM and reconstruction tasks.

5.2. Discussions of the Role of Underwater Feature Extraction Matching Network in SLAM

The distribution and density of matching connections in Figure 5 and Figure 6 reveal that traditional methods like SIFT and ORB maintain limited matching capability at short intervals, but their performance degrades significantly as the frame gap increases, showing numerous mismatches (red lines), particularly under conditions of significant scene changes, occlusion, or underwater light attenuation. The learning-based SuperPoint + SuperGlue maintains good matching density at both intervals, demonstrating strong local invariance and spatial consistency. However, our proposed SuperPoint + ULightGlue consistently outperforms all alternatives, showing superior matching performance across all scenarios with high consistency and robustness in both short-term and long-term (60-frame) intervals while detecting stable structural features with minimal mismatches.
The analysis shows that the underwater feature extraction matching network significantly increases detected feature points compared to original low-light images, while ULightGlue improves the inlier ratio. These improvements directly enhance SLAM front-end tracking stability and reduce failure rates. In terms of computational efficiency, SuperPoint runs in real time on GPU, and ULightGlue maintains inter-frame matching at 10–20 ms through dynamic computation allocation, making it suitable for real-time SLAM systems.

5.3. SLAM Performance Analysis

5.3.1. Harbor SLAM Results

Figure 7a–f shows that the absolute pose errors (APE) across sequences support the system’s ability to maintain geometric consistency in complex underwater environments. The multi-module collaboration mechanism enables stable trajectory reconstruction in challenging scenarios (e.g., low-texture or dynamic environments) where traditional visual SLAM often fails, illustrating the combined effect of sensor fusion and neural representation enhancement.
The results also reveal the direct impact of image quality improvement on SLAM accuracy. The underwater image enhancement in this work improves contrast and texture details and provides more stable input for feature extraction and matching, thereby enhancing overall trajectory estimation accuracy. This further validates the importance of high-quality visual input for underwater SLAM performance.
Table 2 compares Absolute Trajectory Error (ATE) across various SLAM methods on real underwater harbor datasets. Traditional visual SLAM methods show significant limitations in complex underwater environments. ORB-SLAM2, DSO, LDSO, and Orbeez-SLAM exhibit severe tracking failures in multiple sequences, particularly in Sequences 1, 4, and 7. These limitations stem from underwater degradation factors including low-texture regions, uneven lighting, strong scattering, and turbidity-induced image quality reduction that challenge feature-based or photometric consistency frameworks.
In contrast, SVO demonstrates relative advantages in some sequences (1, 2, 3, and 6), achieving a minimum ATE of 0.0823 in Sequence 6, indicating better positioning accuracy and efficiency in short-distance, static, or weakly disturbed scenarios. However, SVO remains vulnerable to tracking interruptions in dynamic or heavily occluded scenes and fails in Sequences 4, 5, and 7.
ORB-SLAM3 with IMU shows improvement over vision-only configurations, delivering stable performance in mappable sequences. However, it still fails under extreme lighting variations or blur (Sequences 4 and 7), indicating insufficient robustness against typical underwater disturbances.
U-VIP-SLAM, as a multimodal system fusing visual, inertial, and pressure sensors, completes valid tracking in all sequences, demonstrating stable operation. It achieves notably low errors in favorable cases (e.g., 0.014 in Sequence 3 and 0.043 in Sequence 6), though performance degrades (1.093 and 0.704) in challenging dynamic and turbid conditions (Sequences 4 and 7).
Orbeez-SLAM (visual + NeRF) attains moderate accuracy where tracking succeeds (e.g., 0.042 in Sequence 3 and 0.109 in Sequence 6), but it fails in Sequences 1, 4, and 7.
This work’s method completes mapping in all sequences with ATEs of 0.249 (Sequence 1), 0.351 (Sequence 2), 0.018 (Sequence 3), 1.010 (Sequence 4), 0.114 (Sequence 5), 0.054 (Sequence 6), and 0.712 (Sequence 7). Compared with U-VIP-SLAM, it achieves lower errors in Sequence 1 and Sequence 5, remains comparable in Sequence 4, and is similar in Sequence 3, Sequence 6, and Sequence 7. These results indicate stable trajectory estimation under severe underwater degradation and dynamic conditions.

5.3.2. Archaeological SLAM Results

Analyzing the ATE results in Table 3 reveals substantial performance differences across SLAM methods on archaeological sequences. ORB-SLAM3 fails completely in five out of ten sequences (1, 2, 3, 6, and 9), and where it succeeds the errors vary widely (minimum 0.101 in Sequence 8; maximum 1.468 in Sequence 7), indicating instability in archaeological scenes. In contrast, DROID-SLAM fails only in Sequence 4 and keeps errors between 0.090 and 0.402 elsewhere, showing strong robustness; however, its error in Sequence 7 (0.218) is still slightly higher than our method’s 0.125.
DXSLAM fails in Sequences 2, 4, and 6, and it exhibits extreme variance in successful runs—achieving 0.036 minimum in Sequence 9 while reaching 2.931 in Sequence 4—suggesting high sensitivity to scene condition changes.
VINS-Mono fails in Sequences 2, 3, and 6; in successful runs, the average error is 0.941, peaking at 1.372 in Sequence 7, reflecting the limitations of pure visual–inertial fusion in archaeological scenes. SVIn2 fails in Sequences 4 and 6 and shows mixed performance: relatively good results (0.231 and 0.247) in Sequences 1 and 8, but very large errors (2.721 and 2.371) in Sequences 5 and 10, indicating instability under certain conditions. Our method completes tracking on all 10 sequences, with errors ranging from 0.057 to 0.893. Notably, 0.057 in Sequence 3 is the best among all methods on that sequence, and 0.893 in Sequence 4 is substantially better than DXSLAM’s 2.931 and VINS-Mono’s 1.093, demonstrating superior performance and stability under underwater degradation.
From specific cases, in Sequence 1 where ORB-SLAM3 fails completely, our method achieves 0.176, a 56.2% improvement over DROID-SLAM’s 0.402; in the most challenging Sequence 4, our 0.893 outperforms all alternatives; and in Sequence 5, our 0.114 improves by 95.8% over SVIn2’s 2.721. These results show that our method surpasses existing approaches not only in success rate but also in accuracy. Importantly, our method maintains stable operation across all sequences with the smallest error variability, demonstrating strong adaptability to diverse underwater challenges. By contrast, other methods either exhibit high failure rates or unusually large error fluctuations in specific scenes, limiting their reliability for practical underwater applications.

5.4. Novel View Reconstruction Results

As shown in Figure 8, our method demonstrates enhanced detail preservation and edge handling capabilities while maintaining better tracking continuity through its front-end network architecture. Notably, our system only experienced tracking and mapping failures in one scenario during testing.
This paper further compared the impact of different image enhancement methods on novel view synthesis quality. The results in Figure 8 confirm that our enhancement method significantly improves reconstruction quality in low-light underwater environments. The proposed approach effectively addresses the challenges of underwater imaging degradation, yielding more accurate and detailed reconstructions under challenging illumination conditions.

6. Conclusions

This paper presented a sensor-centric underwater visual–inertial SLAM system for low illumination and turbidity in marine robotic operations. The approach integrates a low-light image enhancement network to precondition input for SLAM, learned feature extraction and matching (SuperPoint and LightGlue) for robust correspondences, tightly coupled IMU fusion for steady pose estimation, and an accelerated NeRF back-end for dense, photorealistic reconstructions. Experiments on the AQUALOC dataset show improved tracking stability and reconstruction fidelity with competitive efficiency, validating that sensor data enhancement, multi-sensor fusion, and neural scene representations jointly strengthen underwater perception under adverse conditions.
From a measurement and control perspective, the system provides stable navigation states and dense situational awareness that support downstream guidance, trajectory planning, and inspection decision-making. In near-shore surveys, infrastructure inspection, and environmental monitoring, the combination of robust visual–inertial estimation and neural mapping yields actionable 3D products without requiring explicit depth sensors, improving mission reliability under limited lighting and fluctuating visibility. The system achieved competitive results on challenging underwater datasets, improving trajectory accuracy and producing high-quality reconstructions. This indicates the effectiveness of the integrated design in tackling underwater SLAM under adverse conditions.
Limitations and Future Work. Future work includes broadening sensing and fusion beyond monocular vision and IMU by integrating complementary modalities (e.g., DVL, multibeam/imaging sonar, depth, and magnetometers) within multi-rate, asynchronous factor-graph formulations. Cross-domain generalization across varying water types and illumination can be advanced through physics-informed priors for attenuation and scattering together with self-supervised domain adaptation. For robust loop closure and trajectory correction, semantics-aware place recognition and principled constraint selection are promising directions. On dense mapping, radiance-field models consistent with marine optics and resource-aware schedulers for embedded AUV platforms at larger scales constitute an important development. Georeferencing via acoustic beacons or surface-GPS alignment, as well as cooperative multi-robot mapping under distributed optimization, represent additional avenues for investigation.

Author Contributions

Conceptualization, Z.W. and Q.Z.; methodology, Z.W. and Q.Z.; software, Z.W. and Y.H.; validation, Z.W., Q.Z. and Y.H.; formal analysis, Z.W. and Q.Z.; investigation, Z.W. and Q.Z.; resources, B.Z.; data curation, B.Z.; writing—original draft preparation, Z.W. and Q.Z.; writing—review and editing, Z.W. and B.Z.; visualization, Z.W. and Q.Z.; supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2022YFD2401304).

Data Availability Statement

The data used in this study are openly available from the AQUALOC dataset at https://www.lirmm.fr/aqualoc/, accessed on 16 December 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, X.; Fan, X.; Shi, P.; Ni, J.; Zhou, Z. An overview of key SLAM technologies for underwater scenes. Remote Sens. 2023, 15, 2496. [Google Scholar] [CrossRef]
  2. Eustice, R.M.; Singh, H.; Leonard, J.J.; Walter, M.R. Visually mapping the RMS Titanic: Conservative covariance estimates for SLAM information filters. Int. J. Robot. Res. 2006, 25, 1223–1242. [Google Scholar] [CrossRef]
  3. Rahman, S.; Li, A.Q.; Rekleitis, I. SVIn2: An underwater SLAM system using sonar, visual, inertial, and depth sensor. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1861–1868. [Google Scholar]
  4. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  5. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  6. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  7. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4938–4947. [Google Scholar]
  8. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17627–17638. [Google Scholar]
  9. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  10. Li, J.; Skinner, K.A.; Eustice, R.M.; Johnson-Roberson, M. WaterGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 2017, 3, 387–394. [Google Scholar]
  11. Akkaynak, D.; Treibitz, T. Sea-thru: A method for removing water from underwater images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1682–1691. [Google Scholar]
  12. Xie, Y.; Yu, Z.; Yu, X.; Zheng, B. Lighting the darkness in the sea: A deep learning model for underwater image enhancement. Front. Mar. Sci. 2022, 9, 921492. [Google Scholar] [CrossRef]
  13. Xin, Z.; Wang, Z.; Yu, Z.; Zheng, B. ULL-SLAM: Underwater low-light enhancement for the front-end of visual SLAM. Front. Mar. Sci. 2023, 10, 1133881. [Google Scholar] [CrossRef]
  14. Qiu, H.; Tang, Y.; Wang, H.; Wang, L.; Xiang, D.; Xiao, M. An Improved Underwater Visual SLAM through Image Enhancement and Sonar Fusion. Remote Sens. 2024, 16, 2512. [Google Scholar]
  15. Ribas, D.; Ridao, P.; Neira, J. Underwater SLAM for Structured Environments Using an Imaging Sonar; Springer: Berlin/Heidelberg, Germany, 2010; Volume 65. [Google Scholar]
  16. Bahr, A.; Leonard, J.J.; Fallon, M.F. Cooperative localization for autonomous underwater vehicles. Int. J. Robot. Res. 2009, 28, 714–728. [Google Scholar] [CrossRef]
  17. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  18. Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. iMAP: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 6229–6238. [Google Scholar]
  19. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  21. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar]
  22. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 2022, 41, 1–15. [Google Scholar] [CrossRef]
  23. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  24. Talebi, H.; Milanfar, P. NIMA: Neural image assessment. IEEE Trans. Image Process. 2018, 27, 3998–4011. [Google Scholar] [CrossRef]
  25. Ferrera, M.; Creuze, V.; Moras, J.; Trouvé-Peloux, P. AQUALOC: An underwater dataset for visual–inertial–pressure localization. Int. J. Robot. Res. 2019, 38, 1549–1559. [Google Scholar] [CrossRef]
  26. Panetta, K.; Gao, C.; Agaian, S. Human-visual-system-inspired underwater image quality measures. IEEE J. Ocean. Eng. 2015, 41, 541–551. [Google Scholar] [CrossRef]
  27. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
  28. Naik, A.; Swarnakar, A.; Mittal, K. Shallow-UWnet: Compressed model for underwater image enhancement (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 15853–15854. [Google Scholar]
  29. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef]
  30. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
  31. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
  32. Gao, X.; Wang, R.; Demmel, N.; Cremers, D. LDSO: Direct sparse odometry with loop closure. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 2198–2204. [Google Scholar]
  33. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2016, 33, 249–265. [Google Scholar] [CrossRef]
  34. Amarasinghe, C.; Rathnaweera, A.; Maithripala, S. U-VIP-SLAM: Underwater visual-inertial-pressure SLAM for navigation of turbid and dynamic environments. Arab. J. Sci. Eng. 2024, 49, 3193–3207. [Google Scholar] [CrossRef]
  35. Chung, C.M.; Tseng, Y.C.; Hsu, Y.C.; Shi, X.Q.; Hua, Y.H.; Yeh, J.F.; Chen, W.C.; Chen, Y.T.; Hsu, W.H. Orbeez-SLAM: A Real-time Monocular Visual SLAM with ORB Features and NeRF-realized Mapping. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9400–9406. [Google Scholar]
Figure 1. The underwater SLAM system architecture with four modules: (1) an image enhancement convolutional neural network (CNN) for preprocessing input frames; (2) feature detection (SuperPoint) and matching (ULightGlue) modules forming the visual front-end; (3) a tightly coupled visual–inertial state estimator that delivers metric-scale keyframe poses to the NeRF back-end; and (4) NeRF-based dense reconstruction.
Figure 2. Architecture of the underwater low-light enhancement network. This encoder–decoder convolutional neural network (CNN) employs SiLU activation functions and instance normalization, with skip connections (grey arrows) transmitting spatial information between the encoder and decoder. A global attention module (GAM) is incorporated at the bottleneck of the network to refine the global feature representation.
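For readers who want a concrete starting point, the following PyTorch sketch mirrors the encoder–decoder pattern of Figure 2 (Conv + instance normalization + SiLU blocks, skip connections, and an attention module at the bottleneck). The depth, channel widths, and the plug-in attention argument are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    # Conv -> InstanceNorm -> SiLU, matching the caption's building block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.SiLU(),
    )

class EnhancementNet(nn.Module):
    """Illustrative encoder-decoder with skip connections and a bottleneck attention module."""

    def __init__(self, attention: nn.Module, base: int = 32):
        super().__init__()
        self.enc1 = conv_block(3, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(conv_block(base * 2, base * 4), attention)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = conv_block(base * 4 + base * 2, base * 2)
        self.dec1 = conv_block(base * 2 + base, base)
        self.head = nn.Conv2d(base, 3, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution, global attention applied here
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))   # skip connection from enc2
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))  # skip connection from enc1
        return torch.sigmoid(self.head(d1))                  # enhanced RGB in [0, 1]
```

As a smoke test, EnhancementNet(attention=nn.Identity()) maps a 1 × 3 × H × W tensor (H and W divisible by 4) to an enhanced image of the same size; the GAM described in the caption, or the CBAM sketched after Figure 3, can be dropped in as the attention argument provided it preserves the channel count.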
Figure 3. Block diagram of ULightGlue with CBAM insertion. CBAM (channel + spatial) is applied before transformer layers to enhance local saliency and channel importance, improving matching in low-light underwater scenes.
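Figure 3 specifies the CBAM insertion only at block-diagram level. For concreteness, the standard CBAM of Woo et al. [20] is sketched below as a stand-alone PyTorch module; the reduction ratio and 7 × 7 spatial kernel are the defaults from that paper, and how ULightGlue wires the module into its descriptor tensors is not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over global average- and max-pooled descriptors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        scale = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * scale.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: a convolution over channel-wise average and max maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```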
Figure 4. Visual comparison of different enhancement methods across seven harbor video sequences.
Figure 5. Visualization of feature extraction and matching between the current frame and the subsequent 10 frames.
Figure 6. Visualization of feature extraction and matching between the current frame and the frame 60 frames later.
Figure 7. Trajectory results of our method on harbor video sequences. Solid lines denote estimated trajectories; dashed lines denote ground-truth trajectories.
Figure 8. Novel view synthesis comparison of our network.
Table 1. Description of SLAM methods used in comparative experiments (✓ indicates supported; ✗ indicates not supported).
Method | Description | IMU Support | Underwater Applicable | NeRF Applicable
ORB-SLAM2 [4] | Sparse feature matching + loop closure
ORB-SLAM3 [5] | ORB-SLAM2 with IMU
DSO [31] | Sparse direct method
LDSO [32] | DSO with feature points
SVO [33] | Semi-direct method
U-VIP-SLAM [34] | Visual–inertial pressure fusion
Orbeez-SLAM [35] | ORB-SLAM2 with NeRF-realized mapping
Table 2. Harbor dataset ATE comparison (✗ indicates tracking failure).
Method | Seq1 | Seq2 | Seq3 | Seq4 | Seq5 | Seq6 | Seq7
ORB-SLAM2 | 0.442, 0.031, 0.148, 0.115 (remaining sequences ✗)
DSO | 0.634, 0.258, 0.673, 0.245 (remaining sequences ✗)
LDSO | 0.78 (remaining sequences ✗)
SVO | 0.490, 0.562, 0.266, 0.0823 (remaining sequences ✗)
ORB-SLAM3 (no IMU) | 0.514, 0.201, 0.158, 0.302, 0.142 (remaining sequences ✗)
ORB-SLAM3 (with IMU) | 0.514, 0.201, 0.158, 0.302, 0.142 (remaining sequences ✗)
U-VIP-SLAM | 0.254 | 0.334 | 0.014 | 1.093 | 0.126 | 0.043 | 0.704
Orbeez-SLAM | 0.436, 0.042, 0.144, 0.109 (remaining sequences ✗)
This work | 0.249 | 0.351 | 0.018 | 1.010 | 0.114 | 0.054 | 0.712
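The ATE figures in Tables 2–4 are absolute trajectory errors. Assuming the common evaluation convention of rigidly (or, optionally, similarity-) aligning the estimated trajectory to the ground truth and then taking the RMSE of the translational residuals, a minimal NumPy sketch looks as follows; this reflects standard practice rather than the exact evaluation script used here.

```python
import numpy as np

def align_umeyama(est: np.ndarray, gt: np.ndarray, with_scale: bool = False):
    """Least-squares alignment (Umeyama) of est (N, 3) onto gt (N, 3); returns scale, R, t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / est.shape[0]           # cross-covariance between ground truth and estimate
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                     # keep R a proper rotation
    R = U @ S @ Vt
    s = (D * np.diag(S)).sum() / e.var(axis=0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est: np.ndarray, gt: np.ndarray, with_scale: bool = False) -> float:
    """Absolute trajectory error: RMSE of translational residuals after alignment."""
    s, R, t = align_umeyama(est, gt, with_scale)
    residuals = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((residuals ** 2).sum(axis=1).mean()))
```

Setting with_scale=True performs similarity (Sim(3)) alignment, the usual choice for purely monocular estimates whose metric scale is unobservable; for visual–inertial trajectories, rigid SE(3) alignment is the stricter and more common option.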
Table 3. Archaeological sequences ATE comparison (✗ indicates tracking failure).
Method | ORB-SLAM3 | DROID-SLAM | DXSLAM | VINS-Mono | SVIn2 | This Work
Sequence 1 | ✗ | 0.402 | 0.108 | 1.282 | 0.231 | 0.176
Sequence 2 | 0.109, 2.440, 0.082 (remaining methods ✗)
Sequence 3 | 0.090, 0.154, 0.280, 0.057 (remaining methods ✗)
Sequence 4 | 1.450, 2.931, 1.093, 0.893 (remaining methods ✗)
Sequence 5 | 0.263 | 0.101 | 0.135 | 0.302 | 2.721 | 0.114
Sequence 6 | 0.121, 1.093, 0.609, 0.096 (remaining methods ✗)
Sequence 7 | 1.468 | 0.218 | 0.891 | 1.372 | 1.053 | 0.125
Sequence 8 | 0.101 | 0.129 | 0.153 | 1.216 | 0.247 | 0.094
Sequence 9 | 0.237, 0.036, 0.528, 1.509, 0.157 (remaining method ✗)
Sequence 10 | 0.343 | 0.218 | 0.550 | 0.472 | 2.371 | 0.201
Table 4. Ablation study on harbor dataset ATE comparison. ✗ indicates tracking failure.
Sequence | Enhancement + Traditional Matching (ORB) | No Enhancement + IMU + Traditional Matching (ORB) | Enhancement + IMU + Traditional Matching (ORB) | No Enhancement + IMU + SuperPoint + ULightGlue | Enhancement + SuperPoint + ULightGlue | This Work
01 | 0.351 | 0.361 | 0.328 | 0.307 | 0.293 | 0.249
02 | 0.327 | 0.313 | 0.365 | 0.311 | 0.346 | 0.351
03 | 0.027 | 0.159 | 0.024 | 0.143 | 0.022 | 0.018
04 | 1.193, 1.227, 1.174, 1.171, 1.010 (remaining configuration ✗)
05 | 0.135 | 0.296 | 0.231 | 0.265 | 0.129 | 0.114
06 | 0.062 | 0.148 | 0.064 | 0.143 | 0.060 | 0.054
07 | 1.362, 1.418, 0.712 (remaining configurations ✗)
Table 5. Quantitative comparison of different enhancement methods across harbor sequences. Arrows indicate metric direction (↑ higher is better; ↓ lower is better).
Method | Metric | Seq1 | Seq2 | Seq3 | Seq4 | Seq5 | Seq6 | Seq7
Zero-DCE | UIQM ↑ | 0.7955 | 0.7441 | 0.9317 | 0.8149 | 0.8526 | 0.8853 | 0.7940
Zero-DCE | UCIQE ↑ | 0.2255 | 0.2113 | 0.2358 | 0.2392 | 0.2475 | 0.2760 | 0.2327
Zero-DCE | Entropy ↑ | 5.8002 | 5.2741 | 6.4291 | 6.5125 | 6.5019 | 7.0024 | 6.2224
Zero-DCE | BRISQUE ↓ | 19.9897 | 18.6711 | 18.6962 | 18.9460 | 19.5515 | 19.4909 | 18.9793
Zero-DCE | NIQE ↓ | 7.7055 | 7.5817 | 7.4891 | 8.0262 | 7.5846 | 7.5628 | 8.0030
U-shape Transformer | UIQM ↑ | 0.9057 | 0.9307 | 0.9065 | 0.7552 | 0.9031 | 0.9153 | 0.7685
U-shape Transformer | UCIQE ↑ | 0.3238 | 0.3623 | 0.333 | 0.3410 | 0.3815 | 0.3523 | 0.3816
U-shape Transformer | Entropy ↑ | 6.9079 | 6.9867 | 7.3971 | 7.6678 | 7.6160 | 7.5289 | 7.6277
U-shape Transformer | BRISQUE ↓ | 18.7233 | 18.3398 | 17.9371 | 18.4870 | 18.1504 | 17.9974 | 18.1754
U-shape Transformer | NIQE ↓ | 7.1752 | 5.1420 | 6.5793 | 4.8326 | 5.4725 | 5.6198 | 5.2366
EnlightenGAN | UIQM ↑ | 0.9336 | 0.9222 | 0.9302 | 0.8814 | 0.9176 | 0.8972 | 0.8694
EnlightenGAN | UCIQE ↑ | 0.2250 | 0.2101 | 0.2642 | 0.2363 | 0.2465 | 0.2907 | 0.2287
EnlightenGAN | Entropy ↑ | 6.9588 | 6.6930 | 7.0863 | 7.1678 | 7.1654 | 7.3643 | 6.9132
EnlightenGAN | BRISQUE ↓ | 19.3139 | 18.3069 | 18.6682 | 18.9870 | 19.1707 | 19.6137 | 18.8209
EnlightenGAN | NIQE ↓ | 6.7237 | 6.5303 | 7.1036 | 7.0057 | 6.8601 | 7.1935 | 7.0622
Shallow-UWnet | UIQM ↑ | 0.9548 | 0.9379 | 0.9312 | 0.8692 | 0.9477 | 0.9438 | 0.8756
Shallow-UWnet | UCIQE ↑ | 0.2230 | 0.2074 | 0.2573 | 0.2349 | 0.2426 | 0.2865 | 0.2261
Shallow-UWnet | Entropy ↑ | 6.9079 | 6.6563 | 7.1956 | 7.1678 | 6.9957 | 7.4194 | 6.7808
Shallow-UWnet | BRISQUE ↓ | 17.5433 | 17.3073 | 17.5076 | 18.5058 | 17.6585 | 17.7670 | 18.1249
Shallow-UWnet | NIQE ↓ | 5.3895 | 5.0876 | 6.5083 | 5.1091 | 5.6672 | 6.0896 | 5.4278
This work | UIQM ↑ | 0.9675 | 0.9472 | 0.9495 | 0.9524 | 0.9545 | 0.9621 | 0.9428
This work | UCIQE ↑ | 0.3149 | 0.3183 | 0.3449 | 0.3337 | 0.3474 | 0.3712 | 0.3680
This work | Entropy ↑ | 7.1892 | 7.3633 | 7.6642 | 7.4236 | 7.3558 | 7.5855 | 7.5404
This work | BRISQUE ↓ | 19.8165 | 18.4204 | 18.5951 | 18.7886 | 19.2285 | 19.2808 | 18.8700
This work | NIQE ↓ | 7.1419 | 7.1526 | 7.6035 | 7.4334 | 7.2404 | 7.2551 | 7.7188
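The scores in Table 5 are all no-reference image quality metrics computed with their published formulations. The Entropy column is, by the usual convention we assume here, the Shannon entropy of the 8-bit grayscale histogram, which the following minimal sketch computes:

```python
import numpy as np

def image_entropy(gray_u8: np.ndarray) -> float:
    """Shannon entropy (in bits) of an 8-bit grayscale image histogram."""
    hist = np.bincount(gray_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty bins so log2 is well defined
    return float(-(p * np.log2(p)).sum())
```

Values approaching the 8-bit maximum indicate a well-spread intensity distribution, consistent with the higher Entropy scores reported for the enhanced outputs.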
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
