Article

RAEM-SLAM: A Robust Adaptive End-to-End Monocular SLAM Framework for AUVs in Underwater Environments

1 College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
2 School of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(8), 579; https://doi.org/10.3390/drones9080579
Submission received: 12 July 2025 / Revised: 4 August 2025 / Accepted: 11 August 2025 / Published: 15 August 2025


Highlights

What are the main findings?
  • We proposed RAEM-SLAM, a robust adaptive end-to-end monocular SLAM framework for AUVs in underwater environments. It integrates novel Residual Semantic–Spatial Attention Modules (RSSA) to enhance feature robustness against poor illumination and dynamic interference, and Local–Global Perception Block (LGP) for multi-scale motion perception.
  • We designed a Physics-guided Underwater Adaptive Augmentation (PUAA) method, dynamically converting terrestrial datasets into pseudo-underwater images by simulating light attenuation, scattering, and noise to bridge domain gaps for effective training.
What is the implication of the main finding?
  • RAEM-SLAM, equipped with integrated RSSA and LGP, enables accurate real-time trajectory estimation and mapping for AUVs in GPS-denied underwater environments, advancing the autonomous exploration of AUVs in unknown regions (e.g., deep-sea archaeology, resource exploitation).
  • PUAA significantly improves the adaptability and robustness of the system in complex underwater environments through realistic domain transformation.

Abstract

Autonomous Underwater Vehicles (AUVs) play a critical role in ocean exploration. However, due to the inherent limitations of most sensors in underwater environments, achieving accurate navigation and localization in complex underwater scenarios remains a significant challenge. While vision-based Simultaneous Localization and Mapping (SLAM) provides a cost-effective alternative for AUV navigation, existing methods are primarily designed for terrestrial applications and struggle to address underwater-specific issues, such as poor illumination, dynamic interference, and sparse features. To tackle these challenges, we propose RAEM-SLAM, a robust adaptive end-to-end monocular SLAM framework for AUVs in underwater environments. Specifically, we propose a Physics-guided Underwater Adaptive Augmentation (PUAA) method that dynamically converts terrestrial scene datasets into physically realistic pseudo-underwater images for the augmentation training of RAEM-SLAM, improving the system’s generalization and adaptability in complex underwater scenes. We also introduce a Residual Semantic–Spatial Attention Module (RSSA), which utilizes a dual-branch attention mechanism to effectively fuse semantic and spatial information. This design enables adaptive enhancement of key feature regions and suppression of noise interference, resulting in more discriminative feature representations. Furthermore, we incorporate a Local–Global Perception Block (LGP), which integrates multi-scale local details with global contextual dependencies to significantly improve AUV pose estimation accuracy in dynamic underwater scenes. Experimental results on real-world underwater datasets demonstrate that RAEM-SLAM outperforms state-of-the-art SLAM approaches in enabling precise and robust navigation for AUVs.

1. Introduction

In recent years, the development of Autonomous Underwater Vehicles (AUVs) has expanded the possibilities for ocean exploration. AUVs are increasingly utilized in diverse applications, including seabed structure mapping [1], underwater facility maintenance [2], underwater archaeology [3], and resource exploitation [4]. Accurate navigation and localization are critical for performing these complex tasks. However, in challenging underwater environments, radio signals attenuate rapidly, rendering the Global Positioning System (GPS) ineffective for AUV navigation. Acoustic navigation, which estimates the AUV’s position through acoustic signal exchange between beacons and the vehicle [5], relies on pre-deployed transponders. This reliance results in high deployment costs, restricting AUV autonomy in unexplored regions. In contrast, vision-based Simultaneous Localization and Mapping (SLAM) offers a more efficient and cost-effective solution by utilizing monocular cameras to concurrently estimate the AUV’s position and reconstruct environmental maps. Traditional visual SLAM methods are generally categorized by their frontend tracking strategy: feature-based approaches [6,7,8] and direct methods [9,10]. While both have demonstrated strong performance in terrestrial environments, their robustness significantly diminishes in underwater settings, leading to unstable or even lost trajectory tracking [11,12,13].
With the advent of deep learning, learning-based visual SLAM has emerged as a promising research direction. Existing approaches primarily fall into two categories. The first category integrates deep learning modules into traditional SLAM frameworks by replacing specific components [14,15,16]. However, this integration is often complex, and the overall system performance remains constrained by traditional modules [17]. The second category comprises end-to-end SLAM methods [18,19], which directly estimate poses or construct maps from raw image sequences, bypassing explicit feature extraction, matching, and other traditional steps. These approaches eliminate compatibility concerns and show improved adaptability to dynamic and complex environments [20].
Underwater scenes where AUVs operate present unique challenges compared to terrestrial environments, including poor illumination and frequent dynamic interference, which make accurate target feature extraction and representation difficult. Recent studies have introduced semantic information to enhance network perception of target objects [21,22,23], while others have leveraged spatial attention mechanisms to improve the model’s awareness of spatial structures [24,25,26]. However, these approaches often fail to jointly consider both semantic and spatial cues during feature learning, potentially leading to semantic ambiguity or spatial misalignment. To address this issue, we propose the Residual Semantic–Spatial Attention Module, which integrates positional and semantic information through a dual-branch semantic–spatial attention mechanism. This module adaptively enhances key feature regions while suppressing noise, thereby producing more stable and discriminative deep features.
Moreover, underwater environments often lack abundant reference structures, leading to sparse features and significant errors in subsequent pose estimation. To mitigate this, recent research has focused on incorporating multi-scale feature information to improve the perception of local details, thereby enhancing performance in complex tasks such as image segmentation and object detection [27,28,29,30,31]. At the same time, transformers [32,33,34] have demonstrated strong capabilities in modeling long-range dependencies via multi-head self-attention, enabling the capture of global contextual features in complex scenes. Motivated by these insights, we propose the Local–Global Perception Block, which extracts local multi-scale features and global contextual representations in parallel. This block enhances the system’s robustness to viewpoint variations and its motion perception for AUVs in complex conditions, thereby increasing the accuracy of trajectory estimation. Building on these components, we present RAEM-SLAM, a deep learning-based SLAM framework specifically designed for AUVs in underwater environments, based on the DROID-SLAM architecture. Given a sequence of monocular images as input, RAEM-SLAM performs real-time, accurate trajectory estimation and dense map reconstruction.
The main contributions of this work are summarized as follows:
  • We propose RAEM-SLAM, a robust adaptive end-to-end monocular SLAM framework specifically designed for AUVs in underwater environments.
  • We propose a physics-guided underwater adaptive augmentation method that dynamically transforms terrestrial scene datasets into physically realistic pseudo-underwater images for the training of RAEM-SLAM, enhancing its adaptability in complex underwater environments.
  • We design and integrate a Residual Semantic–Spatial Attention Module into the feature extraction network to enhance the accuracy and stability of feature learning in underwater scenes.
  • We embed a Local–Global Perception Block during the state update stage. This block fuses multi-scale local details with global information to enhance the system’s ability to perceive AUV motion, further improving trajectory estimation accuracy.

2. Related Work

2.1. Deep Learning in Visual SLAM

Deep learning has been integrated into visual SLAM systems primarily through two approaches. The first approach involves replacing specific modules in traditional SLAM pipelines with deep learning components, such as CNN-based feature extractors [14,35,36]. For example, SuperVINS [15] integrates SuperPoint [37] and LightGlue [38] for keypoint extraction and matching, significantly improving localization in low-light and motion-blurred conditions. Similar enhancements have been proposed for loop closure detection [39,40,41,42]. However, these hybrid approaches often face compatibility issues, such as the mismatch between binary ORB descriptors [43] and floating-point deep features, which can lead to suboptimal performance. The second approach focuses on end-to-end SLAM systems that jointly learn localization and mapping without relying on traditional modules. DeepVO [44] pioneered this framework, using RCNNs to estimate poses from raw RGB sequences. UnDeepVO [45] extended this approach by incorporating unsupervised training through photometric losses, thereby reducing the dependence on annotations. DROID-SLAM [18] introduced a Dense Bundle Adjustment (DBA) layer for recursive pose and depth refinement, achieving high accuracy. Although effective on terrestrial datasets, these methods struggle considerably in underwater environments due to poor illumination, weak textures, and dynamic interference.

2.2. Underwater Visual SLAM

To address the unique challenges of underwater environments, scholars have proposed various specialized underwater visual SLAM systems. For instance, Zacchini et al. [46] validated the feasibility of integrating SIFT and SURF features with altimeter-based scale recovery for AUV navigation, though this method relies on feature descriptors and fails to handle dynamic interference issues. Ferrera et al. [47] proposed the UW-VO system, introducing optical flow tracking and retracking mechanisms to tackle turbid waters and fish swarm occlusions. However, this system lacks scene recognition and global optimization modules, limiting its application in large-scale scenarios. Building upon ORB-SLAM, the UVS system [48] resolved planar ambiguity through three-view initialization and utilized full-feature matching to enhance map density, but its assumption of constant-velocity motion causes pose jumps during AUV acceleration or deceleration.
Meanwhile, deep learning offers promising solutions for underwater visual degradation. Yang et al. [49] recently proposed the Underwater Feature Extraction Network, specifically designed for underwater environments. This framework employs cross-modal knowledge distillation to transfer knowledge from a pre-trained SuperPoint [37] teacher model to a student network, integrating it into ORB-SLAM3 to form the UFEN-SLAM system. For improving loop closure robustness, Burguera et al. [50] introduced an outlier-resistant loop detection framework tailored for underwater visual graph optimization SLAM. It utilizes a lightweight Siamese Convolutional Neural Network to rapidly screen potential loop closure image pairs, thereby increasing subsequent effective loop detection rates. However, both learning-based frameworks require integrating specialized networks into traditional SLAM architectures, and their overall performance remains constrained by fundamental limitations of traditional pipelines.
Consequently, to address these challenges more comprehensively, we propose RAEM-SLAM, an end-to-end visual SLAM framework specifically tailored for underwater conditions.

2.3. Semantic–Positional Feature Fusion

Underwater environments experience significant illumination attenuation due to wavelength absorption and scattering by suspended particles, resulting in poor imaging quality and making feature extraction more challenging. Recent studies have sought to enhance neural networks’ scene understanding by incorporating semantic information. For example, Qi et al. [21] introduced a Semantic Region-wise Enhancement Module to improve multi-scale feature enhancement, while Chen et al. [22] proposed USSSN, which leverages deformable convolutions to estimate semantic attention maps. Liang et al. [23] designed MS-SGA-GCN, which fuses global and local semantic features using graph convolutional networks, thereby improving multi-label image recognition. However, these methods often overlook spatial relationships between objects, which limits positional accuracy. To address this gap, researchers have introduced spatial attention mechanisms. Qi et al. [24] proposed SAWU-Net, employing pixel- and window-based spatial attention, while Zhang et al. [25] developed RFAConv, which uses receptive field-aware attention, and Bai et al. [26] presented a layered spatial attention method to enhance key-point detection. Despite these advancements, existing models still lack a unified understanding of both spatial and semantic information. To bridge this gap, we propose the Residual Semantic–Spatial Attention Module, which effectively integrates spatial structure and semantic cues to produce more discriminative features and enhance scene comprehension in underwater environments.

2.4. Local–Global Feature Fusion

Underwater environments lack abundant reference structures, making feature extraction particularly challenging. Accurate trajectory estimation and map construction require the joint utilization of local detail features and global structural information. Recent advancements have demonstrated that multi-scale feature fusion can enhance model performance in complex tasks such as segmentation and object detection. Serial skip connections [27,28] fuse hierarchical features but may lose fine details, while parallel architectures [29,30,31] process multiple scales concurrently, capturing richer contextual cues. Notable works in this area include PSPNet [51], which employs pyramid pooling for global context, M-FFN [52], which integrates atrous convolutions for adaptive context encoding, and PANet [53], which fuses multi-scale features using a point attention module. However, CNNs inherently suffer from limited receptive fields, restricting their ability to model long-range dependencies [54,55]. Transformer-based methods [32] address this limitation by capturing global context through multi-head attention. For example, TransUNet [33] combines Transformers with U-Net to improve segmentation accuracy and generalization, while STFN [34] enhances image quality assessment through Swin Transformer-based fusion. In DROID-SLAM, the update module relies on local convolutions to produce the optical flow correction term, which lacks global motion constraints. To overcome this limitation, we propose the Local–Global Perception Block, which captures both local multi-scale features and global representations, significantly improving pose estimation accuracy in underwater environments.

3. Method

3.1. Overall Architecture

We propose RAEM-SLAM, as shown in Figure 1, an end-to-end monocular visual SLAM framework specifically tailored for underwater environments. This framework accurately estimates camera trajectories and reconstructs 3D maps. Built upon the architecture of DROID-SLAM, RAEM-SLAM retains several components including the Correlation Pyramid, ConvGRU, and Dense Bundle Adjustment (DBA) layer. Additionally, to address the challenges of underwater environments, we introduce two major novel components: the Residual Semantic–Spatial Attention (RSSA) modules and the Local–Global Perception (LGP) block, which are highlighted within red boxes in Figure 1.
The RSSA modules extract dense semantic and spatial features from input images. The Correlation Pyramid computes all-pairs dot-product correlations between adjacent frames, forming a 4D volume, which is then downsampled and interpolated to obtain multi-scale features. These features, along with flow residuals and contextual information, are refined by ConvGRU and passed to the LGP block. The LGP subsequently predicts flow corrections and confidence weights, enabling recursive optimization of pose and depth. The DBA layer integrates Gauss–Newton optimization to solve for joint pose-depth updates using Schur complement decomposition, weighted by learned confidences for enhanced accuracy. The RSSA module improves semantic–spatial awareness and feature robustness through residual connections, while the LGP block enhances the representation of blurred underwater boundaries and models global dependencies using self-attention. By fusing both local and global cues, RAEM-SLAM effectively reduces pose estimation errors and improves robustness in challenging underwater conditions.
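To make this data flow concrete, the following is a minimal, schematic PyTorch sketch of how the stages connect. Every module here is a lightweight stand-in (plain convolutions in place of the RSSA encoders, ConvGRU, and LGP heads), and all class names and channel sizes are assumptions chosen for illustration rather than the paper’s implementation.

```python
import torch
import torch.nn as nn

# Schematic stand-in for the RAEM-SLAM frontend data flow (illustrative only).
class FrontendSketch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Stand-ins for the RSSA-based feature and context encoders (1/8 resolution).
        self.feature_net = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        self.context_net = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        # Stand-in for the ConvGRU hidden-state update.
        self.update = nn.Conv2d(feat_dim + 1, feat_dim, kernel_size=3, padding=1)
        # Stand-ins for the LGP heads: 2-channel flow correction and 2-channel confidence.
        self.flow_head = nn.Conv2d(feat_dim, 2, kernel_size=3, padding=1)
        self.weight_head = nn.Conv2d(feat_dim, 2, kernel_size=3, padding=1)

    def forward(self, frame_i, frame_j):
        f_i = self.feature_net(frame_i)            # dense features of frame i
        f_j = self.feature_net(frame_j)            # dense features of frame j
        ctx = self.context_net(frame_i)            # contextual features of frame i
        # All-pairs dot-product correlation between the two feature maps.
        corr = torch.einsum('bchw,bcuv->bhwuv', f_i, f_j) / f_i.shape[1] ** 0.5
        # Collapse the correlation volume to a per-pixel summary (a crude proxy for
        # the multi-scale correlation-pyramid lookup used in the real system).
        corr_feat = corr.flatten(3).mean(-1).unsqueeze(1)
        hidden = torch.tanh(self.update(torch.cat([ctx, corr_feat], dim=1)))
        r_ij = self.flow_head(hidden)                    # flow correction
        w_ij = torch.sigmoid(self.weight_head(hidden))   # confidence weights in (0, 1)
        return r_ij, w_ij                                # would feed the DBA layer


# Toy usage at the 384 x 512 training resolution mentioned in Section 4.2.
frame_a, frame_b = torch.randn(1, 3, 384, 512), torch.randn(1, 3, 384, 512)
r, w = FrontendSketch()(frame_a, frame_b)
print(r.shape, w.shape)   # both torch.Size([1, 2, 48, 64])
```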

3.2. Physics-Guided Underwater Adaptive Augmentation

To enhance the adaptability and robustness of the RAEM-SLAM system in complex and dynamic underwater environments, and to bridge the significant domain gap between its training data (typically terrestrial scenes) and real underwater scenarios, we propose a Physics-guided Underwater Adaptive Augmentation (PUAA) method, as illustrated in Figure 2. By precisely simulating the physical process of underwater light transmission, PUAA dynamically transforms easily obtainable terrestrial scene datasets (containing RGB images and their corresponding depth maps) into physically realistic pseudo-underwater images for training RAEM-SLAM. Unlike traditional augmentation methods that alter only spatial or low-level image attributes, such as rotation or cropping, PUAA systematically models degradation effects unique to underwater scenes, including light attenuation, scattering, and suspended particle noise.
The physical foundation of PUAA is built upon a modified Jaffe–McGlamery [56] underwater optical imaging model. It quantitatively describes the energy attenuation and scattering effects of light propagating through water and the expression is
$$U(x) = J(x)\, e^{-\alpha(\lambda) d(x)} + \frac{b(\lambda)\, L_0(\lambda)\, e^{-K_d(\lambda)\, d_{scene}}}{\alpha(\lambda)} \left( 1 - e^{-\alpha(\lambda) d(x)} \right)$$
Here, the first term represents the direct attenuation component, while the second denotes the backscattering component. $U(x)$ indicates the simulated radiance intensity observed in the underwater environment, equivalent to the pixel value of the output synthetic pseudo-underwater image. $J(x)$ signifies the original scene radiance intensity, i.e., the pixel value of the input terrestrial scene’s RGB image. $d(x)$ is the pixel depth value, and $d_{scene}$ is the average depth of the scene. The wavelength-dependent parameters $\alpha(\lambda)$, $b(\lambda)$, and $K_d(\lambda)$ represent the attenuation coefficient, scattering coefficient, and diffuse attenuation coefficient, respectively. $L_0(\lambda)$ describes the wavelength-specific light intensity at a reference underwater depth, typically taken as 0 m below the surface.
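As a concrete illustration, the following minimal NumPy sketch applies this imaging model per pixel to a terrestrial RGB image and its depth map. The coefficient values are placeholders chosen only to mimic the stronger attenuation of the red channel; they are not the calibrated Jerlov parameters used by PUAA.

```python
import numpy as np

def synthesize_underwater(rgb, depth, alpha, b, kd, L0, d_scene=None):
    """Apply the modified Jaffe-McGlamery model above per pixel (sketch).

    rgb     : HxWx3 float image in [0, 1], the terrestrial radiance J(x)
    depth   : HxW float depth map d(x)
    alpha, b, kd, L0 : per-channel (R, G, B) optical coefficients
    d_scene : average scene depth; defaults to the mean of the depth map
    """
    if d_scene is None:
        d_scene = float(depth.mean())
    alpha, b, kd, L0 = (np.asarray(v, dtype=np.float64) for v in (alpha, b, kd, L0))
    d = depth[..., None]                                      # broadcast depth over channels

    direct = rgb * np.exp(-alpha * d)                         # direct attenuation term
    backscatter = (b * L0 * np.exp(-kd * d_scene) / alpha) * (1.0 - np.exp(-alpha * d))
    return np.clip(direct + backscatter, 0.0, 1.0)


# Illustrative usage with placeholder coefficients (stronger red attenuation).
rgb = np.random.rand(480, 640, 3)
depth = np.random.uniform(1.0, 10.0, size=(480, 640))
underwater = synthesize_underwater(
    rgb, depth,
    alpha=[0.45, 0.12, 0.08],   # attenuation coefficient per channel (placeholder)
    b=[0.05, 0.08, 0.10],       # scattering coefficient per channel (placeholder)
    kd=[0.40, 0.15, 0.10],      # diffuse attenuation coefficient (placeholder)
    L0=[0.8, 0.9, 1.0],         # reference light intensity per channel (placeholder)
)
```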
Meanwhile, to fully simulate the complexity and diversity of real underwater environments, PUAA implements a multi-level physical degradation mechanism. First, according to the training strategy, a specific water type with distinct optical properties is selected from the Jerlov water types [57,58]. For instance, Type I clear oceanic waters exhibit low attenuation and scattering, while Type III turbid oceanic waters are characterized by high attenuation, strong scattering, and significant particle concentration. This selected water type is then assigned to the training sequence, and a ±10% random perturbation is applied to the corresponding fundamental optical parameters to reflect small-scale natural variations. Second, to model forward-scattering blur from suspended particles, PUAA generates a horizontal motion blur kernel whose length is determined by the selected Jerlov type, with longer kernels used for the more turbid Type III waters. The kernel is rotated by a random angle between 0 and 180 degrees to form a Point Spread Function (PSF) matrix, which is applied to each channel of the synthetic image. Finally, to emulate suspended particle noise in the water body, random Gaussian noise is added to the synthetic image.
During the training stage, PUAA adopts a hybrid progressive approach to gradually enhance RAEM-SLAM’s understanding of underwater scenes. Initially, the model is pre-trained using original terrestrial data to establish fundamental scene understanding. Subsequently, the pseudo-underwater image sequences are used to fine-tune the model’s weights. Specifically, for each scene in the training dataset, its sequences are divided into three equal parts. These parts are transformed by PUAA into pseudo underwater image sequences with different turbidity levels—clear water (Jerlov type I, 5-pixel blur, Gaussian noise σ = 0.005), moderate water (Jerlov type II, 10-pixel blur, Gaussian noise σ = 0.01), and turbid water (Jerlov type III, 15-pixel blur, Gaussian noise σ = 0.02). The refinement phase progressively shifts training focus from clear-water to turbid-water sequences, and the noise intensity synchronously increases with the rising turbidity level. This strategy preserves the understanding of fundamental scenes while enabling gradual adaptation to underwater degradation characteristics, achieving a smooth domain transition. Finally, RAEM-SLAM maintains its performance in other scenarios while enhancing its robustness and adaptability in underwater environments.
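The degradation levels quoted above can be expressed as a small configuration table plus a blur-and-noise step. The sketch below follows the stated Jerlov types, blur lengths, and Gaussian noise levels; the PSF construction and other implementation details (for example, the ±10% parameter perturbation, which is omitted here) are simplified assumptions.

```python
import numpy as np
from scipy import ndimage, signal

# Degradation presets stated in Section 3.2: (Jerlov type, blur length in pixels, noise sigma).
TURBIDITY_LEVELS = {
    "clear":    {"jerlov": "I",   "blur_len": 5,  "noise_sigma": 0.005},
    "moderate": {"jerlov": "II",  "blur_len": 10, "noise_sigma": 0.01},
    "turbid":   {"jerlov": "III", "blur_len": 15, "noise_sigma": 0.02},
}

def motion_blur_psf(length, angle_deg):
    """Horizontal line kernel of the given length, rotated by a random angle (PSF)."""
    psf = np.zeros((length, length))
    psf[length // 2, :] = 1.0
    psf = ndimage.rotate(psf, angle_deg, reshape=False, order=1)
    return psf / psf.sum()

def degrade(image, level, rng=None):
    """Apply forward-scattering blur and suspended-particle noise to a synthetic image."""
    rng = np.random.default_rng() if rng is None else rng
    cfg = TURBIDITY_LEVELS[level]
    psf = motion_blur_psf(cfg["blur_len"], rng.uniform(0.0, 180.0))
    blurred = np.stack(
        [signal.fftconvolve(image[..., c], psf, mode="same") for c in range(3)], axis=-1
    )
    noisy = blurred + rng.normal(0.0, cfg["noise_sigma"], size=blurred.shape)
    return np.clip(noisy, 0.0, 1.0)


# Example: turn a pseudo-underwater frame into its "turbid" (Jerlov type III) variant.
frame = np.random.rand(480, 640, 3)
turbid_frame = degrade(frame, "turbid")
```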

3.3. Residual Semantic–Spatial Attention Module

During the feature extraction stage of RAEM-SLAM, a dual-branch encoder, consisting of a feature network and a context network, is employed to extract dense feature maps and contextual representations. Both networks comprise six Residual Semantic–Spatial Attention (RSSA) modules, with max-pooling layers inserted after every two modules for downsampling. As a result, dense features are generated at 1/8 the resolution of the input underwater scene data. The RSSA module, as illustrated in Figure 3, utilizes a parallel attention structure that extracts highly discriminative features by integrating both semantic and spatial information. Specifically, the input feature $x \in \mathbb{R}^{H \times W \times C}$ is processed through two consecutive 3 × 3 convolutional layers, as defined by
$$x_1 = \mathrm{BatchNorm}\!\left( \delta\!\left( \mathrm{Conv}_{3\times 3}(x) \right) \right), \qquad x_2 = \mathrm{BatchNorm}\!\left( \delta\!\left( \mathrm{Conv}_{3\times 3}(x_1) \right) \right)$$
where $\mathrm{Conv}_{3\times 3}(\cdot)$ denotes a convolution operation with a 3 × 3 kernel, $\delta(\cdot)$ represents the ReLU activation function, and $\mathrm{BatchNorm}$ refers to batch normalization. After feature extraction, the resulting feature maps are enhanced by both the Semantic Attention Module and the Spatial Attention Module.
(1) Semantic Attention Module: The semantic attention module analyzes inter-channel relationships to dynamically adjust channel-wise weights, emphasizing task-relevant features. Given an input feature map $x_2 \in \mathbb{R}^{H \times W \times C}$, both average pooling and max pooling are applied across the spatial dimensions. The average pooling captures global contextual statistics, while the max pooling highlights the most salient activation responses, providing complementary information. The pooled features are passed through a shared multi-layer perceptron (MLP) to generate two feature descriptors. These descriptors are then summed and passed through a sigmoid activation function to produce the final semantic attention weights $S_{semantic}$. The input feature map is subsequently reweighted through element-wise multiplication to obtain the enhanced feature map $E_{semantic}$. The output retains the same shape as the original input $x_2$, i.e., $\mathbb{R}^{H \times W \times C}$. The formulation is as follows:
$$S_{semantic}(x_2) = \sigma\!\left( \Phi(\mathrm{Avg}(x_2)) \oplus \Phi(\mathrm{Max}(x_2)) \right), \qquad E_{semantic} = x_2 \otimes S_{semantic}(x_2)$$
where $\sigma(\cdot)$ denotes the sigmoid function, $\Phi$ is a shared-weight MLP, $\oplus$ denotes element-wise addition, $\otimes$ denotes element-wise multiplication, and $\mathrm{Avg}$ and $\mathrm{Max}$ represent average pooling and max pooling operations, respectively.
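A minimal PyTorch sketch of this semantic (channel) attention branch is given below; the class name and the MLP reduction ratio are assumptions for this example, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Sketch of the semantic (channel) attention branch described above: a shared
    MLP processes the average- and max-pooled descriptors, their sum passes through
    a sigmoid, and the resulting channel weights rescale the input feature map."""

    def __init__(self, channels, reduction=16):   # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))       # descriptor from global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))        # descriptor from global max pooling
        weights = torch.sigmoid(avg + mx)        # S_semantic, shape (B, C)
        return x * weights[:, :, None, None]     # E_semantic, same shape as x
```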
(2) Spatial Attention Module: The spatial attention module computes spatial attention weights to highlight informative regions in the feature map, such as edges and motion boundaries, while suppressing irrelevant or noisy background regions. Given an input feature map $x_2 \in \mathbb{R}^{H \times W \times C}$, both average pooling and max pooling are applied along the channel axis to generate two 2D spatial descriptors that emphasize global and local responses, respectively. These descriptors are concatenated and passed through a convolutional layer with a 7 × 7 kernel to produce the spatial attention weights $S_{spatial}$. These weights are then applied to the input feature map via element-wise multiplication to obtain the enhanced output $E_{spatial}$. The operations are defined as
$$S_{spatial}(x_2) = \sigma\!\left( \mathrm{Conv}_{7\times 7}\!\left( [\mathrm{Avg}(x_2);\, \mathrm{Max}(x_2)] \right) \right), \qquad E_{spatial} = x_2 \otimes S_{spatial}(x_2)$$
where $\sigma(\cdot)$ denotes the sigmoid function and $\mathrm{Conv}_{7\times 7}(\cdot)$ represents a convolution with a 7 × 7 kernel.
Finally, the feature representations refined by the semantic and spatial attention modules are fused through element-wise addition to produce the final attention-enhanced output $E$, as expressed by
$$E = E_{semantic} \oplus E_{spatial}$$
where $\oplus$ denotes element-wise addition. This design enhances the model’s ability to capture complex spatial structures and positional relationships. The spatial attention mechanism complements semantic guidance by emphasizing meaningful locations and suppressing disturbances from textureless or noisy areas. This joint attention strategy refines feature representations at multiple levels, improving the system’s robustness in underwater environments characterized by blurred boundaries and ambiguous textures.
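Combining the two branches, the sketch below assembles a simplified RSSA block (reusing the SemanticAttention class from the previous sketch): two 3 × 3 convolutions followed by the parallel semantic and spatial branches, fused by element-wise addition. Channel sizes are assumptions, and the module-level residual connection mentioned in Section 3.1 is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention branch: channel-wise average and max maps
    are concatenated and passed through a 7x7 convolution, then a sigmoid."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                   # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)                    # (B, 1, H, W)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights                                  # E_spatial


class RSSASketch(nn.Module):
    """Simplified RSSA block: Conv -> ReLU -> BatchNorm (twice), then parallel
    semantic and spatial attention fused by element-wise addition."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(out_ch),
        )
        self.semantic = SemanticAttention(out_ch)   # class from the previous sketch
        self.spatial = SpatialAttention()

    def forward(self, x):
        x2 = self.convs(x)
        return self.semantic(x2) + self.spatial(x2)         # E = E_semantic (+) E_spatial


# Example: one RSSA block mapping a 3-channel input to 32 feature channels.
x = torch.randn(1, 3, 64, 64)
print(RSSASketch(3, 32)(x).shape)   # torch.Size([1, 32, 64, 64])
```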

3.4. Local–Global Perception Block

During the update stage of DROID-SLAM, the proposed framework first inputs the correlation features, optical flow features, and contextual features into the GRU module, which generates an updated hidden state $h_k$. This hidden state is then passed through a 3 × 3 convolutional layer to produce two key outputs: the flow refinement $r_{ij} \in \mathbb{R}^{H \times W \times 2}$ and the corresponding confidence weight $w_{ij} \in \mathbb{R}^{H \times W \times 2}$. These outputs are then provided as inputs to the Dense Bundle Adjustment (DBA) layer, which jointly optimizes camera poses and inverse depths.
However, in complex underwater environments, challenges such as low illumination, dynamic interference, and intricate textures significantly complicate the process. The use of fixed-scale 3 × 3 convolutions at this stage suffers from limited receptive fields, making it difficult to effectively capture feature distributions distorted by suspended particles. Furthermore, local convolutions lack the capability to model the scene comprehensively, limiting the effectiveness of subsequent pose and depth refinements. To address these issues, we propose the Local–Global Perception Block, as illustrated in Figure 4. This module combines multi-scale feature fusion with a global self-attention mechanism. Multiple parallel heterogeneous convolutional branches are employed to construct hierarchical receptive fields that enhance the extraction of rich local features at different scales. Simultaneously, the self-attention mechanism captures broader global contextual representations in underwater scenes. By integrating both local and global features, the module enables more accurate estimation of optical flow corrections and confidence weights, thereby improving trajectory estimation accuracy under challenging underwater conditions.

3.4.1. Multi-Scale Local Perception

The proposed multi-scale local perception module utilizes multiple parallel heterogeneous convolutional branches to fully extract local details at different scales. Its structure is illustrated in the blue part of Figure 4. Specifically, the hidden state output from the GRU is processed through four parallel branches, each designed to capture local features with distinct receptive fields. The first branch applies a 1 × 1 convolution kernel, which preserves high-resolution local details while preventing spatial information loss due to downsampling. The second branch employs a 3 × 3 convolution kernel to capture basic local features and short-range dependencies. The formulation for this operation is as follows:
$$F_1 = \mathrm{Conv}_{1\times 1}(h_k)$$
$$F_2 = \mathrm{Conv}_{3\times 3}(h_k)$$
where $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes a standard 1 × 1 convolution operation, and $\mathrm{Conv}_{3\times 3}(\cdot)$ denotes a standard 3 × 3 convolution operation. To avoid the parameter redundancy typically introduced by large convolutional kernels, we incorporate dilated convolutions to maintain an equivalent receptive field with reduced complexity. The third branch employs a 3 × 3 dilated convolution with a dilation rate of 2, effectively achieving a receptive field of 5 × 5 through sparse sampling. This design preserves sensitivity to mid-scale features while reducing computational cost. The receptive field size is calculated as follows:
$$RF_{size} = (K - 1) \times rate + 1$$
where $K$ denotes the kernel size, and $rate$ represents the dilation rate.
The fourth branch uses a dilation rate of 3, further expanding the effective receptive field to 7 × 7, which enables efficient modeling of broader contextual dependencies. The calculations are as follows:
$$F_3 = \mathrm{Conv}_{3\times 3}^{dil=2}(h_k)$$
$$F_4 = \mathrm{Conv}_{3\times 3}^{dil=3}(h_k)$$
where $\mathrm{Conv}_{3\times 3}^{dil=2}(\cdot)$ denotes a 3 × 3 convolution operation with a dilation rate of 2.
Finally, the four local feature maps are combined through element-wise addition and enhanced with a residual connection to improve gradient flow. The output is the final local multi-scale feature representation. The computation is defined as
$$F_{local} = h_k \oplus \sum_{i=1}^{4} F_i$$
where $F_{local}$ denotes the fused local multi-scale features, $h_k$ is the initial hidden state output from the GRU (i.e., the input to this module), and both $\oplus$ and $\sum(\cdot)$ represent element-wise addition operations.
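A compact PyTorch sketch of the four-branch local perception described above is shown below. Channel counts are assumptions, and the padding is chosen so that all branches preserve the spatial resolution of the hidden state.

```python
import torch
import torch.nn as nn

class MultiScaleLocalPerception(nn.Module):
    """Sketch of the four parallel branches: 1x1 conv, 3x3 conv, and two dilated
    3x3 convs (rates 2 and 3, i.e. effective 5x5 and 7x7 receptive fields), fused
    by element-wise addition with a residual connection to the GRU hidden state."""

    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.branch2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=2)
        self.branch4 = nn.Conv2d(channels, channels, kernel_size=3, padding=3, dilation=3)

    def forward(self, h_k):                       # h_k: GRU hidden state, (B, C, H, W)
        f1 = self.branch1(h_k)                    # fine local details
        f2 = self.branch2(h_k)                    # basic local features
        f3 = self.branch3(h_k)                    # effective receptive field 5x5
        f4 = self.branch4(h_k)                    # effective receptive field 7x7
        return h_k + f1 + f2 + f3 + f4            # F_local with residual connection


# Example with a 128-channel hidden state at 1/8 resolution of a 384x512 input.
h = torch.randn(1, 128, 48, 64)
print(MultiScaleLocalPerception(128)(h).shape)    # torch.Size([1, 128, 48, 64])
```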

3.4.2. Global Perception Block

The structure of the Global Perception Block is shown in the green part of Figure 4. We first apply global spatial encoding to the input hidden state $h_k \in \mathbb{R}^{H \times W \times C}$ to extract full-resolution spatial features. To reduce computation while preserving representation quality, the input feature map is divided into $N$ non-overlapping patches of size $P \times P$, where each patch is flattened and projected to obtain a 2D feature sequence:
$$f_i \in \mathbb{R}^{P^2 \times C}, \quad i = 1, 2, \ldots, N, \qquad N = \frac{H \times W}{P^2}$$
where $P$ defines the patch size and $f_i$ denotes the flattened patch feature. To enhance the representation capability, each patch feature is projected into a latent embedding space using a learnable linear projection:
$$f_i^{z} = F_E(f_i), \quad i \in [1, N]$$
where $F_E \in \mathbb{R}^{C \times (P^2 C)}$ denotes the learnable projection layer, and $f_i^{z}$ is the embedded patch feature. The resulting token embeddings $f_i^{z}$ are concatenated to form the sequence. To integrate position information, learnable 2D positional encodings $E_{pos} \in \mathbb{R}^{N \times C}$ are added to the token embeddings to obtain $Z_{l-1}$:
$$Z_{l-1} = \left[ f_1^{z}, f_2^{z}, \ldots, f_N^{z} \right] + E_{pos}$$
where $Z_{l-1} \in \mathbb{R}^{N \times C}$ represents the globally encoded spatial sequence.
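The patch partitioning, linear projection, and positional encoding can be sketched as follows; the patch size and the 1/8-resolution feature dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of the global spatial encoding above: the hidden state is split into
    non-overlapping PxP patches, each patch is flattened and linearly projected
    (F_E), and learnable positional encodings (E_pos) are added. The embedding
    keeps C channels per token, as in the text; the patch size is an assumption."""

    def __init__(self, channels, patch_size=4, height=48, width=64):
        super().__init__()
        self.p = patch_size
        num_patches = (height // patch_size) * (width // patch_size)
        self.proj = nn.Linear(patch_size * patch_size * channels, channels)  # F_E
        self.pos = nn.Parameter(torch.zeros(1, num_patches, channels))       # E_pos

    def forward(self, h_k):                               # h_k: (B, C, H, W)
        b, c, hgt, wid = h_k.shape
        # Carve out PxP patches, then flatten each patch into a single vector.
        patches = h_k.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * self.p * self.p)
        return self.proj(patches) + self.pos              # Z_{l-1}: (B, N, C)


tokens = PatchEmbedding(channels=128)(torch.randn(1, 128, 48, 64))
print(tokens.shape)                                       # torch.Size([1, 192, 128])
```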
Next, the encoded feature $Z_{l-1}$ is normalized to obtain the feature $F_{l-1}$, which is fed into a multi-head self-attention (MHSA) module for further transformation. The multi-head attention mechanism splits the input across multiple self-attention heads, enabling the network to capture diverse patterns and richer global features, as illustrated in Figure 5. Specifically, the input feature $F_{l-1}$ is first linearly projected into query, key, and value representations:
$$Q = F_{l-1} W_q, \qquad K = F_{l-1} W_k, \qquad V = F_{l-1} W_v$$
where $W_q$, $W_k$, and $W_v$ are learnable projection matrices. The attention map is computed by applying the scaled dot-product operation followed by a softmax normalization:
$$M = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right)$$
where $d_k$ is the dimension of the key vectors. The final attention output is obtained by applying the attention map to the value matrix:
$$Z_t = \mathrm{MHSA}(Q, K, V) = M V$$
Finally, a two-layer Multi-Layer Perceptron (MLP) is applied to further transform the attended features. A residual connection and layer normalization are incorporated to improve training stability:
$$Z_l = \mathrm{MLP}\!\left( \mathrm{Norm}(Z_t) \right) + Z_t$$
where $\mathrm{Norm}(\cdot)$ denotes layer normalization, and $Z_l$ is the final global feature representation after attention and transformation.
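The attention and MLP stages above map onto a short PyTorch sketch. The number of heads and the MLP expansion ratio are assumptions, and `nn.MultiheadAttention` is used as a stand-in for the MHSA formulation.

```python
import torch
import torch.nn as nn

class GlobalPerception(nn.Module):
    """Sketch of the global branch: layer-normalised tokens go through multi-head
    self-attention, and a two-layer MLP with a residual connection produces Z_l."""

    def __init__(self, dim=128, num_heads=4, mlp_ratio=2):   # heads/ratio are assumptions
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):                       # z: token sequence Z_{l-1}, (B, N, C)
        f = self.norm1(z)                       # F_{l-1} = Norm(Z_{l-1})
        z_t, _ = self.attn(f, f, f)             # Z_t = MHSA(Q, K, V) with Q = K = V = F_{l-1}
        return self.mlp(self.norm2(z_t)) + z_t  # Z_l = MLP(Norm(Z_t)) + Z_t


z_l = GlobalPerception()(torch.randn(1, 192, 128))
print(z_l.shape)                                # torch.Size([1, 192, 128])
```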

3.5. Loss Function

During network training, we jointly supervise the model using optical flow loss and pose loss to optimize the estimation process. For optical flow supervision, we compute the L2 distance between the predicted flow and the ground truth flow, which is derived from depth and pose, using image warping to construct the flow loss. The formulation is as follows:
$$L_{flow} = \frac{1}{N} \sum_{i=1}^{N} \left\| p_{pred} - p_{truth} \right\|_2^2$$
where $p_{pred}$ denotes the predicted optical flow derived from pose and depth, $p_{truth}$ represents the ground truth flow, and $\|\cdot\|_2$ denotes the Euclidean distance.
For pose estimation, we adopt the SE(3) geodesic distance to measure the deviation between the predicted pose $G_i$ and the ground truth pose $T_i$. The pose loss is defined as
$$L_{pose} = \sum_{i=1}^{N} \left\| \mathrm{Log}_{SE(3)}\!\left( T_i^{-1} G_i \right) \right\|_2^2$$
where $\mathrm{Log}_{SE(3)}(\cdot)$ denotes the logarithmic map on the SE(3) Lie group, and $\|\cdot\|_2$ represents the Euclidean distance.
Finally, we aggregate the flow and pose losses from each stage $k$, weighted by a decay factor $\gamma^{K-k}$, to ensure that later-stage predictions receive stronger supervision. The total loss is formulated as
$$L_{total} = \sum_{k=1}^{K} \gamma^{K-k} \left( L_{flow}^{k} + L_{pose}^{k} \right)$$
where $K$ denotes the total number of iterative updates.
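A simplified sketch of this supervision scheme is given below. The decay factor value is an assumption, and the pose term uses a plain L2 distance on pose vectors as a stand-in for the SE(3) geodesic (Lie-group logarithm) distance used in the paper.

```python
import torch

def total_loss(flow_preds, pose_preds, flow_gt, pose_gt, gamma=0.8):
    """Sketch of the training objective: per-iteration flow and pose losses are
    summed with exponentially decaying weights gamma**(K - k), so later updates
    receive stronger supervision. gamma = 0.8 is an assumed value."""
    K = len(flow_preds)
    loss = 0.0
    for k, (flow_k, pose_k) in enumerate(zip(flow_preds, pose_preds), start=1):
        w = gamma ** (K - k)
        l_flow = (flow_k - flow_gt).square().sum(dim=-1).mean()   # flow loss (L2)
        l_pose = (pose_k - pose_gt).square().sum(dim=-1).mean()   # stand-in for SE(3) geodesic loss
        loss = loss + w * (l_flow + l_pose)
    return loss


# Toy usage: K = 4 iterative updates, dense flow (H, W, 2) and 6-DoF pose vectors.
flows = [torch.randn(48, 64, 2) for _ in range(4)]
poses = [torch.randn(7, 6) for _ in range(4)]
print(total_loss(flows, poses, torch.randn(48, 64, 2), torch.randn(7, 6)))
```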

4. Experiments

4.1. Evaluation Metrics

In our experiments, we use the Absolute Trajectory Error (ATE) [59] as the primary metric to evaluate the accuracy of trajectory estimation. Based on temporal alignment, ATE computes the Euclidean distance between the estimated trajectory and the ground-truth trajectory at each corresponding timestamp. This metric allows for a quantitative analysis of both the overall accuracy and long-term drift of the system. The ATE is defined as
$$ATE = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{trans}\!\left( T_{truth}^{i} \right) - \mathrm{trans}\!\left( T_{est}^{i} \right) \right\|_2$$
where $\mathrm{trans}(\cdot)$ extracts the translational component of a pose, $T_{truth}^{i}$ denotes the ground-truth pose at time step $i$, $T_{est}^{i}$ represents the estimated pose, $\|\cdot\|_2$ is the Euclidean distance, and $N$ is the total number of poses.
Due to the inherent scale ambiguity in monocular SLAM methods, directly computing the ATE would incorporate scale errors. Therefore, prior to calculating the ATE, we eliminate scale ambiguity by aligning the entire estimated trajectory with the ground truth trajectory through Sim(3) similarity transformation. Subsequently, we employ the Root Mean Square Error (RMSE) to evaluate the ATE. Its calculation formula is as follows:
$$RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{trans}\!\left( T_{truth}^{i} \right) - \mathrm{trans}\!\left( \hat{T}_{est}^{i} \right) \right\|_2^2 }$$
where $\hat{T}_{est}^{i}$ denotes the estimated pose after applying the Sim(3) similarity transformation.
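This evaluation protocol, Sim(3) alignment of the translational components followed by the ATE RMSE, can be sketched in NumPy as follows. The closed-form Umeyama alignment is used here as a standard way to estimate the similarity transformation; it may differ from the exact alignment tool used in the paper.

```python
import numpy as np

def sim3_align(est, gt):
    """Umeyama alignment with scale: return est mapped onto gt by the best-fit
    similarity transform (s, R, t). est, gt: (N, 3) arrays of positions."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / est.shape[0]                      # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:      # guard against reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((e ** 2).sum() / est.shape[0])
    t = mu_g - s * R @ mu_e
    return (s * (R @ est.T)).T + t

def ate_rmse(est, gt):
    """RMSE of the Absolute Trajectory Error after Sim(3) alignment."""
    aligned = sim3_align(est, gt)
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


# Toy check: a scaled, rotated, shifted copy of a trajectory aligns back to ~0 error.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(100, 3)), axis=0)
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
est = 0.5 * (Rz @ gt.T).T + np.array([1.0, -2.0, 0.5])
print(ate_rmse(est, gt))   # approximately 0.0
```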

4.2. Implementation Details

To ensure a fair comparison with the original DROID-SLAM method, we use the same TartanAir dataset [60] for model training. This dataset is specifically designed for SLAM research, offering a variety of complex scenes captured in simulated environments. The network is trained using a single NVIDIA RTX 4090 GPU (24 GB) (Santa Clara, CA, USA), with an input image resolution set at 384 × 512 and a batch size of 1. Each training sample consists of a sequence of seven consecutive video frames as the basic unit. To address the scale ambiguity inherent in monocular SLAM, a ground-truth pose anchoring strategy is adopted. Specifically, the first frame of each training sequence is fixed to the ground-truth coordinate system to eliminate pose uncertainty in 6 degrees of freedom (6DoF). The second frame is also aligned with its ground-truth pose to constrain the trajectory estimation within the space of similarity transformations, thereby mitigating scale drift in monocular systems. This strategy effectively resolves normalization uncertainty during training and improves the stability of gradient optimization.
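The stated training setup can be summarized in a short configuration sketch. Entries not given in the text (optimizer, learning rate) are marked as assumptions, and the pose-anchoring helper is only a schematic illustration of fixing the first two frames of each clip to their ground-truth poses.

```python
# Minimal sketch of the training setup described above (values marked as
# assumptions are not stated in the text).
train_config = {
    "dataset": "TartanAir",
    "image_size": (384, 512),
    "batch_size": 1,
    "clip_length": 7,            # seven consecutive frames per training sample
    "optimizer": "AdamW",        # assumption
    "learning_rate": 1e-4,       # assumption
}

def anchor_poses(pred_poses, gt_poses):
    """Ground-truth pose anchoring (schematic): fix the first frame to remove the
    free 6-DoF gauge and the second frame to constrain scale, so the remaining
    trajectory is estimated within the space of similarity transformations."""
    pred_poses = list(pred_poses)
    pred_poses[0] = gt_poses[0]
    pred_poses[1] = gt_poses[1]
    return pred_poses
```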

4.3. Ablation Study and Computational Efficiency

To evaluate the improvement effect of the proposed modules on the overall system performance, an ablation study was conducted on the archaeological sequences of the AQUALOC dataset. By separately removing the Physics-guided Underwater Adaptive Augmentation (PUAA) during network training, the Residual Semantics–Spatial Attention module (RSSA) in the feature extraction stage, and the Local–Global Perception Block (LGP) in the update operator, we systematically assessed the impact of each component on trajectory estimation accuracy. The experimental results are shown in Table 1, where the red font represents the maximum error value within the same sequence, and the bold font represents the minimum error value. This table reports the Absolute Trajectory Error (ATE) for the system under different configurations across ten sequences (Sequences 1–10). The system achieves optimal performance when all modules are retained (RAEM-SLAM), yielding a mean ATE of 0.084.
Disabling the PUAA module leads to the most significant overall performance degradation on average (Avg ATE = 0.303). However, this high average is primarily driven by an exceptionally large error (ATE = 1.950) observed in Sequence 2; excluding Sequence 2, the average ATE without PUAA across the remaining nine sequences is 0.119. Sequence 2 exhibits uniquely extreme underwater conditions, with many regions of severe scattering, high turbidity, and suspended impurity interference, as shown in Figure 6. The PUAA module effectively simulates these extreme conditions during the training phase through its physics-guided augmentation, thereby mitigating the adverse effects of diverse complex underwater challenges. This demonstrates that, although PUAA’s effect on the other sequences is less pronounced than that of RSSA and LGP, it is irreplaceable in handling extreme degradation scenarios.
Additionally, removing the RSSA module (No RSSA) consistently degrades performance across most sequences, raising the average ATE to 0.181. This demonstrates the fundamental importance of RSSA for robust feature representation in diverse underwater scenes. Similarly, the absence of the LGP module (No LGP) increases the average ATE to 0.127, confirming its vital role in accurate motion perception. In summary, the synergistic optimization of PUAA, RSSA, and LGP collectively enables high trajectory estimation accuracy. The complete system (RAEM-SLAM) integrating all three modules achieves the best performance.
To understand the computational cost associated with this performance gain, we measured the runtime and memory footprint of RAEM-SLAM on these ten archaeological sequences using a single NVIDIA RTX 4090 GPU (24 GB VRAM). The average inference time per sequence was 370 s, corresponding to a real-time frame rate of 26 frames per second (fps). In terms of GPU memory usage, RAEM-SLAM requires approximately 14 GB during inference, compared to the 11 GB required by the baseline DROID-SLAM. We consider the moderate computational cost of RAEM-SLAM a justifiable expense, given the significant gains it provides in accuracy and robustness.

4.4. Comparative Experiment

To demonstrate the superiority of RAEM-SLAM, we compared it against several state-of-the-art Simultaneous Localization and Mapping (SLAM) methods.

4.4.1. Land Scene

Although RAEM-SLAM is specifically designed for underwater environments, its RSSA and LGP modules intrinsically enhance feature extraction and motion perception. Therefore, to evaluate the overall performance and generalization ability of the complete RAEM-SLAM framework, we also conducted experiments on land datasets.
The EuRoC [61] dataset comprises visual–inertial data captured by a micro aerial vehicle (MAV) in an industrial environment and an indoor environment equipped with motion capture systems, providing millimeter-accurate ground truth for algorithm benchmarking. On this dataset, we selected representative traditional SLAM methods (including ORB-SLAM, DSO, SVO, DSM, and ORB-SLAM3) and deep learning-based methods (such as DeepFactors, DeepV2D, TartanVO, and DROID-SLAM) as benchmarks for comparison. For clarity: DSO, SVO, DeepV2D (both variants), TartanVO, and D3VO + DSO are visual odometry methods without loop closure; all other methods include loop closure, which was enabled in our experiments. As shown in Table 2, RAEM-SLAM outperforms all compared methods with an average absolute trajectory error of 0.022. In the table, bold text indicates the minimum error in the same sequence, while “-” denotes tracking failure. Compared to traditional methods, RAEM-SLAM successfully tracked all 11 sequences, with ATE RMSE values ranging from 0.012 to 0.038; its average RMSE is approximately 82.5% lower than that of DSM (0.126), the best-performing traditional method. Compared to deep learning methods, RAEM-SLAM also demonstrates superior performance, achieving the lowest average RMSE across all evaluated methods. Most significantly, RAEM-SLAM achieves a substantial improvement over its baseline, DROID-SLAM (Mono), reducing the average ATE RMSE by 24.1% (from 0.029 to 0.022). These results demonstrate that RAEM-SLAM achieves strong overall performance and generalization in land scenes.

4.4.2. Underwater Scene

For underwater environments, we selected the AQUALOC [68] and AFRL [69] datasets to validate the effectiveness of our method. For both datasets, ground-truth trajectories are computed offline using the COLMAP [70] structure-from-motion software; these trajectories serve as the reference ground truth for evaluating the SLAM systems.
  • AQUALOC dataset
The AQUALOC [68] dataset is a publicly available underwater dataset collected by remotely operated vehicles (ROVs) on the seafloor. Its archaeological section contains 10 sequences collected at depths of several hundred meters, recorded with a 20 Hz monocular camera at a resolution of 968 × 608. For our comparative experiments on this dataset, we selected the following methods: the traditional feature-based ORB-SLAM3 [64], the multi-sensor VINS-Mono [71], which fuses Inertial Measurement Unit (IMU) data, SL-SLAM [40], which combines traditional methods with deep learning, and the baseline DROID-SLAM [18]. Comparative results on the ten underwater archaeological sequences are presented in Table 3. Additionally, Figure 7 shows trajectory comparison diagrams for selected sequences. Figure 7a,c,e,g compare the estimated trajectories of all selected algorithms, with RAEM-SLAM (Ours) shown in solid purple. Figure 7b,d,f,h contrast RAEM-SLAM (Ours) against the ground-truth trajectories, with the ground truth denoted by black dashed lines and the RAEM-SLAM estimates rendered in distinct colors, each color indicating the ATE at the corresponding position.
Across these ten sequences, ORB-SLAM3, VINS-Mono, and SL-SLAM all experienced multiple tracking failures. Relying on handcrafted feature descriptors, ORB-SLAM3 fails to stably generate sufficient feature points in low-texture underwater scenes with dynamic disturbances, inevitably losing trajectory tracking. Benefiting from multi-sensor fusion, VINS-Mono successfully ran on more sequences than ORB-SLAM3; however, in the feature-sparse underwater environment, it struggles to effectively constrain IMU drift, resulting in unsatisfactory estimation accuracy. SL-SLAM integrates the feature extraction network SuperPoint [37] and the matching network LightGlue [38] into the ORB-SLAM3 framework; however, in underwater scenes with extreme visual degradation (such as severe scattering, turbidity, and suspended particles), low texture, and dynamic interference, these feature networks also struggle to consistently extract sufficiently high-quality features, causing tracking failure. In contrast, RAEM-SLAM demonstrates exceptional robustness and precision as an end-to-end SLAM architecture: it achieves the lowest average error (0.084) while successfully processing all sequences, an 82.7% reduction compared to the baseline DROID-SLAM (0.486).
  • AFRL dataset
In order to validate the generalization capability of RAEM-SLAM across diverse complex underwater environments, we conducted comparative experiments on the AFRL dataset [69]. This dataset includes three challenging scenarios: submerged bus, cave, and fake cemetery. Due to severe underwater scattering and insufficient texture features, traditional methods (ORB-SLAM3, VINS-Mono) and hybrid approaches (SL-SLAM) suffered from persistent tracking failures across all sequences in this dataset. Therefore, we focus on comparing against the end-to-end baseline DROID-SLAM. As shown in Table 4, RAEM-SLAM achieves substantially lower trajectory errors than DROID-SLAM across all scenarios. Additionally, Figure 8 provides trajectory comparisons for intuitive assessment of these improvements.
In the cave scene with non-uniform lighting, RAEM-SLAM showed the largest improvement in trajectory estimation (a 61.4% error reduction). As shown in Figure 8a, the Submerged Bus scenario suffers from severe scattering and suspended particle interference; there, RAEM-SLAM completed the entire run without losing trajectory tracking while reducing the ATE by 49.8% (from 2.386 m to 1.198 m). Notably, while RAEM-SLAM achieves higher accuracy than the original DROID-SLAM on the Fake Cemetery data, the estimated trajectory remains unsatisfactory compared to the ground truth. As illustrated in Figure 8f, the main reason is that the loop closure of RAEM-SLAM fails to recognize previously visited areas and therefore cannot correct the accumulated drift. We will focus on addressing this loop closure limitation in future work.
Overall, the experimental results demonstrate that RAEM-SLAM achieves higher accuracy and robustness than DROID-SLAM across the diverse underwater scenes of the AFRL dataset.

5. Conclusions

To address the challenges faced by AUVs operating in complex underwater environments, this paper proposes a robust adaptive end-to-end monocular SLAM framework (RAEM-SLAM) for AUVs. To improve the adaptability of the system in underwater scenarios, we develop a Physics-guided Underwater Adaptive Augmentation framework that synthesizes realistic pseudo-underwater images from terrestrial datasets and employs a hybrid progressive training approach to gradually enhance RAEM-SLAM’s understanding of underwater scenes. A Residual Semantic–Spatial Attention Module is also introduced to resolve feature blurring and semantic ambiguity in underwater images by employing parallel semantic and spatial attention branches. Furthermore, a Local–Global Perception Block is designed by integrating multi-scale convolutions with a Transformer-based architecture to simultaneously capture local details and global contextual dependencies, significantly reducing estimation errors under dynamic disturbances. Built upon these modules, the RAEM-SLAM framework demonstrates superior trajectory accuracy and system stability on real-world underwater datasets. This work presents a promising solution for underwater AUV navigation, with potential applications in ocean exploration, underwater infrastructure inspection, and deep-sea resource development. Future research will focus on enhancing loop closure robustness, improving computational efficiency, and extending the framework to multi-sensor fusion for complex underwater tasks.

Author Contributions

Conceptualization, Y.W. and Y.L.; methodology, Y.L. and X.D.; software, W.L.; validation, Y.W., Y.L. and W.L.; formal analysis, Y.W.; investigation, Y.W.; resources, X.D.; data curation, W.L.; writing—original draft preparation, Y.W.; writing—review and editing, Y.L. and X.D.; visualization, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Palomer, A.; Ridao, P.; Ribas, D. Inspection of an Underwater Structure Using Point-cloud SLAM with an AUV and a Laser Scanner. J. Field Robot. 2019, 36, 1333–1344. [Google Scholar] [CrossRef]
  2. Nauert, F.; Kampmann, P. Inspection and Maintenance of Industrial Infrastructure with Autonomous Underwater Robots. Front. Robot. AI 2023, 10, 1240276. [Google Scholar] [CrossRef]
  3. Allotta, B.; Costanzi, R.; Ridolfi, A.; Colombo, C.; Bellavia, F.; Fanfani, M.; Pazzaglia, F.; Salvetti, O.; Moroni, D.; Pascali, M.A.; et al. The ARROWS Project: Adapting and Developing Robotics Technologies for Underwater Archaeology. IFAC-Pap. 2015, 48, 194–199. [Google Scholar] [CrossRef]
  4. González-García, J.; Gómez-Espinosa, A.; Cuan-Urquizo, E.; García-Valdovinos, L.G.; Salgado-Jiménez, T.; Cabello, J.A.E. Autonomous Underwater Vehicles: Localization, Navigation, and Communication for Collaborative Missions. Appl. Sci. 2020, 10, 1256. [Google Scholar] [CrossRef]
  5. Zhang, B.; Ji, D.; Liu, S.; Zhu, X.; Xu, W. Autonomous Underwater Vehicle Navigation: A Review. Ocean Eng. 2023, 273, 113861. [Google Scholar] [CrossRef]
  6. Civera, J.; Grasa, O.G.; Davison, A.J.; Montiel, J.M.M. 1-Point RANSAC for Extended Kalman Filtering: Application to Real-time Structure from Motion and Visual Odometry. J. Field Robot. 2010, 27, 609–631. [Google Scholar] [CrossRef]
  7. Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13 November 2007; pp. 225–234. [Google Scholar]
  8. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  9. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Proceedings of the Computer Vision—ECCV 2014, 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar]
  10. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
  11. Köser, K.; Frese, U. Challenges in Underwater Visual Navigation and SLAM. In AI Technology for Underwater Robots; Kirchner, F., Straube, S., Kühn, D., Hoyer, N., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 125–135. ISBN 978-3-030-30683-0. [Google Scholar]
  12. Quattrini Li, A.; Coskun, A.; Doherty, S.M.; Ghasemlou, S.; Jagtap, A.S.; Modasshir, M.; Rahman, S.; Singh, A.; Xanthidis, M.; O’Kane, J.M.; et al. Experimental Comparison of Open Source Vision-Based State Estimation Algorithms. In Proceedings of the 2016 International Symposium on Experimental Robotics, Tokyo, Japan, 3–6 October 2016; Kulić, D., Nakamura, Y., Khatib, O., Venture, G., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 775–786. [Google Scholar]
  13. Joshi, B.; Rahman, S.; Kalaitzakis, M.; Cain, B.; Johnson, J.; Xanthidis, M.; Karapetyan, N.; Hernandez, A.; Li, A.Q.; Vitzilaios, N.; et al. Experimental Comparison of Open Source Visual-Inertial-Based State Estimation Algorithms in the Underwater Domain. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019. [Google Scholar]
  14. Tang, J.; Ericson, L.; Folkesson, J.; Jensfelt, P. GCNv2: Efficient Correspondence Prediction for Real-Time SLAM. IEEE Robot. Autom. Lett. 2019, 4, 3505–3512. [Google Scholar] [CrossRef]
  15. Luo, H.; Liu, Y.; Guo, C.; Li, Z.; Song, W. SuperVINS: A Real-Time Visual-Inertial SLAM Framework for Challenging Imaging Conditions. IEEE Sens. J. 2025, 25, 26042–26050. [Google Scholar] [CrossRef]
  16. Wang, Y.; Xu, B.; Fan, W.; Xiang, C. A Robust and Efficient Loop Closure Detection Approach for Hybrid Ground/Aerial Vehicles. Drones 2023, 7, 135. [Google Scholar] [CrossRef]
  17. Chen, C.; Wang, B.; Lu, C.X.; Trigoni, N.; Markham, A. Deep Learning for Visual Localization and Mapping: A Survey. IEEE Trans. Neural Netw. Learning Syst. 2023, 35, 1–21. [Google Scholar] [CrossRef]
  18. Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
  19. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12776–12786. [Google Scholar]
  20. Favorskaya, M.N. Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends. Electronics 2023, 12, 2006. [Google Scholar] [CrossRef]
  21. Qi, Q.; Li, K.; Zheng, H.; Gao, X.; Hou, G.; Sun, K. SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception. IEEE Trans. Image Process. 2022, 31, 6816–6830. [Google Scholar] [CrossRef]
  22. Chen, T.; Wang, N.; Chen, Y.; Kong, X.; Lin, Y.; Zhao, H.; Karimi, H.R. Semantic Attention and Relative Scene Depth-Guided Network for Underwater Image Enhancement. Eng. Appl. Artif. Intell. 2023, 123, 106532. [Google Scholar] [CrossRef]
  23. Liang, J.; Xu, F.; Yu, S. A Multi-Scale Semantic Attention Representation for Multi-Label Image Recognition with Graph Networks. Neurocomputing 2022, 491, 14–23. [Google Scholar] [CrossRef]
  24. Qi, L.; Qin, X.; Gao, F.; Dong, J.; Gao, X. SAWU-Net: Spatial Attention Weighted Unmixing Network for Hyperspectral Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5505205. [Google Scholar] [CrossRef]
  25. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating Spatial Attention and Standard Convolutional Operation. arXiv 2023, arXiv:2304.03198. [Google Scholar]
  26. Bai, W.; Zhang, Y.; Wang, L.; Liu, W.; Hu, J.; Huang, G. SADGFeat: Learning Local Features with Layer Spatial Attention and Domain Generalization. Image Vis. Comput. 2024, 146, 105033. [Google Scholar] [CrossRef]
  27. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  29. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  30. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  31. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z.-X. Scale-Aware Trident Networks for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6053–6062. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  33. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  34. Kim, H.; Yim, C. Swin Transformer Fusion Network for Image Quality Assessment. IEEE Access 2024, 12, 57741–57754. [Google Scholar] [CrossRef]
  35. Li, D.; Shi, X.; Long, Q.; Liu, S.; Yang, W.; Wang, F.; Wei, Q.; Qiao, F. DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020; pp. 4958–4965. [Google Scholar]
  36. Bruno, H.M.S.; Colombini, E.L. LIFT-SLAM: A Deep-Learning Feature-Based Monocular Visual SLAM Method. Neurocomputing 2021, 455, 97–110. [Google Scholar] [CrossRef]
  37. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 337–33712. [Google Scholar]
  38. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 17581–17592. [Google Scholar]
  39. Peng, Q.; Xiang, Z.; Fan, Y.; Zhao, T.; Zhao, X. RWT-SLAM: Robust Visual SLAM for Highly Weak-Textured Environments. arXiv 2022, arXiv:2207.03539. [Google Scholar]
  40. Xiao, Z.; Li, S. A Real-Time, Robust and Versatile Visual-SLAM Framework Based on Deep Learning Networks. arXiv 2024, arXiv:2405.03413. [Google Scholar]
  41. Zhao, Z.; Wu, C. Light-SLAM: A Robust Deep-Learning Visual SLAM System Based on LightGlue under Challenging Lighting Conditions. IEEE Trans. Intell. Transp. Syst. 2025, 26, 9918–9931. [Google Scholar] [CrossRef]
  42. Shi, Y.; Li, R.; Shi, Y.; Liang, S. A Robust and Lightweight Loop Closure Detection Approach for Challenging Environments. Drones 2024, 8, 322. [Google Scholar] [CrossRef]
  43. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An Efficient Alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6 November 2011; pp. 2564–2571. [Google Scholar]
  44. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2043–2050. [Google Scholar]
  45. Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 7286–7291. [Google Scholar]
  46. Zacchini, L.; Bucci, A.; Franchi, M.; Costanzi, R.; Ridolfi, A. Mono Visual Odometry for Autonomous Underwater Vehicles Navigation. In Proceedings of the OCEANS 2019—Marseille, Marseille, France, 17–20 June 2019; pp. 1–5. [Google Scholar]
  47. Ferrera, M.; Moras, J.; Trouvé-Peloux, P.; Creuze, V. Real-Time Monocular Visual Odometry for Turbid and Dynamic Underwater Environments. Sensors 2019, 19, 687. [Google Scholar] [CrossRef]
  48. Leonardi, M.; Stahl, A.; Brekke, E.F.; Ludvigsen, M. UVS: Underwater Visual SLAM—A Robust Monocular Visual SLAM System for Lifelong Underwater Operations. Auton. Robot. 2023, 47, 1367–1385. [Google Scholar] [CrossRef]
  49. Yang, J.; Gong, M.; Nair, G.; Lee, J.H.; Monty, J.; Pu, Y. Knowledge Distillation for Feature Extraction in Underwater VSLAM. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May 2023; pp. 5163–5169. [Google Scholar]
  50. Burguera, A.; Bonin-Font, F.; Font, E.G.; Torres, A.M. Combining Deep Learning and Robust Estimation for Outlier-Resilient Underwater Visual Graph SLAM. J. Mar. Sci. Eng. 2022, 10, 511. [Google Scholar] [CrossRef]
  51. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  52. Prudviraj, J.; Vishnu, C.; Mohan, C.K. M-FFN: Multi-Scale Feature Fusion Network for Image Captioning. Appl. Intell. 2022, 52, 14711–14723. [Google Scholar] [CrossRef]
  53. Wu, Y.; Yao, Q.; Fan, X.; Gong, M.; Ma, W.; Miao, Q. PANet: A Point-Attention Based Multi-Scale Feature Fusion Network for Point Cloud Registration. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  54. Wang, R.; Lei, T.; Cui, R.; Zhang, B.; Meng, H.; Nandi, A.K. Medical Image Segmentation Using Deep Learning: A Survey. IET Image Process. 2022, 16, 1243–1267. [Google Scholar] [CrossRef]
  55. Liu, J.; Yang, H.; Zhou, H.-Y.; Yu, L.; Liang, Y.; Yu, Y.; Zhang, S.; Zheng, H.; Wang, S. Swin-UMamba†: Adapting Mamba-Based Vision Foundation Models for Medical Image Segmentation. IEEE Trans. Med. Imaging 2024, 2512913. [Google Scholar] [CrossRef]
  56. Narasimhan, S.G.; Nayar, S.K. Vision and the Atmosphere. Int. J. Comput. Vis. 2002, 48, 233–254. [Google Scholar] [CrossRef]
  57. Jerlov, N.G. Optical Oceanography; Elsevier Oceanography Series; Elsevier: Amsterdam, The Netherlands, 1968; Volume 5, ISBN 978-0-444-40320-9. [Google Scholar]
  58. Solonenko, M.G.; Mobley, C.D. Inherent Optical Properties of Jerlov Water Types. Appl. Opt. 2015, 54, 5392. [Google Scholar] [CrossRef]
  59. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  60. Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. TartanAir: A Dataset to Push the Limits of Visual SLAM. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020. [Google Scholar]
  61. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC Micro Aerial Vehicle Datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
  62. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems. IEEE Trans. Robot. 2017, 33, 249–265. [Google Scholar] [CrossRef]
  63. Zubizarreta, J.; Aguinaga, I.; Montiel, J.M.M. Direct Sparse Mapping. IEEE Trans. Robot. 2020, 36, 1363–1370. [Google Scholar] [CrossRef]
  64. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  65. Czarnowski, J.; Laidlow, T.; Clark, R.; Davison, A.J. DeepFactors: Real-Time Probabilistic Dense Monocular SLAM. IEEE Robot. Autom. Lett. 2020, 5, 721–728. [Google Scholar] [CrossRef]
  66. Teed, Z.; Deng, J. DeepV2D: Video to Depth with Differentiable Structure from Motion. arXiv 2020, arXiv:1812.04605. [Google Scholar]
  67. Wang, W.; Hu, Y.; Scherer, S. TartanVO: A Generalizable Learning-Based VO. In Proceedings of the 2020 Conference on Robot Learning, Online, 16–18 November 2020. [Google Scholar]
  68. Ferrera, M.; Creuze, V.; Moras, J.; Trouvé-Peloux, P. AQUALOC: An Underwater Dataset for Visual–Inertial–Pressure Localization. Int. J. Robot. Res. 2019, 38, 1549–1559. [Google Scholar] [CrossRef]
  69. Rahman, S.; Li, A.Q.; Rekleitis, I. SVIn2: An Underwater SLAM System Using Sonar, Visual, Inertial, and Depth Sensor. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1861–1868. [Google Scholar]
  70. Schonberger, J.L.; Frahm, J.-M. Structure-from-Motion Revisited. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  71. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
Figure 1. The overall framework of RAEM-SLAM.
Figure 2. Overall framework of the Physics-guided Underwater Adaptive Augmentation.
Figure 3. The structural diagram of the Residual Semantic–Spatial Attention Module.
Figure 4. The structural framework of the Local–Global Perception Block.
Figure 5. The structure of a single Self-Attention module.
Figure 6. Challenging segments in Archaeological Sequence 2: (a) High turbidity. (b) Suspended impurity interference.
Figure 7. Trajectory comparisons for selected Archaeological sequences. (a,b) Archaeological Sequence 1. (c,d) Archaeological Sequence 2. (e,f) Archaeological Sequence 5. (g,h) Archaeological Sequence 6.
Figure 8. Trajectory comparison on the AFRL dataset. (a,b) Submerged Bus scene image and trajectory plot. (c,d) Cave scene image and trajectory plot. (e,f) Fake Cemetery scene image and trajectory plot.
Table 1. ATE (m) of ablation study on ten Archaeological sequences.

| Variant | Seq. 1 | Seq. 2 | Seq. 3 | Seq. 4 | Seq. 5 | Seq. 6 | Seq. 7 | Seq. 8 | Seq. 9 | Seq. 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No PUAA | 0.092 | 1.950 | 0.026 | 0.356 | 0.072 | 0.073 | 0.070 | 0.075 | 0.229 | 0.082 | 0.303 |
| No RSSA | 0.214 | 0.040 | 0.087 | 0.785 | 0.058 | 0.088 | 0.119 | 0.055 | 0.249 | 0.113 | 0.181 |
| No LGP | 0.217 | 0.086 | 0.039 | 0.160 | 0.168 | 0.152 | 0.081 | 0.049 | 0.251 | 0.065 | 0.127 |
| RAEM | 0.076 | 0.026 | 0.023 | 0.183 | 0.046 | 0.071 | 0.078 | 0.047 | 0.232 | 0.062 | 0.084 |
Bold font represents the minimum error, while red font represents the maximum error.
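For readers reproducing these numbers: the ATE (absolute trajectory error) reported in Tables 1–4 is typically computed as in the TUM benchmark [59], i.e., the estimated trajectory is first aligned to the ground truth (for monocular systems, with a similarity transform to resolve the unknown scale) and the RMSE of the remaining translational residuals is reported in metres. The snippet below is a minimal NumPy sketch of that computation; the Umeyama-style alignment, function names, and synthetic trajectories are illustrative assumptions, not taken from the paper's evaluation code.

```python
# Minimal sketch of ATE RMSE under a TUM-benchmark-style protocol [59]:
# align the estimate to ground truth with a least-squares similarity
# transform (Umeyama), then take the RMSE of the translational residuals.
import numpy as np

def umeyama_alignment(est, gt, with_scale=True):
    """Least-squares similarity transform mapping est -> gt (both Nx3 arrays)."""
    mu_est, mu_gt = est.mean(axis=0), gt.mean(axis=0)
    x, y = est - mu_est, gt - mu_gt                  # centered point sets
    cov = y.T @ x / est.shape[0]                     # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # reflection correction
        S[2, 2] = -1.0
    R = U @ S @ Vt                                   # optimal rotation
    scale = np.trace(np.diag(D) @ S) / x.var(axis=0).sum() if with_scale else 1.0
    t = mu_gt - scale * R @ mu_est                   # optimal translation
    return scale, R, t

def ate_rmse(est, gt, with_scale=True):
    """RMSE of translational error (metres) after trajectory alignment."""
    s, R, t = umeyama_alignment(est, gt, with_scale)
    est_aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(np.mean(np.sum((est_aligned - gt) ** 2, axis=1))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(size=(500, 3)) * 0.01, axis=0)   # synthetic path
    est = gt + rng.normal(size=(500, 3)) * 0.02                # noisy estimate
    print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")
```

The Avg columns in Tables 1 and 3 are consistent with a simple arithmetic mean over the ten sequences (e.g., the RAEM row in Table 1 averages to 0.084 m).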
Table 2. ATE (m) of comparative experiments with state-of-the-art SLAM on the EuRoC dataset.

| Category | Method | MH01 | MH02 | MH03 | MH04 | MH05 | V101 | V102 | V103 | V201 | V202 | V203 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Traditional | ORB-SLAM [8] | 0.071 | 0.067 | 0.071 | 0.082 | 0.060 | 0.015 | 0.020 | - | 0.021 | 0.018 | - | - |
| Traditional | DSO [10] | 0.046 | 0.046 | 0.172 | 3.810 | 0.110 | 0.089 | 0.107 | 0.903 | 0.044 | 0.132 | 1.152 | 0.601 |
| Traditional | SVO [62] | 0.100 | 0.120 | 0.410 | 0.430 | 0.300 | 0.070 | 0.210 | - | 0.110 | 0.110 | 1.080 | - |
| Traditional | DSM [63] | 0.039 | 0.036 | 0.055 | 0.057 | 0.067 | 0.095 | 0.059 | 0.076 | 0.056 | 0.057 | 0.784 | 0.126 |
| Traditional | ORB-SLAM3 [64] | 0.016 | 0.027 | 0.028 | 0.138 | 0.072 | 0.033 | 0.015 | 0.033 | 0.023 | 0.029 | - | - |
| Deep Learning | DeepFactors [65] | 1.587 | 1.479 | 3.139 | 5.331 | 4.002 | 1.520 | 0.679 | 0.900 | 0.876 | 1.905 | 1.021 | 2.040 |
| Deep Learning | DeepV2D [66] | 0.739 | 1.144 | 0.752 | 1.492 | 1.567 | 0.981 | 0.801 | 1.570 | 0.290 | 2.202 | 2.743 | 1.298 |
| Deep Learning | DeepV2D (TartanAir) [66] | 1.614 | 1.492 | 1.635 | 1.775 | 1.013 | 0.717 | 0.695 | 1.483 | 0.839 | 1.052 | 0.591 | 1.173 |
| Deep Learning | TartanVO [67] | 0.639 | 0.325 | 0.550 | 1.153 | 1.021 | 0.447 | 0.389 | 0.622 | 0.433 | 0.749 | 1.152 | 0.680 |
| Deep Learning | DROID (Mono) [18] | 0.041 | 0.016 | 0.027 | 0.048 | 0.052 | 0.035 | 0.016 | 0.023 | 0.021 | 0.014 | 0.026 | 0.029 |
| Deep Learning | RAEM-SLAM | 0.013 | 0.020 | 0.022 | 0.031 | 0.038 | 0.026 | 0.026 | 0.019 | 0.019 | 0.014 | 0.012 | 0.022 |
Bold font represents the minimum error, while - represents tracking failure.
Table 3. ATE (m) of comparative experiments with state-of-the-art SLAM on the Archaeological dataset.

| Method | Seq. 1 | Seq. 2 | Seq. 3 | Seq. 4 | Seq. 5 | Seq. 6 | Seq. 7 | Seq. 8 | Seq. 9 | Seq. 10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM3 (Mono) [64] | - | - | - | - | 0.274 | - | 0.581 | 0.101 | 0.332 | 0.301 | - |
| VINS-Mono [71] | - | 0.672 | 0.592 | 4.349 | 3.175 | 0.993 | 3.180 | - | - | 1.912 | - |
| SL-SLAM (Mono) [40] | 0.727 | - | - | - | 0.070 | 0.305 | 0.166 | 0.045 | 0.325 | 0.257 | - |
| DROID (Mono) [18] | 0.465 | 2.621 | 0.042 | 0.902 | 0.212 | 0.161 | 0.091 | 0.051 | 0.234 | 0.077 | 0.486 |
| RAEM-SLAM | 0.076 | 0.026 | 0.023 | 0.183 | 0.046 | 0.071 | 0.078 | 0.047 | 0.232 | 0.062 | 0.084 |
Bold font represents the minimum error, while - represents tracking failure.
Table 4. Comparison results of ATE (m) on the AFRL dataset.

| Scene | DROID-SLAM | RAEM-SLAM (Ours) | Error Reduction |
|---|---|---|---|
| Submerged Bus | 2.386 | 1.198 | 49.8% |
| Cave | 1.185 | 0.458 | 61.4% |
| Fake Cemetery | 2.285 | 1.666 | 27.1% |
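The Error Reduction column is consistent with the usual relative improvement, (ATE_DROID − ATE_RAEM) / ATE_DROID. The short check below reproduces the percentages from the ATE values in Table 4; the dictionary layout and variable names are illustrative only.

```python
# Sanity check of the Error Reduction column in Table 4, assuming it is the
# relative ATE improvement of RAEM-SLAM over DROID-SLAM on each scene.
ate = {
    "Submerged Bus": (2.386, 1.198),
    "Cave": (1.185, 0.458),
    "Fake Cemetery": (2.285, 1.666),
}
for scene, (droid, raem) in ate.items():
    print(f"{scene}: {(droid - raem) / droid:.1%}")  # 49.8%, 61.4%, 27.1%
```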