Article

Dual-Branch Feature Generalization Method for AUV Near-Field Exploration of Hydrothermal Areas

1 Science and Technology on Underwater Vehicle Laboratory, Harbin Engineering University, Harbin 150001, China
2 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(12), 2359; https://doi.org/10.3390/jmse12122359
Submission received: 24 November 2024 / Revised: 10 December 2024 / Accepted: 20 December 2024 / Published: 22 December 2024
(This article belongs to the Section Ocean Engineering)

Abstract

The simultaneous localization and mapping (SLAM) technique provides long-term near-seafloor navigation for autonomous underwater vehicles (AUVs). However, the stability of descriptors generated by interest point detectors remains a challenge in the hydrothermal environment. This paper proposes a dual-branch feature generalization method, incorporating volumetric density and color distribution for enhanced robustness. The method utilizes shared descriptors and a feature confidence mechanism, combining neural radiance fields with Gaussian splatting models, ensuring fast and accurate feature generalization. The proposed approach improves recall while maintaining matching accuracy, ensuring stability and robustness in feature matching. This method achieves stable and reliable feature matching in a simulated hydrothermal environment.

1. Introduction

The near-field exploration of hydrothermal fields generates complex visual data, which are critical for advancing marine geological and biological research [1,2]. However, precisely localizing autonomous underwater vehicles (AUVs) in such unstructured and confined environments remains a formidable challenge. The emergence of simultaneous localization and mapping (SLAM) technologies, influenced by advancements in autonomous driving, unmanned aerial systems, and virtual/augmented reality, offers innovative solutions. Visual SLAM (VSLAM) utilizes cameras to capture environmental data and derive the vehicle’s pose by analyzing inter-frame relationships [3]. This approach constructs an environmental map in real time and determines the AUV’s position within it, enabling autonomous navigation and positioning.
However, studies on open-source algorithms reveal that visual interest point detectors (IPDs) and VSLAM systems often experience significant positioning errors in environments with poor textures or subtle features [4,5,6]. One primary reason is the inability to simultaneously ensure both robustness and uniqueness during descriptor matching. Common visual challenges in underwater environments, such as insufficient lighting, low image contrast, and color attenuation, exacerbate this issue by causing descriptors to fail in maintaining consistency under transformations like rotation across different viewpoints, further impairing AUV visual positioning [7,8]. While some approaches have integrated additional sensors, such as sonar or depth meters, to achieve moderate success in underwater visual–inertial odometry (VIO) [9,10,11], enhancing descriptor performance remains critical for IPDs to reliably estimate AUV posture changes.
The performance of a VSLAM system heavily relies on the stability of features generated by the descriptor algorithm. However, commonly used interest point detectors (IPDs) in underwater SLAM, such as scale-invariant feature transform (SIFT), speeded-up robust features (SURFs), and oriented FAST and rotated BRIEF (ORB), were originally designed for air-mediated environments [12,13,14]. The descriptors produced by these methods often fail to achieve successful matches when significant viewpoint changes occur, primarily due to the challenging underwater imaging conditions [15]. Additionally, intense variations in illumination caused by AUV-mounted lighting further hinder accurate descriptor matching. While deep neural networks offer superior adaptation to environmental changes and better robustness for illumination variations compared to traditional methods [16,17,18], their descriptors focus primarily on uniqueness, with limited robustness. This limitation reduces the number of reliably repeatable matches, which is insufficient to meet the demands of underwater SLAM applications.
In response to the aforementioned constraints, this paper proposes a dual-branch feature generalization method for the optical–acoustic fusion interest point detector (OAF-IPD) system, as illustrated in Figure 1 [19]. This method introduces a feature generalization approach based on 3D reconstruction, where the reconstructed model is used to capture the color information of interest points from different viewpoints to form generalized features. The NeRF (neural radiance field) model is applied to enhance the robustness of feature matching across diverse perspectives [20]. Given the relatively slow processing speed of the NeRF model, a dual-branch structure was designed. One branch leverages the NeRF model for robust feature generalization, while the other utilizes a Gaussian splatting model to enable faster generalization [21]. To optimize the output, the two branches share descriptors and a feature confidence evaluation module, ensuring the optimal balance between the contributions of each model. This design achieves both fast and accurate feature generalization.
To significantly enhance the robustness and accuracy of AUV self-localization in the hydrothermal near-seafloor environment, the dual-branch generalization method addresses the challenge of the low matching accuracy of traditional descriptors under varying viewpoints in underwater conditions. By applying radiance-field-based generalization to descriptors, this method effectively improves their adaptability. The dual-branch generalization method aims to establish a more reliable and robust descriptor framework tailored to the complex conditions of hydrothermal near-seafloor environments. The primary contributions are as follows:
  • Dual-branch generalization model: combines the visual realism of NeRF with the computational efficiency of 3D Gaussian splatting, achieving both offline generation of high-fidelity descriptors and real-time generalization for SLAM tasks;
  • Shared descriptor space for multi-modal fusion: aligns NeRF’s optical features with the geometric features of 3D Gaussian splatting in a unified descriptor space;
  • Generalized feature confidence mechanism: dynamically adjusts the reliability of generalized descriptors to balance robustness and accuracy in feature matching, addressing challenges in environments with varying lighting and viewpoints.
This paper is structured as follows: Section 2 provides a brief introduction to related works; Section 3 offers the proposed dual-branch generalization method; and Section 4 presents results from challenging underwater environments. Finally, Section 5 concludes with a summary and directions for future work.

2. Related Works

BEBLID enhances traditional binary descriptors such as ORB and BRISK by using a boosting algorithm to select the most effective binary tests, improving descriptor matching performance without increasing computational complexity [22]. This makes it efficient and suitable for real-time applications. However, its reliance on binary comparisons means it struggles with robustness in environments with poor underwater lighting, where variations in illumination and viewpoint significantly affect matching performance.
FeatureBooster is a framework designed to enhance the robustness of existing descriptors like SIFT and ORB through supervised learning [23]. By learning task-specific optimizations, it adapts descriptors to better handle changes in lighting, viewpoint, and environmental conditions, making it effective in challenging real-world scenarios. Nonetheless, in underwater environments where lighting is highly inconsistent and unpredictable, the reliance on pre-trained models and task-specific tuning can reduce its ability to maintain consistent feature matching.
PoSFeat introduces a feature-preserving sampling technique for point clouds, dynamically adjusting sampling density based on the saliency of points [24]. This ensures robust feature preservation in both sparse and dense regions, making it particularly effective for point cloud registration, segmentation, and large-scale 3D reconstruction. However, in underwater environments with poor illumination, the saliency-based sampling approach might fail to identify and preserve key features consistently, leading to reduced matching robustness.
“You Only Hypothesize Once” proposes a novel point cloud registration method that uses rotation-equivariant descriptors and a hypothesis-reduction strategy to improve robustness against rotational variations and enhance computational efficiency [25]. While this method achieves state-of-the-art performance in registration tasks, it may face challenges in underwater scenarios where uneven lighting and noise disrupt the reliability of descriptors, compromising its matching robustness.
In previous research, our team proposed the OAF-IPD method, which successfully integrated sonar and camera data to extract fused features and generate descriptors. The detailed structure of that model is shown in Figure 2.
This approach demonstrated the feasibility of sensor data fusion for robust feature extraction in underwater environments, leveraging the complementary nature of sonar and optical data to improve the stability and accuracy of descriptor generation. The OAF-IPD significantly enhances the robustness and accuracy of AUV self-localization in hydrothermal near-seafloor environments. By addressing the limitations of individual sensors, OAF-IPD employs a deep learning framework to fuse data from multiple sensors, ensuring robust and accurate localization. It is designed to function effectively in the complex and dynamic conditions of hydrothermal areas, providing a highly reliable, unsupervised learning-based interest point detection (IPD) method. The key contributions of OAF-IPD are as follows:
  • Fusion Module: Integration of the Feature Pyramid Network (FPN) into the UnsuperPoint framework to improve multi-sensor data fusion, enabling the extraction of more comprehensive and reliable features.
  • Depth Module: A specialized module designed to ensure a uniform distribution of interest points in depth, significantly enhancing localization accuracy by maintaining balanced spatial coverage of detected features.
  • Unsupervised Training Strategy: Introduction of an innovative unsupervised training approach, including the following: an auto-encoder framework for encoding sonar data, a ground truth depth generation framework to support depth module training, and a mutually supervised framework to ensure effective training of the fusion and depth modules without reliance on extensive labeled datasets.
  • Non-Rigid Feature Filter: Development of a camera data encoder equipped with a non-rigid feature filter to exclude features from non-rigid structures, such as smoke emitted from hydrothermal vents, thus mitigating environmental noise and interference.
These advancements make OAF-IPD a robust solution for AUV localization in hydrothermal near-seafloor environments, addressing critical challenges inherent in underwater operations.
However, in practical applications, the detection range limitations of sonar and cameras pose significant challenges. Cameras often have blind spots in distant areas, while sonar is limited to detecting objects within its vertical field of view, leaving a large blind zone in the height dimension. In real-world scenarios, it is common for feature points to be derived from a single sensor, which undermines the stability of the front end of fusion-based SLAM systems.
The OAF-IPD method addresses this issue primarily at the feature extraction stage, but it still encounters challenges when the overlap between camera and sonar detection areas is limited. In such cases, the method may suffer from an insufficient number of matchable features, which affects the robustness and performance of the system.
In deep-sea environments, where lighting is solely provided by equipment carried by the vehicle, significant appearance changes occur frequently due to variations in illumination and viewing angles. This places high demands on the robustness of descriptor matching. Moreover, to filter out outliers caused by occlusion and missing points, descriptors must also exhibit uniqueness. These requirements create two conflicting objectives—robustness and uniqueness—that are challenging to satisfy simultaneously through optimization in an IPD-only approach.
Based on OAF-IPD, a dual-branch generalization model was designed. This model ensures robust matching of descriptors under varying lighting intensities and viewing angles by generalizing the optical–acoustic fused features. Furthermore, it incorporates a descriptor updating mechanism to maintain the uniqueness of descriptors, thereby achieving a balance between robustness and uniqueness in multimodal feature matching.

3. Method

3.1. Construction of Optical–Acoustic Fused Feature Descriptors

The descriptors extracted by OAF-IPD are effective at distinguishing interest points, ensuring high uniqueness in matching. Building on this foundation, new optical–acoustic fused features are constructed while maintaining the matching uniqueness of the original descriptors. Specifically, 259 additional dimensions are appended to the 256-dimensional floating-point descriptors extracted by OAF-IPD, as illustrated in Figure 3.
The newly added dimensions are defined as follows:
Dimensions 1–256: These correspond to the original optical–acoustic fusion features d generated by OAF-IPD.
Dimensions 257–512: These represent the generalized features d̃, which capture the color and volumetric density of the interest point under different viewing angles and lighting conditions. These features are derived using the generalization methods described in the following sections.
Dimension 513: This indicates the feature type c_m of the interest point: acoustic, optical, or fused.
Dimension 514: This records the matching status c_d of the interest point, with values indicating successful matching within the same modality, successful cross-modal matching, successful fusion-feature matching, or no successful match.
Dimension 515: This stores the cumulative matching score s_m, which is computed from the score of each individual match.
The data structure of the fused feature descriptors is summarized in Table 1. This extended descriptor structure allows for more detailed and dynamic feature representations, accommodating variations in viewpoint and illumination, and providing a robust system for multimodal feature matching in the context of optical–acoustic fusion.
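To make the layout in Table 1 concrete, the following sketch assembles the 515-dimensional descriptor from its parts; the function name, array types, and encodings of the last three fields are illustrative assumptions rather than the OAF-IPD implementation.

```python
import numpy as np

def build_extended_descriptor(d_fused, d_generalized, feature_type, match_status, match_score):
    """Assemble the 515-D descriptor: 256 fused dims (1-256), 256 generalized dims (257-512),
    feature type c_m (513), matching status c_d (514), cumulative score s_m (515).
    The scalar field encodings are illustrative assumptions."""
    assert d_fused.shape == (256,) and d_generalized.shape == (256,)
    tail = np.array([feature_type, match_status, match_score], dtype=np.float32)
    return np.concatenate([d_fused.astype(np.float32),
                           d_generalized.astype(np.float32),
                           tail])                      # shape (515,)

# Example with placeholder values (2 = fused feature type, 0 = not yet matched).
desc = build_extended_descriptor(np.random.rand(256), np.random.rand(256), 2.0, 0.0, 0.0)
print(desc.shape)  # (515,)
```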

3.2. Feature Generalization Method Based on NeRF

In hydrothermal zone detection, illumination relies solely on the AUV’s onboard light source. Under these lighting conditions, the AUV’s movement causes significant changes in image features, leading to substantial differences in the descriptors of the same point between two frames, which in turn reduces the recall rate of the matching process. To enable descriptors to retain rematchability under different viewpoints and lighting conditions, this section proposes a feature generalization method based on neural radiance fields (NeRF). By generalizing optical features, the method enhances matching robustness, thereby improving the recall rate of interest point matching in the SLAM front end when lighting conditions are limited.
Neural radiance fields (NeRF) is a 3D scene rendering technique represented by neural networks. It models each point in a scene using a multilayer perceptron (MLP), where the input is the point’s 3D coordinates and viewing direction, and the output includes the point’s color and volume density, representing transparency. By sampling rays in a scene, NeRF can synthesize images from different viewpoints. In the context of feature generalization, NeRF not only generates high-quality views but also learns mappings between features observed from different viewpoints, enhancing robustness in cross-modal or multi-view feature matching.
NeRF encodes spatial coordinates and viewing directions using Fourier features, which transform the inputs into a series of sine and cosine functions to capture high-frequency information. These encoded inputs are then fed into a neural network composed of multiple fully connected layers, each using a ReLU activation function. The network outputs the RGB values and volume density for each sampled point. NeRF samples multiple points along each ray within the scene, calculates the color and density of each point, and integrates these values using a volumetric rendering formula to compute the final color of the ray. During training, NeRF minimizes the mean squared error between the predicted and ground truth images to optimize network parameters, improving the quality of the synthesized images.
The input to the NeRF network is a 5-dimensional vector, consisting of a point’s position x = (x, y, z) in space and the viewing direction d = (θ, ϕ) of the ray. The output consists of the color c = (r, g, b) and the volume density σ. The NeRF multi-layer perceptron (MLP) is represented as F_Θ: (x, d) → (c, σ). To ensure consistent representation across different angles, the volume density σ is predicted based solely on the position x, while the RGB color c is predicted as a function of both position and viewing direction.
To achieve this, F_Θ processes the 3D coordinates x using 8 fully connected layers (with ReLU activation and 256 channels per layer), outputting σ and a 256-dimensional feature vector. This feature vector is then concatenated with the viewing direction of the camera ray and passed through an additional fully connected layer (with ReLU activation and 128 channels), whose output is the RGB color associated with that viewing direction. Figure 4 shows an overview of the NeRF scene representation and the differentiable rendering procedure.
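As a reference for the architecture just described, here is a minimal PyTorch sketch of the MLP F_Θ: eight 256-channel ReLU layers acting on the Fourier-encoded position, a view-independent density head, and a 128-channel direction-conditioned color head. The encoded input sizes (63 and 27) and the omission of NeRF’s skip connection are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of the NeRF MLP described above (skip connection omitted)."""
    def __init__(self, pos_dim=63, dir_dim=27):        # Fourier-encoded input sizes (assumed)
        super().__init__()
        layers, in_dim = [], pos_dim
        for _ in range(8):                              # 8 fully connected layers, 256 channels, ReLU
            layers += [nn.Linear(in_dim, 256), nn.ReLU()]
            in_dim = 256
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(256, 1)             # volume density from position only
        self.feature = nn.Linear(256, 256)              # 256-D feature passed to the color branch
        self.color_head = nn.Sequential(                # direction-conditioned color, 128 channels
            nn.Linear(256 + dir_dim, 128), nn.ReLU(), nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))          # density is view-independent
        rgb = self.color_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.randn(1024, 63), torch.randn(1024, 27))
```

Because the density head is applied before the viewing direction is introduced, σ remains consistent across viewing angles, as required above.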
The NeRF model achieves feature generalization by estimating the position of points and the corresponding camera viewing angles to derive color and volume density information. Specifically, it samples interest points from 64 different directions, capturing the color and density data of each point. This approach enhances the adaptability of feature descriptors to variations in viewpoint and lighting conditions, improving robustness and stability in complex scenarios and enabling the matching process to identify more reliable correspondences.
As shown in Figure 5, the output of the neural radiance field generalization framework is an 8 × 8 × 4 generalized feature. This output includes 8 horizontal angles (θ) and 8 vertical angles (ϕ), along with four predicted values from NeRF: the RGB values representing three color channels and the volume density (σ).
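A sketch of how the 8 × 8 × 4 generalized feature could be assembled for one interest point: the trained field is queried over a fixed grid of 8 horizontal and 8 vertical directions, and the RGB values and density are stored per direction. The query_nerf interface is a placeholder assumption for the trained model plus its positional encoding; note that 8 × 8 × 4 = 256 values, matching descriptor dimensions 257–512.

```python
import numpy as np

def generalize_point(point_xyz, query_nerf, n_theta=8, n_phi=8):
    """Build an (8, 8, 4) generalized feature (RGB + sigma per viewing direction),
    flattened to 256 values. query_nerf(xyz, theta, phi) -> (r, g, b, sigma) is assumed."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)   # horizontal angles
    phis = np.linspace(-0.5 * np.pi, 0.5 * np.pi, n_phi)              # vertical angles
    feat = np.zeros((n_theta, n_phi, 4), dtype=np.float32)
    for i, th in enumerate(thetas):
        for j, ph in enumerate(phis):
            feat[i, j] = query_nerf(point_xyz, th, ph)
    return feat.reshape(-1)   # 256 values -> descriptor dimensions 257-512

# Dummy query for illustration only.
feat = generalize_point(np.zeros(3), lambda p, th, ph: (0.1, 0.2, 0.3, 1.0))
print(feat.shape)  # (256,)
```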
The core of the multi-angle feature generalization process relies on using the NeRF model for the volumetric rendering of optical images. This involves generating RGB values and volume density for new viewpoints. Volumetric rendering samples multiple points along the direction of a ray passing through the scene, computes the color and density of each point, and integrates these values to calculate the final color of the ray. For every pixel in the scene, a ray is cast from the camera’s position in the direction of the pixel’s line of sight, represented as:
\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}
In the equation, o represents the camera’s position, d the direction of the ray, and t the distance parameter along the ray.
NeRF uses a five-dimensional input (spatial coordinates and viewing direction) to represent the volume density and the radiance emitted at each point of the scene. Based on the principles of volumetric rendering, the expected color C(r) of a ray r(t), bounded by the near boundary t_n and the far boundary t_f, can be expressed as:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt
where T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) is the accumulated transmittance along the ray from t_n to t, representing the probability that the ray travels without being absorbed; σ(r(t)) is the volume density at a point along the ray, indicating the likelihood of light being scattered or absorbed; and c(r(t), d) is the color emitted at the point r(t) in the direction d. This formulation integrates the contribution of each point along the ray to compute the final color observed from the camera, enabling NeRF to model complex light interactions and generate realistic views from novel perspectives.
By accumulating the color and density contributions of all sampled points along a ray, the final color of the ray is computed. This process determines the pixel value in the image corresponding to the ray, thereby capturing the associated feature.
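In practice, this integral is approximated by sampling points along the ray and applying the standard NeRF quadrature rule; the sketch below illustrates the accumulation and is not the authors’ exact implementation.

```python
import torch

def render_ray(rgb, sigma, t_vals):
    """Numerical quadrature of C(r) = \int T(t) sigma(r(t)) c(r(t), d) dt.
    rgb: (N, 3), sigma: (N,), t_vals: (N,) sample distances along one ray."""
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.tensor([1e10])])           # last interval treated as open
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                      # T(t_i) * (1 - exp(-sigma_i * delta_i))
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)              # final ray color

# Example: 64 samples between the near and far bounds of one ray.
t = torch.linspace(2.0, 6.0, 64)
color = render_ray(torch.rand(64, 3), torch.rand(64), t)
```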

3.3. Dual-Branch Generalization Model

NeRF has shown outstanding performance in feature generalization tasks due to its ability to model fine-grained volume density and color distributions, resulting in highly realistic visual effects. However, its high computational cost makes it unsuitable for real-time applications in VSLAM systems. Instead, NeRF is more suitable as an offline descriptor generation module that precomputes high-quality feature descriptors to enhance system performance in complex environments. To address the limitations of NeRF in real-time processing, it is critical to develop an efficient feature generalization method capable of supporting real-time VSLAM operations, enhancing robustness and adaptability in multimodal or dynamic scenes.
The NeRF model excels at learning and reconstructing detailed lighting and shadow effects by modeling light propagation and scattering in 3D space. It provides smooth and coherent transitions across continuous viewpoint changes, enabling high-quality image reconstruction from any angle. On the other hand, 3D Gaussian splatting models are highly efficient in noise suppression and data smoothing, effectively handling noisy or incomplete datasets. By describing data points with Gaussian distributions, these models are particularly robust when processing irregular data such as point clouds, excelling at identifying shapes and structures within sparse datasets. Compared to NeRF, Gaussian models are computationally less intensive, faster, and easier to implement, making them ideal for resource-constrained applications.
This paper proposes a dual-branch generalization model that combines the strengths of NeRF and 3D Gaussian models to achieve both online and offline feature generalization. The online branch leverages the efficiency and uncertainty modeling capabilities of 3D Gaussian models for real-time feature generalization, while the offline branch uses NeRF to generate high-quality feature descriptors, enhancing feature robustness for the system. Figure 6 illustrates the basic flowchart of the two-branch generalization model.

3.3.1. 3D Gaussian Splatting Model

The 3D Gaussian splatting model is a statistical approach for representing the probabilistic distribution of points in three-dimensional space. Its core concept is to use Gaussian distributions to smoothly represent point clouds or volumetric data, offering a continuous and noise-resistant spatial representation, which contrasts with traditional discrete point-based models. This model effectively reduces noise and enhances the smoothness and consistency of results in tasks such as 3D reconstruction and rendering.
The 3D Gaussian splatting model defines the center and range of a point’s distribution using a mean vector and a covariance matrix. This representation is particularly effective for handling high-noise and uncertain data, as illustrated in Figure 7. When compared to NeRF, the primary advantage of the 3D Gaussian splatting model lies in its capability to model uncertainty. Through Gaussian distributions, it intuitively captures the spatial distribution of feature points, making it especially suitable for generalizing features in noisy environments. Additionally, the model offers high computational efficiency, particularly for small-scale datasets, enabling rapid generation of feature representations and seamless integration with other models.

3.3.2. Shared Descriptors

In the feature generalization process of NeRF and 3D Gaussian splatting models, the first challenge is to map these two different types of data representations into a shared descriptor space. NeRF, by learning the radiance field of a scene, generates high-fidelity 3D reconstructions and lighting effects, extracting rich visual features in the process. Its core capability is the use of neural networks to create continuous 3D scene representations from multi-view images, offering highly detailed and viewpoint-consistent features. To integrate NeRF-generated features with the geometric and spatial features captured by the Gaussian splatting model, it is necessary to embed these visual features into a unified descriptor space, ensuring that NeRF’s features work collaboratively with the point cloud structure represented by Gaussian distributions.
On the other hand, the Gaussian splatting model captures the geometric structure of point clouds using statistical distributions, describing data points through Gaussian distributions. This approach is particularly effective for handling irregular data and noise suppression. To achieve feature generalization, during the feature sharing process, the spatial distribution capabilities of the Gaussian splatting model are leveraged to align and fuse NeRF’s optical features with the geometric features of the point cloud. This shared descriptor process combines the strengths of both models within a unified space: the robust geometric structure of the Gaussian model complements the high-precision visual information from NeRF. This integration enhances the system’s ability to generalize features under varying viewpoints and lighting conditions.
NeRF generates a continuous 3D scene representation by learning the radiance field of a scene through neural networks, modeling the color C of each pixel as a function C = f(r, d), where r is the origin of the ray and d is its direction. This representation is optimized by minimizing a loss function during training:
\mathcal{L}_{\mathrm{NeRF}} = \sum_i \left\lVert C_{\mathrm{pred},i} - C_{\mathrm{true},i} \right\rVert^2
On the other hand, the Gaussian splatting model uses Gaussian distributions to describe the geometric structure of point clouds. Assuming that a point cloud data point x_i follows a Gaussian distribution, its density can be expressed as:
p(\mathbf{x}_i) = \frac{1}{(2\pi)^{d/2}\,\lvert \Sigma \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right)
where μ is the mean and Σ is the covariance matrix. This geometric feature can be mapped to the same descriptor space as the optical features extracted by NeRF. By using shared descriptors, the high-resolution visual information provided by NeRF can be integrated with the geometric structure captured by Gaussian distributions.
The integration is achieved by optimizing a joint loss function:
\mathcal{L}_{\mathrm{shared}} = \alpha\, \mathcal{L}_{\mathrm{NeRF}} + \beta\, \mathcal{L}_{\mathrm{Gaussian}}
where α and β are weight factors that control the relative contributions of the two losses. This shared process enhances the system’s generalization capability under varying viewpoints and lighting conditions, significantly improving the model’s robustness and accuracy in handling complex environments.
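A hedged sketch of the joint objective: a photometric NeRF term and a Gaussian term combined with weights α and β. The Gaussian term is written here as a negative log-likelihood of points under their Gaussian, which is one plausible reading of L_Gaussian; the exact form and the weight values are not specified in the text.

```python
import math
import torch

def nerf_loss(c_pred, c_true):
    # L_NeRF = sum_i || C_pred,i - C_true,i ||^2
    return ((c_pred - c_true) ** 2).sum()

def gaussian_nll(x, mu, cov):
    # Negative log-likelihood of points x (n, d) under N(mu, cov): one possible L_Gaussian.
    d = x.shape[-1]
    diff = x - mu
    inv = torch.linalg.inv(cov)
    maha = torch.einsum('ni,ij,nj->n', diff, inv, diff)
    return 0.5 * (maha + torch.logdet(cov) + d * math.log(2.0 * math.pi)).sum()

def shared_loss(c_pred, c_true, x, mu, cov, alpha=1.0, beta=0.1):
    # L_shared = alpha * L_NeRF + beta * L_Gaussian (alpha, beta values are assumptions)
    return alpha * nerf_loss(c_pred, c_true) + beta * gaussian_nll(x, mu, cov)

# Example with random placeholders.
loss = shared_loss(torch.rand(10, 3), torch.rand(10, 3),
                   torch.randn(50, 3), torch.zeros(3), torch.eye(3))
```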

3.3.3. Feature Confidence

To address the limitations of NeRF in descriptor accuracy, a confidence mechanism is introduced to balance the use of generalized and non-generalized descriptors. NeRF-generated descriptors may suffer from issues such as instability under large viewpoint changes, dependence on high-quality training data, feature sparsity, and poor adaptability to dynamic scenes. These factors can reduce the accuracy of descriptors in certain scenarios.
In this system, raw descriptors, which are non-generalized, are assigned the highest confidence levels by default. Generalized descriptors, on the other hand, start with a lower initial confidence. To ensure reliability, the system increases the confidence of generalized descriptors after successful matches. When a generalized descriptor achieves two successful matches, its confidence is raised to match that of non-generalized descriptors. This approach ensures descriptors become more trustworthy as their effectiveness is demonstrated.
The confidence C_t of a descriptor at any time is updated using the formula:
C_t = C_0 + k \cdot \min(n, N)
where C_0 is the initial confidence, k is the confidence increment for each successful match, n is the number of successful matches, and N is the threshold at which the confidence equals that of non-generalized descriptors.
For example, if the initial confidence of a generalized descriptor is set to 0.5, with a step size k of 0.25 and a threshold N of 2 matches, its confidence increases to 1.0 after two successful matches, equaling the confidence of non-generalized descriptors. This confidence mechanism ensures that the system adapts to varying descriptor reliability while gradually improving the accuracy of generalized features through successful matching.
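The update rule and the worked example above (C_0 = 0.5, k = 0.25, N = 2) can be captured in a few lines; the class and method names are illustrative.

```python
class DescriptorConfidence:
    """Confidence C_t = C_0 + k * min(n, N) for a generalized descriptor."""
    def __init__(self, c0=0.5, k=0.25, n_max=2):
        self.c0, self.k, self.n_max = c0, k, n_max
        self.n_matches = 0

    def register_match(self):
        self.n_matches += 1

    @property
    def value(self):
        return self.c0 + self.k * min(self.n_matches, self.n_max)

conf = DescriptorConfidence()
conf.register_match()
conf.register_match()
assert conf.value == 1.0   # reaches the non-generalized level after two successful matches
```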

3.3.4. Training

The NeRF model in the dual-branch generalization system uses a deep neural network and requires training before application. The training data consists of underwater images collected in a simulated hydrothermal environment within a controlled pool setting. The training process follows the original NeRF framework.
The NeRF generalization model is trained using multi-view underwater images, with camera viewpoints obtained through 3D reconstruction of the scene. The model is trained with 1024 randomly selected rays per batch. Each ray samples 32 coordinates in the coarse generalization phase and 64 coordinates in the fine generalization phase.
The Adam optimizer is used for training, with an initial learning rate of 5 × 10⁻⁴ that decays to 5 × 10⁻⁵ over the course of training. The default Adam hyperparameters are applied: β₁ = 0.9, β₂ = 0.999, ϵ = 10⁻⁷. For each individual scene, the model requires 100,000 to 200,000 iterations to converge. Training is performed using an NVIDIA 1080Ti GPU.
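For reference, a minimal sketch of the optimizer configuration described above (Adam with a learning rate decaying from 5 × 10⁻⁴ to 5 × 10⁻⁵, default betas, ϵ = 10⁻⁷). The stand-in model, the exponential decay shape, and the dummy loss are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(63, 4)          # stand-in for the NeRF MLP (encoded input -> RGB + sigma)
n_iters = 200_000                 # the paper reports 100k-200k iterations per scene
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), eps=1e-7)
# Decay the learning rate from 5e-4 to 5e-5 over training (exponential shape assumed).
gamma = (5e-5 / 5e-4) ** (1.0 / n_iters)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for step in range(3):             # a few illustrative iterations
    batch = torch.randn(1024, 63)            # 1024 randomly selected rays per batch
    target = torch.randn(1024, 4)
    loss = ((model(batch) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```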
This process ensures that the NeRF model generates high-quality features suitable for the dual-branch generalization system in underwater environments.

4. Experiments and Results

4.1. Data Acquisition Experiment

This section describes the construction process of the acoustic–optic joint dataset for hydrothermal zones. Data collection experiments were conducted in a comprehensive test pool simulating the underwater hydrothermal environment. Accurate ground truth trajectories of the vehicle were generated using 3D reconstruction techniques, resulting in a dataset suitable for NeRF training, descriptor generalization methods, and SLAM testing.

4.1.1. Experimental Setup

Based on videos and images of actual hydrothermal zones, detailed mound models were constructed to scale. These models successfully replicated the textures, colors, and reflective properties of mounds under different underwater lighting conditions. Figure 8 illustrates the comparison between real black smoker chimneys in hydrothermal zones and the models under illumination from light sources carried by underwater vehicles.
To simulate potential line-of-sight occlusions in hydrothermal zones, six obstacles with heights ranging from 0.8 to 1.4 m were also constructed, further enhancing the realism of the test environment.
The data acquisition experiment was conducted in a comprehensive test pool with dimensions of 50 m in length, 30 m in width, and 10 m in depth. The basic environment of the pool is shown in Figure 9. The mounds and obstacles were deployed at the bottom of the pool, with their positions shown in Figure 10. The tallest mound was placed at the center of the circular area.

4.1.2. Data Acquisition System

The acoustic–optic joint detection system used in the experiment consisted of a BlueROV (Blue Robotics, Torrance, CA, USA) equipped with an MSC3105 underwater camera (DeepSea Power & Light, San Diego, CA, USA) and an M900-Mk2 multibeam imaging sonar (Teledyne BlueView, Slangerup, Denmark), as shown in Figure 11. The BlueROV, developed by Blue Robotics, is an underwater remotely operated vehicle (ROV) designed for applications such as scientific research, engineering inspection, underwater photography, and education. It provides six degrees of freedom for motion control and demonstrates excellent maneuverability and stability underwater.
The optical sensing system utilized the IP-MSC3105 underwater camera produced by DeepSea Power & Light, as shown in Figure 12. This camera is equipped with a fixed-focus 2.7 mm f/2.9 lens, providing a horizontal field of view of 105° and a vertical field of view of 60° underwater. The IP-MSC3105 uses a 1/2.8″ CMOS image sensor, supports video output at a resolution of 1920 × 1080 at 30 FPS, and operates in low-light environments with a minimum illumination of 0.095 Lux.
The acoustic sensing system employed the M900-Mk2 multibeam imaging sonar produced by Teledyne BlueView, as shown in Figure 13. Operating at a frequency of 900 kHz, the sonar provides a maximum field of view of 130° and a detection range of up to 100 m. It features a horizontal beam width of 1° and a vertical beam width of 12°, delivering excellent image quality and a resolution of 1.3 cm at a maximum refresh rate of 25 Hz. This device is widely used in underwater navigation, target detection, obstacle avoidance, and area surveying. It connects via Ethernet, operates on 12–48 V DC power, and has a maximum power consumption of 20 W.
The imaging sonar was mounted at the center position below the ROV’s bottom battery compartment, 6 cm from the bottom surface, and angled downward at 3° relative to the ROV’s horizontal plane. The camera was aligned with the sonar in the same direction, with its optical axis located in the same vertical plane as the sonar and positioned parallel to it at a distance of 12 cm.

4.1.3. Data Acquisition Process

The construction of the acoustic–optic joint dataset involved data collection experiments, data statistics, data analysis, and data processing. Due to acoustic reflections in the pool environment, the collected sonar data required denoising, while the optical data needed to be compared with real hydrothermal zone footage to analyze optical properties. Additionally, the COLMAP 3D reconstruction tool was used to align the scene with the deployed model positions, providing ground truth for the ROV trajectory as a reference for evaluating the accuracy of the localization method.
To build the acoustic–optic joint dataset for training and testing the proposed SLAM method in simulated hydrothermal zone scenarios, data collection was conducted under various conditions. A total of 10 experiments were carried out along different paths, both with and without additional lighting (including comb-like scanning, close-range encircling, different entry and exit points, and clockwise and counterclockwise navigation). These experiments captured joint image sequences and sonar data.
While it is challenging to simulate environmental noise such as bubbles or distortions caused by thermal vents in an experimental pool environment, the image dataset used in our paper includes real-world data collected from hydrothermal environments. This real-world dataset was incorporated into the training of the NeRF model, which enhances the model’s robustness to disturbances in such environments.
In the experiments, additional lighting in the pool was used to acquire training data with more distinct features. Tests conducted without external lighting simulated the dark environment of the deep sea, relying solely on the onboard illumination of the vehicle to recreate the working conditions of hydrothermal zone exploration. In the experiment, data collected under additional lighting from the top of the comprehensive test basin was used for calculating the ground truth of the vehicle’s position. In the NeRF branch model, this lighting environment played a key role in stabilizing and generalizing features. The clear underwater feature distinction in this environment contributed to supervising the generation of descriptors during the model’s training process. On the other hand, the data collected using only the vehicle’s onboard LED lighting, which mimics real working conditions, represents the final data used by the model. Testing on this dataset reflects the model’s ultimate performance.
The BlueROV was operated at a designed speed of 0.5 m/s, with each data collection experiment lasting between 5 and 15 min and covering a navigation distance of 100 to 450 m. Examples of some experimental scenarios are shown in Figure 14. The velocity was set at 0.5 m/s to ensure safe operation during close-range observation in hydrothermal areas. This velocity was chosen because it is safe for AUVs when approaching the accumulation bodies in these environments. A higher speed could also negatively affect sonar and camera performance, increase the computational burden on the embedded computer, and heighten the risk of collision in complex terrain. This velocity has also been used as the expected velocity for AUVs in high-precision visual system localization tasks, as demonstrated in related experiments [26,27].

4.1.4. Negative Label Generation with OAF-IPD

To enhance the descriptor robustness of the model under varying perspectives and angular changes, the OAF-IPD method is used to generate negative sample labels. The aim is to identify feature point pairs that are challenging for the model to distinguish and label them as training negative samples. These pairs often exhibit high feature descriptor similarity but fail to pass geometric transformation constraints due to perspective differences, qualifying as typical hard negative samples. By reinforcing the model’s ability to discriminate these negative samples during training, the generalization of the descriptor in scenarios with significant angular changes can be improved.
The process of generating negative samples involves three steps. First, the OAF-IPD algorithm performs an initial match on the feature points between images, filtering out pairs that fail geometric verification (error tolerance under homography transformation). Second, for these pairs, the descriptor similarity (Hamming distance) is calculated to select pairs with similar descriptors but inconsistent geometric positions as negative samples. Finally, these hard negative samples are labeled and incorporated into the training dataset. Combined with positive samples, they enable the model to better learn how to differentiate matching relationships even in cases where descriptor features are ambiguous. The number of point pairs added to negative labels is the same as the number of point pairs with positive labels.
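Under stated assumptions, these three steps can be sketched as follows: candidate pairs from an initial OAF-IPD match are checked against a homography H, and pairs with a large reprojection error but a small Hamming distance between binary (or binarized) descriptors are kept as hard negatives. The thresholds and the function interface are illustrative, not taken from the authors’ pipeline.

```python
import numpy as np
import cv2

def mine_hard_negatives(kp1, kp2, desc1, desc2, matches, H,
                        reproj_thresh=3.0, ham_thresh=40):
    """Select pairs that look alike (small Hamming distance between uint8 descriptors)
    but fail the homography check (large reprojection error). Thresholds are assumptions."""
    negatives = []
    for i, j in matches:                           # index pairs into kp1 / kp2
        p1 = np.append(kp1[i], 1.0)                # homogeneous coordinates of keypoint 1
        proj = H @ p1
        proj = proj[:2] / proj[2]                  # project into the second image
        reproj_err = np.linalg.norm(proj - kp2[j])
        ham = cv2.norm(desc1[i], desc2[j], cv2.NORM_HAMMING)
        if reproj_err > reproj_thresh and ham < ham_thresh:
            negatives.append((i, j))               # hard negative: similar descriptor, wrong geometry
    return negatives
```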

4.1.5. Location

This paper utilized 3D reconstruction to determine the trajectory of the vehicle, optimizing it based on the placement positions of mounds to generate the ground truth of the vehicle’s location. This ground truth was used to evaluate the localization accuracy of the proposed SLAM method. Additionally, the constructed 3D model provided depth information, which was used to train the depth prediction module, thereby improving the overall performance of the method.
COLMAP, an advanced Structure-from-Motion (SfM) method, is employed to reconstruct 3D models using image sequences. In this paper, COLMAP was combined with a pixel-level feature metric optimization method to perform a 3D reconstruction of the pool test scene and generate the vehicle trajectory.
During the training of subsequent modules, depth data generated by the 3D reconstruction process was required as supervision. However, the accuracy provided by COLMAP alone was insufficient to meet the requirements. To address this, a pixel-level 3D reconstruction enhancement technique was applied to obtain a higher-precision hydrothermal zone reconstruction model. This method enhances SfM-based 3D reconstruction by optimizing multiple viewpoints along the same trajectory. By leveraging a deep neural network to extract features, it effectively mitigates the impact of geometric noise on 3D reconstruction. Fine reconstruction adjustments are performed through feature matching between image pairs, reducing feature matching errors and accurately estimating camera poses and the 3D scene structure.
This approach adopts a two-stage optimization strategy: first optimizing 2D keypoints and then refining the 3D points and camera poses obtained from SfM. The method demonstrates exceptional accuracy in local feature reconstruction and scalability for large-scale scenes. Compared to traditional geometric optimization methods, it significantly improves the accuracy of visual localization and the completeness of 3D reconstruction while maintaining computational efficiency.
The scale and accuracy of the reconstructed scene and trajectory were optimized using reference markers placed at the bottom of the test pool. The reconstruction achieved centimeter-level accuracy, while the trajectory accuracy reached decimeter-level precision. Figure 15 shows the 3D reconstruction of the hydrothermal zone scene generated by COLMAP, and Figure 16 presents the individual 3D reconstructions of the mound models and obstacles.

4.2. Metrics

In SLAM tasks for hydrothermal zones, the unique environmental conditions, such as poor underwater lighting, scattering effects, and complex terrain, pose significant challenges to the evaluation of matching algorithms. Key metrics for evaluating these algorithms include the following:
Accuracy refers to the ratio of correctly matched pairs to all attempted matches. It measures the overall effectiveness of the matching algorithm. In hydrothermal zone SLAM, accuracy assesses whether the algorithm can correctly identify and match keypoints across the entire scene. For example, in the complex terrain of hydrothermal zones, low accuracy in descriptor matching may lead to erroneous map construction or localization failure.
Precision is the ratio of correctly matched pairs to all predicted matches. High precision indicates fewer errors in identifying matches, with a low rate of mismatches. In the complex environment of hydrothermal zones, where mismatches are common, precision is critical. High precision ensures that the matching algorithm effectively avoids errors between similar but unrelated features, which is crucial for maintaining the reliability of SLAM system localization and map construction.
Recall is the ratio of correctly matched pairs to all actual matches. High recall means the algorithm effectively identifies most of the potential matches, minimizing missed matches. In hydrothermal zones, where feature points can be difficult to detect due to poor image quality or viewpoint changes, high recall is essential to ensure the algorithm captures valuable feature matches. High recall helps the SLAM system comprehensively capture environmental features, enabling the creation of more complete maps.
The F1 score is the harmonic mean of precision and recall, providing a comprehensive evaluation of the matching algorithm’s overall performance. A high F1 score indicates a good balance between precision and recall. In hydrothermal zone SLAM, the F1 score is used to balance precision and recall. In such environments, it is important to avoid mismatches (which affect localization accuracy) while ensuring sufficient correct matches (which ensure map completeness). The F1 score offers a unified metric to evaluate the algorithm’s performance, balancing robustness and uniqueness.
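For clarity, a small sketch computing these four match-level metrics from counts of correct matches (tp), incorrect matches (fp), and missed matches (fn); it follows the definitions above rather than the authors’ evaluation code, and the accuracy denominator (all attempted matches) is passed explicitly.

```python
def match_metrics(tp, fp, fn, total_attempted):
    """Accuracy, precision, recall, and F1 from match counts."""
    accuracy = tp / total_attempted if total_attempted else 0.0          # correct / all attempted
    precision = tp / (tp + fp) if (tp + fp) else 0.0                     # correct / all predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0                        # correct / all actual
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

print(match_metrics(tp=890, fp=234, fn=14, total_attempted=1000))
```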
AUC-RANSAC and AUC-LO-RANSAC are used to evaluate the angular robustness of descriptors in relative pose estimation tasks [28]. AUC-RANSAC uses the standard RANSAC algorithm to estimate the fundamental or essential matrix and calculates the cumulative distribution of rotation and translation angular errors (CEC). The area under this curve quantifies the descriptor’s matching performance in terms of angular accuracy. AUC-LO-RANSAC builds on this by incorporating local optimization (LO) to refine the estimated model, achieving higher precision and robustness. This makes AUC-LO-RANSAC a more reliable indicator of descriptor performance in noisy environments or scenarios with repetitive features. By balancing angular error minimization and the number of inliers, AUC-LO-RANSAC offers a comprehensive evaluation of descriptor robustness in complex conditions.

4.3. Results

The real-time execution of the methods studied in this paper was tested on a PC. The hardware environment included a GPU (GeForce 1080Ti, manufactured by Micro-Star International, located in New Taipei City, Taiwan), a CPU (Intel i7-8700K, manufactured by Intel Corporation, headquartered in Santa Clara, CA, USA), and 32 GB of RAM. For the required libraries, the environment was configured considering the dependencies for ORB-SLAM3, NeRF, and 3D-GS as outlined in the open-source repositories, including Python 3.7, CUDA 10.0, cuDNN 7.6.5, OpenCV 3.2, etc. For specific configurations, please refer to the relevant documentation for each method available on GitHub.
The generalization methods were tested using joint data sequences from 20 acoustic–optic datasets. Among these, the first five datasets featured brightly lit underwater scenes with additional light sources, while the remaining 15 datasets simulated exploration scenarios relying solely on the vehicle’s onboard lighting. The matching method employed the widely used brute-force matching (BF Matching) based on Hamming distance, as implemented in ORB-SLAM3. The image resolution used in the experiments was 640×480.

4.3.1. Comparison of Descriptor Generalization Effects

Because the negative samples were generated from mismatched OAF-IPD descriptors, the original (non-generalized) descriptors could not match any of these negative pairs, which results in some very low metric values for that baseline.
The generalization performance of acoustic–optic fused features was thoroughly compared and analyzed with a reprojection error of 3px. The average metrics are summarized in Table 2. In this comparison, “3D-GS” represents the 3D Gaussian splatting generalization model and “None” represents original descriptors generated by OAF-IPD. This paper evaluated the performance of three generalization models: the NeRF generalization model, the Gaussian splatting generalization model, and the dual-branch generalization model proposed in this paper.
In bright environments, using non-generalized original descriptors generated by OAF-IPD performs significantly worse compared to the three generalization methods, highlighting the challenges of using raw descriptors in bright environments. For accuracy, the None method achieves only 40.7%, whereas the best-performing dual-branch method reaches 89.0%, demonstrating an absolute improvement of 48.3%. Similarly, recall for the None method is limited to 44.9%, while NeRF and dual-branch increase recall to 98.6% and 98.4%, respectively, resulting in substantial absolute improvements of 53.7% and 53.5%.
Each of the three generalization methods has distinct strengths. NeRF, leveraging neural radiance field techniques, achieves the highest recall at 98.6%, effectively minimizing false negatives. However, its precision increases only modestly to 76.4%, which indicates a higher false positive rate. The 3D-GS method, integrating geometric constraints and spatial information, improves accuracy from 40.7% to 79.4% and recall to 82.6%. However, its precision decreases to 74.5%, reflecting a relative drop of 8.5% compared to None. The dual-branch method achieves balanced performance, increasing accuracy to 89.0%, recall to 98.4%, and precision to 79.2%, demonstrating robust and adaptable performance in bright environments.
In dark environments, the dual-branch method demonstrates the best performance among all evaluated approaches. It achieves the highest accuracy at 85.4%, significantly outperforming NeRF (80.0%), 3D-GS (79.8%), and the None method (30.2%). Similarly, its recall reaches 85.8%, surpassing NeRF (81.8%), 3D-GS (80.8%), and the None method (43.0%). The F1 score of 76.1 further highlights the dual-branch method’s superior balance between precision and recall compared to NeRF (71.5) and 3D-GS (72.7). These results indicate that the dual-branch approach is the most effective at maintaining robust feature matching under low-light conditions.
Moreover, the dual-branch method exhibits the smallest performance drop due to poor lighting. Compared to its results in bright environments, accuracy decreases by only 3.6% (from 89.0% to 85.4%), and recall drops by a modest 12.6% (from 98.4% to 85.8%). In contrast, NeRF experiences a more significant accuracy drop of 7.7% (from 87.7% to 80.0%) and a recall reduction of 16.8% (from 98.6% to 81.8%). The dual-branch method’s minimal sensitivity to lighting conditions highlights its robustness and adaptability, making it the most reliable choice for feature matching in challenging dark environments.
Figure 17 illustrates the matching performance of generalized descriptors in bright environments, where the number of matching points increases significantly, and the mismatch rate is notably reduced, reflecting the efficiency and accuracy of the generalization method under adequate lighting conditions. Figure 18 shows the matching performance of generalized descriptors in dark environments. Despite insufficient lighting, the generalization method maintains a high number of matching points while significantly reducing the mismatch rate, demonstrating its strong robustness in low-light conditions. These results further validate the effectiveness of descriptor generalization in improving matching performance under varying lighting conditions.
For negative samples, the dual-branch generalization method demonstrates significant advantages, particularly in dark environments. This approach effectively increases recall under varying viewpoints and lighting conditions, significantly reducing missed matches. In dark scenarios, the dual-branch method leverages the visual fidelity of NeRF and the rapid processing capabilities of the Gaussian splatting model to enhance feature stability and adaptability. This results in superior matching robustness across different viewpoints, ensuring reliable performance even in challenging conditions such as low lighting and multi-view variations.

4.3.2. Relative Pose Estimation

The evaluation results of AUC-RANSAC and AUC-LO-RANSAC demonstrate the superior performance of the dual-branch generalization method, particularly in ensuring high robustness under angular variations. For AUC-RANSAC, dual-branch achieves the highest scores at all angular thresholds (5°, 10°, and 20°), significantly outperforming NeRF and 3D Gaussian splatting (3D-GS), especially at larger angular changes. With scores of 65.8% and 75.1% at 10° and 20°, it showcases its ability to enhance descriptor stability and maintain accuracy across diverse viewing angles. By combining visual realism and structural efficiency, the dual-branch method ensures reliable feature matching, even under challenging conditions involving significant angular transformations.
Under AUC-LO-RANSAC shown in Table 3, dual-branch further establishes its superiority, achieving the highest AUC values of 67.3%, 79.8%, and 88.5% at 5°, 10°, and 20°, respectively. These results reflect its exceptional robustness in handling varying angular conditions while maintaining accuracy. Unlike NeRF, which focuses on visual generalization, or 3D-GS, which excels in structural modeling, the dual-branch method integrates the strengths of both to deliver robust and accurate pose estimation. Its ability to adapt effectively to both small and large angular variations ensures unmatched descriptor generalization, making it the most reliable and versatile method for scenarios requiring high angular robustness.

4.3.3. Generalization Method Effectiveness in SLAM

To evaluate the localization performance of the acoustic–optic dual-branch descriptor generalization method in a SLAM system, it was integrated with the front end of the proposed method and the back end of ORB-SLAM3. This SLAM framework, incorporating the proposed acoustic–optic feature generalization and matching methods, was named OAF-SLAM.
Figure 19 shows the localization results of the dual-branch generalization method, where the orange trajectory represents the localization result of the proposed dual-branch generalization method, and the blue curve represents the localization result of the OAF-IPD with non-generalized descriptors. The blue point indicates the starting point of the trajectory, and the green point marks the endpoint. Figure 20 presents a local zoomed-in view of the trajectory, where the shaded area corresponding to the trajectory color indicates the localization error. It can be observed that, compared to the OAF-IPD descriptors, the descriptors generated by the proposed dual-branch method exhibit superior localization accuracy.
Figure 21 illustrates the trajectory continuity performance of the two SLAM methods across 10 experiments. In the first five experiments under sufficient lighting, ORB-SLAM3 successfully achieved 100% tracking and localization. However, when relying solely on the vehicle’s onboard lighting, ORB-SLAM3’s trajectory continuity was significantly affected. In the sixth experiment, ORB-SLAM3 experienced six tracking failures, covering only 60% of the total distance and failing to effectively relocalize. In the seventh experiment, it encountered two tracking failures, completing only 13% of the trajectory. In the eighth to tenth experiments, ORB-SLAM3 failed to initialize and was unable to begin localization.
In comparison, the OAF-SLAM method demonstrated excellent performance under favorable lighting conditions, successfully achieving full trajectory localization in the first five experiments. In subsequent trials with poor lighting conditions, OAF-SLAM leveraged the fusion of acoustic and optical sensors to maintain high localization continuity even in challenging environments.
In the sixth, seventh, and eighth experiments, where ORB-SLAM3 struggled to complete localization, OAF-SLAM achieved 100% trajectory localization. In the ninth and tenth experiments, which involved longer trajectories, OAF-SLAM achieved 90% and 95% localization, respectively. While both experiments encountered two tracking failures, the system successfully relocalized and continued localization for the remainder of the trajectories.

4.3.4. Process Speed

For the processing-speed tests, ORB-SLAM3 was evaluated on data from the bright environment, because ORB features detect too few interest points in dark scenes; this result served as a benchmark for assessing the additional time required for descriptor generalization and for checking whether the optimization process took excessively long. The SLAM systems that employ the OAF-IPD interest point detection method, in contrast, were evaluated on data collected in dark scenes, relying solely on the vehicle’s own light source for illumination, which reflects real-world conditions for close-proximity exploration of hydrothermal areas.
Table 4 presents the average time consumed by various steps in the SLAM process, where tracking includes feature matching, pose optimization, and keyframe detection, which are significantly influenced by descriptor dimensionality. The time used for mapping in each method is primarily determined by the size of the map, with methods showing similar performance in the same scene. In comparing the time required for feature generalization, it is evident that the NeRF generalization method takes excessively long, making it unsuitable for real-time SLAM operations. On the other hand, the 3D-GS method is much faster, with only a 50.49% increase in processing time compared to ORB-SLAM3. The dual-branch method sacrifices computation speed to ensure descriptor accuracy, leading to a 75.59% increase in processing time compared to ORB-SLAM3. However, for close-proximity detection tasks in hydrothermal areas with lower navigation speeds, the entire system still operates with processing times under 450 ms, which is sufficient for real-time localization and mapping tasks.

5. Conclusions

This paper proposes a dual-branch generalization method for acoustic–optic features that expands the descriptor from 256 to 515 dimensions. The expanded descriptor incorporates the generalized model’s density and color distributions under different viewing angles, enabling robust rematching across varying viewpoints and lighting conditions. Two generalization models were implemented, one based on neural radiance fields (NeRF) and the other on 3D Gaussian splatting, and their generalization performance was compared using corrected supervision models. By combining the visual realism of NeRF with the rapid processing of 3D Gaussian splatting, a new generalized supervision model was designed: high-quality images generated by NeRF serve as the foundation and are further refined through Gaussian-based optimization and noise handling. This combination achieves stable and robust feature matching across diverse viewpoints and lighting conditions.
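As defined in Table 1, the 515-dimensional descriptor is the concatenation of the 256-D original OAF-IPD descriptor, the 256-D generalized feature, and three scalar fields (feature type, matching type, matching score). The following is a minimal sketch of that packing, using the field names of Table 1; the helper is illustrative rather than the authors’ implementation:

```python
import numpy as np

def pack_generalized_descriptor(d, d_gen, feature_type, match_type, match_score):
    """Assemble the 515-D descriptor laid out in Table 1.

    d, d_gen:     (256,) original and generalized feature vectors (float).
    feature_type: int flag c   (dimension 513).
    match_type:   int flag m_c (dimension 514).
    match_score:  float m_s    (dimension 515).
    """
    d = np.asarray(d, dtype=np.float32)
    d_gen = np.asarray(d_gen, dtype=np.float32)
    assert d.shape == (256,) and d_gen.shape == (256,)
    # The two int flags are stored as floats so the result is one homogeneous array.
    tail = np.array([feature_type, match_type, match_score], dtype=np.float32)
    return np.concatenate([d, d_gen, tail])   # shape (515,)
```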
After generalization, the dual-branch approach shows significant improvements: recall under bright conditions rises by 53.5 percentage points (from 44.9% to 98.4%), and negative samples that OAF-IPD alone could not identify are successfully eliminated. These results highlight the method’s robustness and adaptability in hydrothermal environments, and the improved continuity of SLAM localization demonstrates the multi-view matching robustness of the generalized features.
This work also has shortcomings that will be addressed in future studies. The method’s computational overhead, particularly for dense matching and NeRF-based generalization, may affect real-time SLAM performance, so future work should focus on runtime optimization. The approach also relies on high-quality, diverse training data; limited datasets, especially for variable underwater conditions, may reduce generalization effectiveness, highlighting the need for broader datasets or stronger augmentation techniques.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Y.L. and G.C.; validation, Y.X.; writing—original draft preparation, Y.L. and G.C.; writing—review and editing, Z.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 51979058).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The dataset collected by the authors, which includes image and sonar data, is available upon request from the corresponding author (liuyihuiheu@hrbeu.edu.cn). Please ensure appropriate attribution when redistributing or referencing these data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Humphris, S.E.; Tivey, M.K.; Tivey, M.A. The Trans-Atlantic Geotraverse hydrothermal field: A hydrothermal system on an active detachment fault. Deep Sea Res. Part II 2015, 121, 8–16. [Google Scholar] [CrossRef]
  2. Yang, K.; Scott, S.D. Possible contribution of a metal-rich magmatic fluid to a sea-floor hydrothermal system. Nature 1996, 383, 420–423. [Google Scholar] [CrossRef]
  3. Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 298–304. [Google Scholar]
  4. Germain, H.; Lepetit, V.; Bourmaud, G. Neural Reprojection Error: Merging feature learning and camera pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  5. Sarlin, P.-E.; Unagar, A.; Larsson, M.; Germain, H.; Toft, C.; Larsson, V.; Pollefeys, M.; Lepetit, V.; Hammarstrand, L.; Kahl, F.; et al. Back to the Feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  6. Zhou, Q.; Sattler, T.; Leal-Taixé, L. Patch2Pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  7. Quattrini Li, A.; Coskun, A.; Doherty, S.M.; Ghasemlou, S.; Jagtap, A.S.; Modasshir, M.; Rahman, S.; Singh, A.; Xanthidis, M.; O’Kane, J.M.; et al. Experimental comparison of open source vision-based state estimation algorithms. In Proceedings of the International Symposium on Experimental Robotics (ISER), Tokyo, Japan, 3–8 October 2016. [Google Scholar]
  8. Joshi, B.; Rahman, S.; Kalaitzakis, M.; Cain, B.; Johnson, J.; Xanthidis, M.; Karapetyan, N.; Hernandez, A.; Li, A.Q.; Vitzilaios, N.; et al. Experimental comparison of open source visual-inertial based state estimation algorithms in the underwater domain. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 7227–7233. [Google Scholar]
  9. Joe, H.; Cho, H.; Sung, M.; Kim, J.; Yu, S.-C. Sensor fusion of two sonar devices for underwater 3D mapping with an AUV. Auton. Robot. 2021, 45, 543–560. [Google Scholar] [CrossRef]
  10. Hu, C.; Zhu, S.; Liang, Y.; Mu, Z.; Song, W. Visual-pressure fusion for underwater robot localization with online initialization. IEEE Robot. Autom. Lett. 2021, 6, 8426–8433. [Google Scholar] [CrossRef]
  11. Rahman, S.; Quattrini Li, A.; Rekleitis, I. SVIn2: A multi-sensor fusion-based underwater SLAM system. Int. J. Robot. Res. 2022, 41, 1022–1042. [Google Scholar] [CrossRef]
  12. Lowe, D.G. Distinctive image features from Scale-Invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  13. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  14. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  15. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  16. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2016; pp. 467–483. [Google Scholar]
  17. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 224–236. [Google Scholar]
  18. Christiansen, P.H.; Kragh, M.F.; Brodskiy, Y.; Karstoft, H. UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  19. Liu, Y.; Xu, Y.; Zhang, Z.; Wan, L. Unsupervised Learning-Based Optical–Acoustic Fusion Interest Point Detector for AUV Near-Field Exploration of Hydrothermal Areas. J. Mar. Sci. Eng. 2024, 12, 1406. [Google Scholar] [CrossRef]
  20. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  21. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for real-time radiance field rendering. ACM Trans. Graph. (TOG) 2023, 42, 139:1–139:14. [Google Scholar] [CrossRef]
  22. Suárez, I.; Sfeir, G.; Buenaposada, J.M.; Baumela, L. BEBLID: Boosted Efficient Binary Local Image Descriptor. Pattern Recognit. Lett. 2020, 133, 366–372. [Google Scholar] [CrossRef]
  23. Humenberger, M.; Zaffaroni, P.; Lienhart, R. FeatureBooster: Boosting Feature Descriptors for Robust Matching. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  24. Li, K.; Wang, L.; Liu, L.; Ran, Q.; Xu, K.; Guo, Y. Decoupling Makes Weakly Supervised Local Feature Better. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  25. Huang, Y.; You, S. You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  26. Zhang, G.; Du, C.; Sun, Y.; Xu, H.; Qin, H.; Huang, H. UUV Trajectory Tracking Control Based on ADRC. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, Qingdao, China, 3–7 December 2016. [Google Scholar]
  27. Guo, Z.; Yang, X.; Gao, J.; Yan, J.; Luo, X. Velocity Observer-based Tracking Control of Autonomous Underwater Vehicle with Communication Delay. In Proceedings of the International Symposium on Autonomous Systems, Shanghai, China, 29–31 May 2019. [Google Scholar]
  28. Larsson, V. PoseLib—Minimal Solvers for Camera Pose Estimation. 2020. Available online: https://github.com/vlarsson/poseLib (accessed on 31 January 2020).
Figure 1. Dual-branch generalization model based on OAF-IPD. The blue boxes represent the OAF-IPD and its output, while the red boxes highlight the research focus of this paper. The inputs for feature generalization include the image and the interest point locations provided by the OAF-IPD. The generalized features generated through this process are then combined with the descriptors from the OAF-IPD output to produce the final generalized descriptors.
Figure 2. OAF-IPD takes an image and sonar data as input, and outputs an interest point vector. Each interest point m is described by a score, a depth, a position, and a descriptor. “2*Conv-32” represents two convolutional layers, each with 32 channels. The blue color in the figure represents the camera data processing module, purple indicates the sonar data processing module, red shows the feature fusion module, and green denotes the output of OAF-IPD (these color conventions are used consistently in subsequent figures).
Figure 3. Composition of the fusion feature descriptor.
Figure 4. An overview of NeRF scene representation and differentiable rendering procedure.
Figure 5. NeRF feature generalization.
Figure 6. Dual-branch generalization model.
Figure 7. 3D Gaussian splatting optimization flowchart.
Figure 8. Basin test acquisition image and hydrothermal zone image.
Figure 9. Comprehensive test basin.
Figure 10. Hydrothermal scene layout location.
Figure 11. Acoustic–optic combined data acquisition system.
Figure 12. IP-MSC3105 underwater camera.
Figure 13. M900-Mk2 imaging sonar.
Figure 14. Experiments in the comprehensive test basin.
Figure 15. 3D reconstruction of the hydrothermal area scene.
Figure 16. 3D reconstruction of the hydrothermal vent model.
Figure 17. Generalization effects in bright environments.
Figure 18. Generalization effects in dark environments.
Figure 19. Localization results of SLAM system using dual-branch feature generalization.
Figure 20. Local zoomed-in view of the trajectory.
Figure 21. Localization continuity.
Table 1. Data types of fusion feature descriptors.
Dimension | Sign | Definition | Data Type
1–256 | d | Original feature | float
257–512 | d~ | Generalized feature | float
513 | c | Feature type | int
514 | m_c | Matching type | int
515 | m_s | Matching score | float
Table 2. Comparison of average metrics for descriptor generalization.
Methods | Bright Environment (A / P / R / F1) | Dark Environment (A / P / R / F1)
None | 40.7 / 81.4 / 44.9 / 57.9 | 30.2 / 75.4 / 43.0 / 54.6
NeRF | 87.7 / 76.4 / 98.6 / 86.1 | 80.0 / 63.4 / 81.8 / 71.5
3D-GS | 79.4 / 74.5 / 82.6 / 78.3 | 79.8 / 66.0 / 80.8 / 72.7
Dual-branch | 89.0 / 79.2 / 98.4 / 87.8 | 85.4 / 68.4 / 85.8 / 76.1
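As a quick consistency check on Table 2 (reading the four columns per environment as an accuracy-type score A, precision P, recall R, and F1), the bright-environment dual-branch row satisfies F1 = 2PR/(P + R) = 2 × 79.2 × 98.4/(79.2 + 98.4) ≈ 87.8, matching the tabulated value.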
Table 3. Relative pose estimation.
Generalization | AUC-RANSAC (5° / 10° / 20°) | AUC-LO-RANSAC (5° / 10° / 20°)
None | 37.2 / 54.3 / 59.4 | 49.7 / 61.3 / 63.4
NeRF | 49.7 / 63.2 / 72.9 | 68.1 / 76.4 / 85.3
3D-GS | 46.5 / 59.4 / 74.2 | 60.4 / 70.4 / 85.1
Dual-branch | 49.2 / 65.8 / 75.1 | 67.3 / 79.8 / 88.5
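Table 3 reports the area under the cumulative pose-accuracy curve (AUC) at several angular error thresholds, once with plain RANSAC and once with LO-RANSAC as the estimator. The sketch below follows the commonly used recipe for computing such an AUC from per-image-pair pose errors; the exact error definition used by the authors is not reproduced here, and the function is illustrative:

```python
import numpy as np

def pose_auc(errors_deg, thresholds_deg=(5.0, 10.0, 20.0)):
    """AUC of the cumulative error curve at each threshold (failed pairs -> np.inf)."""
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds_deg:
        last = int(np.searchsorted(errors, t))
        e = np.concatenate((errors[:last], [t]))
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        # Trapezoidal integration of recall over [0, t], normalized by t.
        aucs.append(float(np.sum(np.diff(e) * (r[:-1] + r[1:]) / 2.0) / t))
    return aucs
```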
Table 4. Running time of main parts in SLAM.
System | ORB | OAF | OAF | OAF | OAF
Method | -- | -- | NeRF | 3D-GS | Dual-Branch
Descriptor size | 128 | 256 | 517 | 517 | 517
Detector time (ms/frame) | 16.07 | 21.77 | 21.77 | 21.77 | 21.77
Generalization time (ms/frame) | -- | -- | 2290.3 | 14.78 | 75.23
Tracking time (ms/frame) | 8.78 | 20.86 | 70.74 | 73.67 | 71.79
Mapping time (ms/frame) | 230.54 | 280.41 | 259.63 | 284.12 | 280.17
Total time (ms/frame) | 255.39 | 323.04 | 2642.45 | 384.34 | 448.96
Additional time (%) | 0 | 26.49 | 934.67 | 50.49 | 75.59