A Study of the Three-Dimensional Localization of an Underwater Glider Hull Using a Hierarchical Convolutional Neural Network Vision Encoder and a Variable Mixture-of-Experts Transformer

Lee, Jungwoo; Park, Ji-Hyun; Hwang, Jeong-Hwan; Noh, Kyoungseok; Suh, Jinho

doi:10.3390/rs18050793


by Jungwoo Lee ¹, Ji-Hyun Park ¹, Jeong-Hwan Hwang ¹, Kyoungseok Noh ¹ and Jinho Suh ²,*

¹ Smart Mobility Research Center, Korea Institute of Robotics and Technology Convergence (KIRO), Pohang 37666, Republic of Korea

² Major of Mechanical System Engineering, Pukyong National University, Busan 48513, Republic of Korea

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 793; https://doi.org/10.3390/rs18050793

Submission received: 14 January 2026 / Revised: 27 February 2026 / Accepted: 4 March 2026 / Published: 5 March 2026

(This article belongs to the Special Issue Multi-Source Data Fusion and Feature Extraction for Underwater Target Detection)


Highlights

What are the main findings?

Robust underwater glider hull detection and 3D localization are enabled by a hierarchical convolutional neural network (CNN) vision encoder with a variable mixture-of-experts (vMoE) transformer.
Camera–sonar multi-sensor fusion significantly improves depth estimation and enables successful autonomous glider recovery in real sea conditions.

What are the implications of the main findings?

An adaptive, capacity-aware model architecture enables effective perception in challenging underwater environments.
The findings demonstrate that it is possible to recover underwater gliders in a practical, autonomous, and markerless manner, thereby making the process more efficient and reliable.

Abstract

Although underwater gliders are highly energy-efficient platforms capable of long-duration and large-scale ocean observation, their lack of self-propulsion requires external assistance for recovery upon mission completion. In harsh and dynamic marine environments, reliably detecting the glider and accurately estimating its three-dimensional position are critical to ensuring the recovery operations are safe and efficient. This paper proposes a perception framework based on deep learning to detect underwater glider hulls and estimate their three-dimensional relative positions using camera–sonar multi-sensor fusion. This approach integrates a hierarchical convolutional neural network (CNN) vision encoder and a transformer-based architecture to estimate the glider’s spatial location and heading direction simultaneously. The hierarchical CNN encoder extracts multi-level, semantically rich visual features, thereby improving robustness to visual degradation and environmental disturbances common in underwater settings. Additionally, the transformer incorporates a variable mixture-of-experts (vMoE) mechanism that adaptively allocates expert networks across layers, enhancing representational capacity while maintaining computational efficiency. The resulting pose estimates enable precise, collision-free ROV navigation for automated recovery and onboard sensor inspection tasks. Experimental results, including ablation studies, validate the effectiveness of the proposed components and demonstrate their contributions to accurate glider hull detection and three-dimensional localization. Overall, the proposed framework provides a scalable, reliable perception solution that allows for the safe, autonomous recovery of underwater gliders with an ROV in realistic ocean environments.

Keywords:

underwater glider recovery; multi-sensor fusion 3D localization; hierarchical CNN feature extraction; mixture-of-experts transformer; ROV-assisted autonomous operation

1. Introduction

The oceans cover a substantial portion of the Earth’s surface and constitute a vast source of natural resources and scientific information. As economic development and marine technology advance, interest in ocean exploration and long-term environmental observation has grown significantly [1]. Under these conditions, unmanned underwater vehicles (UUVs) have emerged as essential platforms for operating in environments that are inaccessible or hazardous to humans [2]. Among them, autonomous underwater vehicles (AUVs), remotely operated vehicles (ROVs), and underwater gliders are widely employed for oceanographic sensing and data acquisition. Underwater gliders are recognized as highly energy-efficient unmanned systems, as they achieve propulsion through active buoyancy modulation that alternates between positive and negative lift [3]. This mechanism enables extremely low power consumption, long operational endurance, and extended travel range, allowing gliders to perform missions spanning from several hours to several months and to traverse distances on the order of thousands of kilometers [4].

These gliders have many advantages, but they lack self-propulsion, which makes it impractical for them to return to a recovery vessel on their own after extended missions. Consequently, gliders often drift on the surface of the sea and must be retrieved by external efforts. Additionally, circumstances that deviate from nominal conditions, such as hardware failure or a deviation from the planned trajectory, necessitate rescue and recovery operations [5]. Recovering gliders in real ocean environments poses significant challenges and entails considerable risk due to wave-induced motion, poor visibility, adverse weather conditions, and the potential for collision with vessels or recovery equipment. These factors increase the likelihood of damage to the glider hull and onboard sensors and place significant reliance on operator experience in conventional manual recovery procedures. Therefore, there is a demand for automated recovery methodologies that are safe and efficient and that improve operational reliability while minimizing human risk.

The hull of an underwater glider constitutes the central structural component, serving as the housing for the primary control electronics and mission-critical oceanographic sensors. It typically incorporates a recovery ring or attachment point utilized during retrieval operations. It is imperative that remotely operated vehicles deployed for inspection or recovery approach the glider with high positional accuracy. This ensures that the sensors can be assessed or the recovery mechanism engaged without inducing a collision [6]. Achieving such safe and precise interaction necessitates reliable detection of the glider in dynamic surface or subsurface environments, as well as accurate estimation of the three-dimensional position and pose of the hull. Robust underwater glider hull detection is a foundational enabling technology for automated inspection and recovery systems, playing a pivotal role in enhancing the safety, efficiency, and sustainability of long-term glider-based ocean observation missions.

This study presents a deep learning framework that can detect underwater gliders and estimate their three-dimensional positions relative to ROVs in underwater environments. This approach addresses the need for reliable perception and localization capabilities in automated glider recovery scenarios. The framework uses a convolutional neural network (CNN)-based vision encoder and a transformer module to estimate the three-dimensional position of the glider hull and its heading direction. It takes underwater color images captured by an ROV-mounted camera and sonar sensor data as its inputs. The resulting spatial and directional information guides the ROV’s collision-free motion, enabling precise approach maneuvers for recovery ring engagement or onboard sensor inspection tasks.

The main contributions of this study are summarized as follows:

We present a vision encoder based on a hierarchical convolutional neural network. This encoder can robustly localize a glider within an input image and extract visual features associated with its scale and spatial extent. Unlike conventional flat feature extraction schemes, the proposed encoder generates multi-level, semantically enriched feature representations at different depths. Subsequent transformer layers then use these representations to improve perception accuracy.
We propose a transformer architecture that incorporates a variable mixture-of-experts (vMoE) mechanism to improve representational capacity and computational efficiency. The vision encoder’s hierarchical feature vectors are processed through multiple transformer encoder blocks that replace standard feed-forward networks with vMoE modules. This approach effectively compresses knowledge while enabling efficient training and inference by adaptively varying the number of experts across layers. In the early stages, a larger set of low-dimensional experts is used; in the deeper layers, a smaller set of high-dimensional experts is employed. These contributions together provide a robust, scalable solution for underwater glider hull detection and 3D localization, facilitating safe, autonomous, ROV-assisted recovery operations.
We propose a multi-sensor fusion approach that combines high-resolution color camera data from an underwater environment with low-resolution, long-range sonar measurements. This approach improves glider detection and three-dimensional position estimation. The combination of the camera’s visual detail with the sonar’s wide-area sensing improves the robustness and localization accuracy of the proposed method in challenging underwater conditions. This enables reliable detection and localization in situations where single-sensor perception is often unreliable.

2. Related Works

2.1. Underwater Glider Recovery and Underwater Approaching Guidance

Although underwater gliders are highly energy efficient and capable of long-range operations for extended missions, they generally lack the ability to autonomously return to a recovery vessel upon mission completion. This makes retrieval by external platforms unavoidable. In real marine environments, glider recovery is challenging and risky due to wave-induced motion, reduced visibility, adverse weather conditions, and collision risks. Consequently, considerable research efforts have been devoted to developing automated approaches for safely and reliably recovering gliders. For example, ROV-based glider recovery systems have been proposed, in which the retrieval process is divided into three stages: approach, capture, and lifting. The feasibility of automation is demonstrated by incorporating control strategies that account for ocean disturbances, vehicle propulsion, and dynamic characteristics [7].

Glider recovery is closely related to the more general task of precisely approaching and docking with underwater targets. In this task, detecting docking stations or visual markers and accurately estimating relative pose are crucial for final-stage performance. Recent studies have proposed vision-based underwater docking guidance frameworks that can detect docking markers in real time and estimate relative position and orientation using multiple visual cues, even under complex underwater visual conditions. These approaches have been shown to improve the stability and reliability of close-proximity interactions, including mechanical coupling during recovery operations [8]. In this context, the three-dimensional hull localization problem addressed in this study can be considered an extension of perception-based guidance for close interaction. The difficulty is increased by the fact that the target is not a fixed docking station, but rather a freely drifting or maneuvering underwater glider hull.

In the field of underwater robotics, multi-sensor fusion requires complementary sensing and reliable spatiotemporal alignment between different types of sensors. In practice, camera–sonar systems often experience timestamp discrepancies, sensor latency, and motion-induced misalignment. These issues can significantly degrade fusion quality in dynamic marine environments. Several studies have addressed these issues by employing time synchronization protocols, motion-compensated alignment with inertial sensing, and calibration-based registration approaches.

2.2. Vision-Based Underwater Object Detection and 3D Localization

Robust visual perception in underwater environments remains a challenging problem due to light absorption, scattering, color distortion, turbidity, and dynamic illumination conditions. These factors significantly degrade image quality, making reliable object detection and localization difficult. Consequently, a substantial body of research has focused on underwater visual perception, particularly on object detection using deep learning techniques. Several underwater datasets have been introduced to facilitate benchmarking and performance comparisons. The DUO dataset, for example, aggregates and re-annotates multiple underwater object detection datasets, providing a unified benchmark for evaluating detection algorithms in different underwater settings [9]. Building on these datasets, recent studies have adapted state-of-the-art object detectors, including YOLO-based architectures, for underwater scenarios. These adaptations demonstrate improved accuracy and real-time performance in detecting marine organisms and man-made objects [10].

While these approaches have achieved promising results in 2D object detection, most focus primarily on bounding box estimation in image space. However, for robotic manipulation and recovery tasks, such as ROV-assisted glider retrieval, two-dimensional detection alone is insufficient. Instead, the three-dimensional relative position and orientation of the target object must be accurately estimated to ensure collision-free, precise interaction. In this regard, underwater docking and station approach problems are closely related areas of research. Vision-based underwater docking systems have employed artificial visual markers and deep learning-based detectors to estimate relative pose between an AUV/ROV and a docking station, enabling autonomous guidance and alignment. Although effective, these methods often rely on predefined markers or structured docking interfaces, which limit their applicability to unstructured or marker-less targets such as free-floating underwater gliders.

Although monocular vision-based localization methods provide high-resolution appearance cues, their depth estimation performance is poor in low-visibility underwater conditions. Acoustic sensing, such as multibeam sonar, provides longer-range measurements, but it lacks fine spatial resolution. Therefore, multi-modal fusion frameworks have been increasingly explored to compensate for the weaknesses of each modality and improve performance in challenging underwater conditions.

2.3. Multi-Scale Feature Representation for Robust Underwater Perception

Accurate 3D localization from images strongly depends on the quality of extracted visual features, especially in environments with scale variation, partial occlusion, and motion-induced blur. Multi-scale and hierarchical feature representations have been shown to effectively address these challenges. Feature Pyramid Networks (FPNs) exploit the hierarchical nature of convolutional neural networks to generate multi-resolution feature maps, improving robustness to object scale variation and enhancing localization accuracy [11]. These hierarchical representations are particularly valuable in underwater settings, where the apparent size and visibility of objects can vary significantly due to distance and water conditions.

Recent 3D localization approaches in terrestrial robotics have produced impressive results with multi-view stereo, depth sensors, and large-scale pretraining. However, many of these methods rely on sensing assumptions that are difficult to satisfy underwater. These assumptions include stable lighting, a long visual range, and dense texture patterns. In contrast, underwater localization tasks are often limited by severe scattering, turbidity, and restricted camera visibility. Thus, perception architectures that are specifically designed for low-visibility conditions and heterogeneous sensor fusion are needed.

2.4. Transformer-Based Perception and Mixture-of-Experts Architecture

Transformer-based architectures have emerged as powerful models for visual perception due to their ability to capture long-range dependencies and global contextual relationships. The Vision Transformer (ViT) demonstrated that self-attention mechanisms can effectively replace convolutional inductive biases when learning image representations, achieving competitive performance on large-scale vision benchmarks [12]. Building on this paradigm, DETR introduced transformers into object detection, formulating detection as a direct set prediction problem and enabling end-to-end training without hand-crafted post-processing steps such as non-maximum suppression [13]. To address the scalability and efficiency limitations of global attention, hierarchical transformer designs have been proposed. The Swin Transformer, for example, incorporates shifted window-based self-attention and a hierarchical feature structure, enabling the efficient processing of high-resolution images while preserving multi-scale representations [14].

Despite their strong representational capacity, transformer-based models are often computationally demanding, which motivates the exploration of more efficient architectures. One promising approach is the mixture-of-experts (MoE) paradigm, which originates from the seminal work on adaptive mixtures of local experts [15]. Early MoE models employed a supervised “divide-and-conquer” strategy in which multiple expert networks specialized in different regions of the input space, while a gating network dynamically determined the contribution of each expert. Subsequent research extended this concept by integrating MoE modules as components within deep neural networks, enabling experts to function as layers rather than standalone models and allowing large-capacity networks to be trained efficiently [16]. In parallel, researchers introduced the concept of conditional computation, in which a subset of network components is activated based on the input. This reduces unnecessary computation and improves scalability [17].
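For reference, the gating idea described above can be written compactly. The following is a generic sketch of a softmax-gated mixture of $N$ experts; the symbols $E_{i}$, $g_{i}$, and $w_{i}$ are introduced here for illustration only and are not taken from the cited works.

```latex
% Output of a softmax-gated mixture of N experts for input x.
% E_i : i-th expert network, g_i : gating weight, w_i : gating parameters (illustrative notation).
y(x) = \sum_{i=1}^{N} g_{i}(x)\, E_{i}(x),
\qquad
g_{i}(x) = \frac{\exp\left(w_{i}^{\top} x\right)}{\sum_{j=1}^{N} \exp\left(w_{j}^{\top} x\right)}
```

Because the gating weights are non-negative and sum to one, each expert contributes in proportion to how strongly the gate associates it with the input. Conditional computation corresponds to evaluating only the experts whose gating weights are non-negligible.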

These ideas have led to the development of large-scale, sparse MoE models, especially in natural language processing. For example, Shazeer et al. demonstrated that sparsely gated MoE layers could scale neural networks to hundreds of billions of parameters while maintaining manageable inference costs. However, this approach presented challenges related to communication overhead and training stability [18]. More recently, the Switch Transformer simplified expert routing by activating only one expert per token. This significantly improved training stability and computational efficiency and established MoE as an effective mechanism for balancing model capacity and efficiency [19]. Inspired by these advances, MoE-based designs have been introduced to vision transformers. For instance, M3ViT integrates MoE modules into vision transformer architectures, enabling efficient multi-task learning through adaptive expert allocation [20].
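The single-expert routing idea can be illustrated with a minimal sketch. The code below is a toy, standard-library-only illustration of top-1 routing, not the Switch Transformer implementation; the function names, the linear experts, and all dimensions are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top1_route(token, router_weights, experts):
    """Send one token to its single highest-scoring expert.

    The expert output is scaled by the router probability, which is what
    lets the router receive a learning signal in a trainable setting.
    """
    probs = softmax(matvec(router_weights, token))
    chosen = max(range(len(probs)), key=probs.__getitem__)
    output = [probs[chosen] * v for v in experts[chosen](token)]
    return chosen, output


# Toy setup: 4-dimensional tokens and 3 linear experts (illustrative values only).
rng = random.Random(0)
dim, num_experts = 4, 3
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
expert_weights = [
    [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_experts)
]
experts = [lambda x, w=w: matvec(w, x) for w in expert_weights]

token = [0.5, -1.0, 0.25, 2.0]
chosen, output = top1_route(token, router_weights, experts)
print(f"token routed to expert {chosen}; output has {len(output)} values")
```

Only one expert is evaluated per token, so the compute per token stays roughly constant as the number of experts grows. Load balancing across experts, which practical systems also need, is omitted here.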

However, the application of transformer-based MoE architectures to underwater robotic perception remains largely unexplored. These environments present unique challenges, such as severe visual degradation, scale variation, and task-specific perception requirements. These challenges make efficient, specialized feature processing particularly important. Unlike existing approaches, this work integrates hierarchical convolutional neural network (CNN)-based feature extraction with a transformer architecture augmented by a variable mixture-of-experts (vMoE) mechanism [21]. The proposed method effectively compresses knowledge while preserving representational capacity by adaptively varying the number and dimensionality of experts across transformer layers. This makes the method well suited to three-dimensional relative localization via multi-sensor fusion in ROV-assisted autonomous recovery scenarios.
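To make the idea of varying the number and dimensionality of experts across layers concrete, the following sketch builds a hypothetical per-layer schedule and counts the expert parameters in each layer. The linear interpolation rule, the class and function names, and every number (model width 256, four layers, 8 experts of width 128 down to 2 experts of width 512) are illustrative assumptions and are not the configuration used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertLayerConfig:
    num_experts: int  # experts available in this layer's MoE block
    hidden_dim: int   # hidden width of each expert's feed-forward network


def expert_ffn_params(model_dim, cfg):
    """Weights and biases of all experts in one layer.

    Each expert is assumed to be a two-layer MLP:
    model_dim -> hidden_dim -> model_dim.
    """
    first = model_dim * cfg.hidden_dim + cfg.hidden_dim
    second = cfg.hidden_dim * model_dim + model_dim
    return cfg.num_experts * (first + second)


def variable_schedule(num_layers, first, last):
    """Interpolate linearly from `first` (many narrow experts) to `last` (few wide experts)."""
    schedule = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        schedule.append(ExpertLayerConfig(
            num_experts=round(first.num_experts + t * (last.num_experts - first.num_experts)),
            hidden_dim=round(first.hidden_dim + t * (last.hidden_dim - first.hidden_dim)),
        ))
    return schedule


# Hypothetical values chosen only to show the trend.
model_dim = 256
schedule = variable_schedule(
    num_layers=4,
    first=ExpertLayerConfig(num_experts=8, hidden_dim=128),
    last=ExpertLayerConfig(num_experts=2, hidden_dim=512),
)
for index, cfg in enumerate(schedule):
    params = expert_ffn_params(model_dim, cfg)
    print(f"layer {index}: {cfg.num_experts} experts x {cfg.hidden_dim} hidden -> {params} parameters")
```

With these particular numbers, the total expert parameter count stays in a similar range across layers while the balance shifts from many narrow experts to a few wide ones, which is one way to read the trade-off between early-stage diversity and deeper-stage refinement.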

3. Methodology

To clarify this work’s novel methodology, we emphasize that the proposed framework targets the markerless, three-dimensional relative localization of an underwater glider hull for perception-guided remotely operated vehicle (ROV) interaction. This differs from conventional 2D underwater detection or marker-based docking guidance. Current underwater perception pipelines typically either rely on single-modality vision in low-visibility conditions or employ multimodal fusion that assumes consistent cross-sensor correspondence. Many underwater docking and approach systems use dedicated markers or structured docking interfaces to achieve stable guidance. However, these assumptions do not apply to practical glider recovery because the target is an unstructured hull subject to rapid environmental disturbances and appearance degradation. Accordingly, the proposed methodology is guided by two practical considerations: (i) exploiting the complementary acoustic structure from the multibeam sonar without requiring strict pixel-level alignment between modalities, and (ii) efficiently allocating model capacity for real-time robotic operation by emphasizing representational diversity in early stages and semantic refinement in deeper stages.

3.1. Data Acquisition

Sensor data for underwater glider hull detection is collected using a remotely operated vehicle (ROV) equipped with multiple underwater cameras and a sonar sensor. The primary camera mounted on the ROV is an OAK-D Pro PoE camera manufactured by Luxonis in the USA [22]. As shown in Figure 1, two additional GoPro cameras are mounted on either side of the primary camera to enable parallel image acquisition from horizontal viewpoints on the ROV platform. The sonar sensor, which is installed on the lower section of the ROV, is a dual-frequency, multibeam Oculus M750d sonar manufactured by Blueprint Subsea in England [23]. This sensor provides wide-area acoustic perception to complement visual sensing.

The ROV is deployed in both an indoor water tank and an outdoor harbor pier. During indoor tank experiments, an overhead crane suspended from the ceiling of the water tank holds the glider in a floating position. While the glider remains stationary, the ROV approaches it from various angles and distances to acquire visual data.

A recovery ring is mounted under the center of the glider’s hull, and artificial visual markers are attached to the forward and aft sections of the lower hull to facilitate the accurate computation of the relative distance and displacement between the ROV and the glider during data acquisition. However, in practical operational scenarios, underwater gliders are generally not equipped with artificial markers, and the presence of a recovery ring depends on the mission’s specific configuration. To maintain consistency with realistic deployment conditions, image regions corresponding to the artificial markers and the recovery ring are removed from the captured images. This preprocessing strategy prevents the learning model from exploiting artificial visual cues and encourages robust hull detection based on the glider’s intrinsic visual characteristics. Figure 2 shows the preprocessing procedure applied to a raw input image.
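The text does not specify how the marker and recovery-ring regions are removed. One simple possibility, shown here purely as an assumed sketch, is to overwrite annotated rectangular regions with a constant fill value. The function name, the box format, and the toy image are hypothetical.

```python
def mask_regions(image, boxes, fill=0):
    """Return a copy of `image` with each (top, left, bottom, right) box filled.

    `image` is a list of rows, each row a list of pixel values. Box bounds are
    half-open (bottom and right are exclusive) and are clipped to the image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # copy rows so the input is not modified
    for top, left, bottom, right in boxes:
        for r in range(max(0, top), min(height, bottom)):
            for c in range(max(0, left), min(width, right)):
                masked[r][c] = fill
    return masked


# Toy 4 x 5 single-channel image with one hypothetical marker box.
image = [[9] * 5 for _ in range(4)]
marker_boxes = [(1, 1, 3, 4)]
for row in mask_regions(image, marker_boxes):
    print(row)
```

Inpainting or cropping would be alternative ways to remove these regions; the constant fill is used here only because it is the simplest to state.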

Image data acquired in the water tank is collected under highly static conditions where the glider’s motion is minimal and the illumination is relatively uniform. To simulate environmental disturbances encountered in real marine settings, such as intense sunlight and irregular surface waves, artificial lighting is directed toward the glider from multiple locations within the tank. Wave disturbances are generated by agitating the water’s surface with paddles. This experimental setup enables the acquisition of training and validation data that more closely resembles realistic ocean conditions. This improves the robustness of the deep learning model. Figure 3 illustrates the application of artificial lighting and wave disturbances.

During harbor experiments, the glider floated freely on the surface of the sea. When it drifted away from the pier, an operator on a nearby boat approached and towed it back. In real sea environments, the field of view of the underwater camera is severely limited, and image quality degrades rapidly with distance. At distances exceeding approximately one to two meters, the artificial markers become indistinguishable. As a result, acquiring data of sufficient quality for effective model training in open-water conditions is highly challenging. Consequently, as shown in Figure 4, the data collected in real marine environments are used as a test set after training the deep learning model with water tank data.

The water tank dataset comprises approximately 400,000 training samples and approximately 60,000 validation samples. The test set acquired in the ocean environment contains approximately 36,000 samples.

3.2. Construction of Multi-Sensor Training Data

In a preliminary evaluation, we investigated multiple training configurations to assess the impact of multi-camera image inputs on model performance. Specifically, we constructed training datasets by combining color images from the primary camera with images from up to four additional cameras mounted horizontally on the ROV. We then briefly trained a baseline model with these stacked images and compared the resulting performance across configurations. The experimental results indicated that including additional camera images only marginally improved learning accuracy. This limited improvement is likely due to the constrained visibility range in underwater environments and the relatively small physical displacement between the cameras mounted on the ROV platform. This likely restricts the amount of complementary visual information captured. Based on these observations, rather than stacking images from multiple cameras to create a single training sample, we combined multi-camera data with data from other sensors to expand and diversify the training dataset.

Figure 5 illustrates the processing and visualization of data obtained from the multibeam sonar sensor mounted on the ROV. The left panel presents the native sonar visualization, in which acoustic measurements are displayed in a polar coordinate system that corresponds to the sonar’s fan-shaped field of view. This representation reflects the spatial distribution of sonar beams and their respective ranges, providing an understanding of the location of obstacles in relation to the sensor.

The right panel shows a transformed representation of the same sonar measurements, in which the intensity of acoustic echoes reflected from underwater objects across a wide radial range is converted into a grayscale image. Pixel intensity corresponds to the strength of the returned sonar signal; higher intensities indicate stronger reflections. This image-based formulation facilitates integration with a vision-based perception layer.
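The conversion from echo strength to pixel intensity can be sketched as a linear rescaling to 8-bit gray levels. This is a minimal illustration under assumed conventions (a 2D grid of non-negative intensities, linear scaling by the peak value); the actual mapping applied to the sonar data is not specified in the text.

```python
def to_grayscale(intensity, max_value=None):
    """Linearly map non-negative echo intensities to 8-bit gray levels (0 to 255).

    `intensity` is a 2D list of echo strengths. If `max_value` is not given,
    the largest value in the grid is mapped to 255.
    """
    peak = max_value if max_value is not None else max(max(row) for row in intensity)
    if peak <= 0:
        return [[0 for _ in row] for row in intensity]
    return [[min(255, max(0, round(255 * v / peak))) for v in row] for row in intensity]


# Toy 2 x 2 grid of echo strengths (arbitrary units).
echoes = [[0, 1], [5, 2]]
print(to_grayscale(echoes))
```

Stronger reflections map to brighter pixels, which matches the description that higher intensities indicate stronger returns.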

Figure 6 illustrates the procedure for creating multi-sensor training samples by combining visual and acoustic data. The upper-left panel shows a color image captured by the ROV-mounted underwater camera. Although this camera provides high-resolution visual information, its capabilities are limited by underwater visibility conditions. The lower-left panel depicts the intensity map obtained from the multibeam sonar sensor. This map shows the strength of acoustic reflections from underwater objects over a wide radial range.

To leverage the distinct properties of these two sensing modalities, the color image is decomposed into red ($R_{color}$), green ($G_{color}$), and blue ($B_{color}$) channels, while the sonar measurement is depicted as a single-channel intensity image ($I_{sonar}$). As shown on the right side of Figure 6, these four channels are stacked along the channel dimension to create a multi-channel input tensor. This channel-wise stacking strategy allows the learning model to simultaneously process fine-grained visual features from the camera and long-range structural information from the sonar.

In the current experimental setup, sensor synchronization is performed at the software level using the ROV’s onboard shared system clock. Camera images and multibeam sonar intensity maps are recorded using a unified timestamp system, and frame pairing is conducted via nearest-timestamp alignment. Because the platform moves relatively slowly, temporal discrepancies within the synchronization tolerance introduce negligible spatial displacement.
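Nearest-timestamp frame pairing can be sketched as follows. This is a minimal illustration assuming both streams provide sorted timestamps in seconds on the shared clock; the function name and the tolerance value are hypothetical.

```python
from bisect import bisect_left


def pair_by_nearest_timestamp(camera_ts, sonar_ts, tolerance):
    """Pair each camera frame with the sonar frame closest in time.

    Both inputs are sorted lists of timestamps in seconds. Returns
    (camera_index, sonar_index) pairs whose time gap is within `tolerance`;
    camera frames without a close enough sonar frame are skipped.
    """
    pairs = []
    for cam_idx, t in enumerate(camera_ts):
        pos = bisect_left(sonar_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sonar_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda i: abs(sonar_ts[i] - t))
        if abs(sonar_ts[best] - t) <= tolerance:
            pairs.append((cam_idx, best))
    return pairs


camera_ts = [0.00, 0.10, 0.20, 0.50]
sonar_ts = [0.02, 0.11, 0.35]
print(pair_by_nearest_timestamp(camera_ts, sonar_ts, tolerance=0.05))
```

As an illustrative calculation with assumed numbers, a platform moving at 0.2 m/s with a 0.05 s timing mismatch shifts by only 0.2 × 0.05 = 0.01 m between the paired frames, which is consistent with treating the residual displacement as negligible for a slow-moving platform.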

Note that the camera and the multibeam sonar observe different spatial domains and employ different projection models. The camera uses perspective projection, while the sonar uses polar acoustic sampling. In this study, the sonar’s intensity data is reconstructed in polar coordinates and resized to match the spatial resolution of the camera image. Then, the reconstructed sonar map is concatenated with the RGB channels to create a four-channel input tensor.
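The resize-and-concatenate step can be sketched in a few lines. The example below assumes nearest-neighbour interpolation, which the text does not specify, and uses plain nested lists in place of an array library; the function names are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2D list-of-rows image."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def stack_rgb_sonar(rgb, sonar):
    """Build an H x W x 4 tensor: (R, G, B) from the camera plus sonar intensity.

    `rgb` is an H x W grid of (r, g, b) tuples; `sonar` is a 2D intensity grid
    of any size, resized here to the camera resolution before stacking.
    """
    height, width = len(rgb), len(rgb[0])
    sonar_resized = resize_nearest(sonar, height, width)
    return [
        [list(rgb[r][c]) + [sonar_resized[r][c]] for c in range(width)]
        for r in range(height)
    ]


# Toy inputs: a 2 x 4 color image and a 1 x 2 sonar intensity map.
rgb = [[(r, c, 0) for c in range(4)] for r in range(2)]
sonar = [[10, 20]]
fused = stack_rgb_sonar(rgb, sonar)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

No geometric reprojection is performed: the sonar map is only brought to the same grid size so that the four channels can be concatenated, and the spatial relationship between the channels is left for the model to learn.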

However, strict pixel-level geometric alignment between modalities is not assumed. The proposed transformer-based architecture does not interpret the four-channel input as a conventional color image. Instead, each channel is treated as an independent feature source. Through attention mechanisms, the model learns cross-modal relationships, enabling it to capture correlations between sonar structural cues, visual features, and 3D localization outputs without enforcing explicit geometric reprojection between sensing domains.

In practical recovery scenarios, the measurements obtained from cameras and multibeam sonar are affected by platform motion and significant latency. Additionally, the two sensors have different sampling and projection characteristics. Under these conditions, enforcing frame-wise or pixel-wise correspondence can introduce systematic bias when the apparent spatial relationship between the two modalities fluctuates. Therefore, although the sonar intensity map is resized to match the camera resolution when constructing the four-channel input, the proposed early fusion method does not rely on geometric reprojection. Rather, each channel is treated as a heterogeneous feature source, and cross-modal associations are implicitly learned through attention. This enables the model to use sonar-derived structural cues for enhanced depth-sensitive localization without making rigid alignment assumptions.

3.3. Hierarchical CNN-Based Vision Encoder for Multi-Level Feature Representation

In prior studies [15], input images were divided into patches using various horizontal and vertical partitioning schemes, and a CNN-based vision encoder was applied independently to each patch to perform flat feature extraction. This approach captures local visual information, but it is limited in its ability to represent multi-level semantic characteristics within each patch. To address this limitation, the present study introduces a hierarchical convolutional neural network (CNN)-based vision encoder that improves feature representation at the patch level. The hierarchical CNN encoder aggregates multi-level semantics within local regions to improve the quality of patch embeddings provided to the transformer. Unlike purely flat feature extraction, the resulting patch embeddings preserve both fine-grained visual cues and higher-level context. This is advantageous in underwater environments, where local textures are weak and appearances change frequently.

Figure 7 illustrates how the input image is divided based on a predefined grid configuration. Each resulting patch is then processed by a convolutional neural network (CNN) vision encoder to extract two-dimensional feature representations. These initial feature vectors are successively refined through a hierarchy of CNN encoders with different kernel sizes. The encoders are applied in an overlapping, multi-stage process. Finally, the proposed encoder combines feature representations from all hierarchical levels to create a composite feature vector that captures both fine-grained local details and higher-level semantic information. This hierarchical feature extraction strategy produces more robust and informative visual representations, thereby improving perception tasks such as glider hull localization and pose estimation.

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote an input underwater RGB image captured by the ROV-mounted camera. The image is first partitioned into a fixed grid of $P$ non-overlapping patches:

$$I = \{I_{1}, I_{2}, \dots, I_{P}\} \tag{1}$$

where each patch $I_{p} \in \mathbb{R}^{h \times w \times 3}$. For each patch $I_{p}$, a hierarchical CNN vision encoder is applied to extract multi-level feature representations.
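As a minimal sketch of the partitioning in Equation (1), the following example splits a single-channel toy image into a fixed grid of non-overlapping patches. The function name and the requirement that the image size be divisible by the grid are assumptions for illustration; the text does not describe how borders are handled.

```python
def partition_into_patches(image, rows, cols):
    """Split an H x W image (list of rows) into rows * cols non-overlapping patches.

    Assumes H is divisible by rows and W by cols (an assumption of this sketch).
    """
    H, W = len(image), len(image[0])
    if H % rows or W % cols:
        raise ValueError("image size must be divisible by the grid")
    h, w = H // rows, W // cols
    patches = []
    for i in range(rows):
        for j in range(cols):
            patch = [row[j * w:(j + 1) * w] for row in image[i * h:(i + 1) * h]]
            patches.append(patch)
    return patches


image = [[r * 8 + c for c in range(8)] for r in range(6)]  # 6 x 8 toy image
patches = partition_into_patches(image, rows=3, cols=4)    # P = 12 patches of 2 x 2
```

Patches are returned in row-major order, so the index $p$ runs left to right and then top to bottom.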

Let $f^{(l)}(\cdot)$ denote the CNN encoder at hierarchy level $l$, employing different kernel sizes and receptive fields.

The hierarchical features are computed as:

$$f_{p}^{(1)} = f^{(1)}(I_{p}), \quad f_{p}^{(l)} = f^{(l)}\left(f_{p}^{(l-1)}\right), \quad l = 2, \dots, L. \tag{2}$$

The final patch-level feature vector is obtained by concatenating all hierarchical features:

$$f_{p} = \mathrm{Concat}\left(f_{p}^{(1)}, f_{p}^{(2)}, \dots, f_{p}^{(L)}\right) \tag{3}$$

All patch features are then stacked to form the input token sequence for the transformer:

$$F = [f_{1}, f_{2}, \dots, f_{P}] \in \mathbb{R}^{P \times D} \tag{4}$$
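The data flow of Equations (2) to (4) can be illustrated with a toy example. In this sketch each patch is a flattened one-dimensional vector and each encoder stage is replaced by a moving-average filter with a different kernel size; both are stand-ins assumed for illustration and are not the convolutional stages used in the actual encoder.

```python
def box_filter(signal, k):
    """Toy stand-in for a CNN stage: 'valid' moving average with kernel size k."""
    return [sum(signal[i:i + k]) / k for i in range(len(signal) - k + 1)]


def hierarchical_encode(patch, kernel_sizes):
    """Eqs. (2)-(3): chain the stages, then concatenate the output of every level."""
    features, current = [], patch
    for k in kernel_sizes:            # f_p^(l) = f^(l)(f_p^(l-1))
        current = box_filter(current, k)
        features.extend(current)      # Concat(f_p^(1), ..., f_p^(L))
    return features


def encode_image(patches, kernel_sizes):
    """Eq. (4): stack the patch features into a P x D token matrix."""
    return [hierarchical_encode(p, kernel_sizes) for p in patches]


patches = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
F = encode_image(patches, kernel_sizes=[3, 2, 2])  # L = 3 levels, D = 4 + 3 + 2 = 9
```

Because every level contributes to the concatenated vector, each token keeps the shallow response alongside the progressively smoothed ones.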

To investigate the effect of hierarchical depth in the proposed CNN vision encoder, we conducted an ablation study by varying the number of hierarchical levels $L$ used for feature extraction. Specifically, we evaluated encoder configurations with depths of $L \in \{1, 2, 3\}$, where $L = 1$ corresponds to a conventional flat CNN encoder without hierarchical feature aggregation, and larger values of $L$ progressively incorporate additional convolutional stages with different kernel sizes and spatial receptive fields. For each configuration of hierarchical depth, the model was trained for 500 epochs with a batch size of 32 on an NVIDIA RTX A5000 GPU (NVIDIA, USA).

Figure 8 shows scatter plots that demonstrate the relationship between the predicted and actual values at each level of the hierarchy. Quantitative metrics, such as mean absolute error (MAE) and the coefficient of determination ($R^{2}$), indicate a significant improvement in performance with increasing hierarchy depth. There is poor correlation at depth 1 ($R^{2} = 0.05$), but the model substantially improves at depth 2 ($R^{2} = 0.49$) and achieves optimal performance at depth 3 ($R^{2} = 0.52$, $\mathrm{MAE} = 0.06$). These results suggest that deeper hierarchical representations are essential for capturing the spatial dependencies necessary for precise 3D coordinate estimation.
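For reference, the two regression metrics reported above can be computed as in the following sketch. It uses the standard definitions of MAE and $R^{2}$; the function names and the sample values are assumptions for illustration and are not data from the experiments.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 2.0, 2.5]
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

An $R^{2}$ near zero, as at depth 1, means the predictions explain little more of the variance than a constant prediction at the mean of the targets.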

The regression error characteristic (REC) curves in Figure 9 illustrate the relationship between cumulative accuracy and allowable error tolerance. Quantitative analysis reveals that the error threshold needed to reach 90% accuracy progressively decreases from 0.218 at depth 1 to 0.185 at depth 2 and to 0.162 at depth 3, with over 50% of samples at depth 3 converging within a margin of error of 0.033. These results suggest that deeper hierarchical features can effectively compensate for fine-grained spatial errors.
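A REC curve and the tolerance needed to reach a target accuracy can be computed as sketched below. The function names are assumptions, and the integer-valued errors are a toy example chosen so that the results are easy to verify by hand; they are unrelated to the values reported above.

```python
import math


def rec_curve(y_true, y_pred, tolerances):
    """Fraction of samples whose absolute error is within each tolerance."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return [sum(e <= tol for e in errors) / len(errors) for tol in tolerances]


def tolerance_at_accuracy(y_true, y_pred, target):
    """Smallest tolerance that covers at least `target` of the samples."""
    errors = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    # Rounding guards against floating-point noise such as 0.7 * 10 = 7.000000000000001.
    needed = math.ceil(round(target * len(errors), 9))
    return errors[max(needed - 1, 0)]


# Toy example: absolute errors are 1, 2, ..., 10.
y_true = [0] * 10
y_pred = list(range(1, 11))
curve = rec_curve(y_true, y_pred, tolerances=[5, 9, 10])
tol_90 = tolerance_at_accuracy(y_true, y_pred, target=0.9)
```

A lower tolerance at a fixed accuracy level corresponds to a curve that rises earlier, which is the comparison made between the three depths.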

The experimental results demonstrate that increasing the hierarchical depth from $L = 1$ to $L = 3$ improves the accuracy of glider hull localization and heading estimation. This improvement is due to the encoder’s ability to capture fine-grained local details and higher-level semantic context through multi-level feature aggregation. Shallow configurations ($L = 1$) tend to focus on local texture and edge information, which is often unreliable in underwater environments due to variations in illumination and turbidity. In contrast, deeper hierarchical encoders provide more robust representations by integrating contextual and structural cues across multiple scales.

However, increasing the depth beyond $L = 3$ only marginally improves performance while incurring additional computational costs and a higher risk of feature redundancy. These results suggest a trade-off between representational richness and efficiency. Based on these findings, the study selects a hierarchical depth of $L = 3$ as the default configuration, offering a favorable balance between perception accuracy and computational complexity.

3.4. Transformer with Variable Mixture-of-Experts for Efficient Inference in 3D Position Estimation

The transformer processes the features that the CNN-based vision encoders extract from image patches of varying sizes and models the relationships among them. This allows information to be integrated and refined into a more expressive representation of the extracted features.

The variable mixture-of-experts (vMoE) encoder replaces the conventional feed-forward network (FFN) with a dynamically routed mixture of a varying number of experts while preserving the multi-head attention and residual normalization structure of a transformer. A gating mechanism with sparse selection activates a subset of experts at each layer, enabling efficient and scalable representation learning.

Figure 10 illustrates a comparison between a standard transformer encoder and the proposed vMoE transformer encoder. The conventional transformer encoder architecture is depicted on the left side. Each encoder block consists of a multi-head self-attention (MHA) layer followed by a feed-forward network (FFN). Residual connections and layer normalization, denoted as ‘Add & Norm’, are applied after both the MHA and FFN sublayers to stabilize training and facilitate gradient flow. In the standard transformer, the FFN is a dense module that processes all input tokens with a fixed set of parameters in each layer.

The proposed vMoE transformer encoder is shown on the right side. It replaces the FFN sublayer with a variable mixture-of-experts module while preserving the overall transformer structure. This module consists of multiple feed-forward networks, or “experts,” denoted as ‘$\mathrm{FFN}_{k}$’, where $k$ represents the total number of experts available at a given layer. A gating network controls the selection and combination of the experts. In early encoder layers, a larger number of low-dimensional experts may be employed to capture diverse low-level patterns, whereas deeper layers may utilize fewer but higher-capacity experts to model more abstract semantic information. The final encoder output is produced after stacking all vMoE-adapted encoder blocks.

Underwater imagery often exhibits contrast loss and sparsity. These issues can cause early representations to become less discriminative across viewpoints and distances. The vMoE encoder allocates more experts to the early stages to encourage diverse low-level feature processing. This provides more stable cues for subsequent fusion and regression. Deeper stages can then operate on richer, more consistent representations to refine semantic understanding while maintaining computational efficiency through sparse expert activation.

To improve computational efficiency, a sparse selection mechanism activates only the experts with the highest weights, rather than all $k$ experts. This selective activation enables conditional computation, which allows the model to increase its representational capacity without proportionally increasing computation.
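The sparse selection described above can be sketched as a top-k router. This is a generic illustration rather than the gating network used in this work: the function names, the renormalization of the selected logits with a softmax, and the scaling "experts" are all assumptions. The routing parameter is named `top_k` to keep it distinct from $k$, which denotes the total number of experts in the text.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, top_k):
    """Return {expert_index: weight} for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_logits[i] for i in chosen])
    return dict(zip(chosen, weights))


def moe_forward(token, experts, gate_logits, top_k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, w in top_k_route(gate_logits, top_k).items():
        y = experts[idx](token)        # unselected experts are never evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Four toy "experts" that scale the token by different factors (hypothetical).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 2.0]
routing = top_k_route(gate_logits, top_k=2)
output = moe_forward([1.0, -1.0], experts, gate_logits, top_k=2)
```

Only two of the four experts run for this token, so the expert computation stays the same if more experts are added while `top_k` is held fixed.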

We conducted an ablation study to analyze the impact of expert allocation on model performance and efficiency. For this study, we assigned different numbers of experts to each vMoE transformer encoder block. Specifically, we evaluated configurations in which the number of experts varied across encoder blocks while the dimensionality of each expert was adjusted to keep the overall computational budget approximately constant.

Table 1 summarizes the three expert allocation strategies that were evaluated in the ablation study. According to the rule for varying the number of experts, more experts are assigned to the early encoder blocks that receive output features from the CNN vision encoder. Case 1 uses a constant number of experts and routing parameters throughout all transformer encoder blocks. Case 2 gradually reduces the number of experts and routing capacity toward deeper layers uniformly. Case 3 uses a highly skewed allocation, assigning many experts to the early encoder blocks and progressively fewer to the later blocks. This emphasizes rich feature diversification in the early stages and refined representation learning in the deeper layers. The base model has eight encoder blocks, and each configuration was trained for 200 epochs with a batch size of 32 using an NVIDIA RTX A5000 GPU.
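One way to vary the number of experts per block while holding the budget roughly constant is to shrink each expert's hidden width as the count grows. The sketch below illustrates this idea under stated assumptions: the budget is interpreted as a per-block parameter count, and the model width, budget, and expert counts are hypothetical values chosen for this example. They are not the values listed in Table 1.

```python
def expert_hidden_dim(num_experts, d_model, param_budget):
    """Hidden width per expert so that num_experts two-layer FFNs fit the budget.

    A two-layer FFN expert has about 2 * d_model * d_hidden weights (biases ignored).
    """
    return param_budget // (2 * d_model * num_experts)


def build_schedule(experts_per_block, d_model, param_budget):
    """Return a (num_experts, hidden_dim) pair for every encoder block."""
    return [(n, expert_hidden_dim(n, d_model, param_budget)) for n in experts_per_block]


D_MODEL = 256                    # hypothetical model width
BUDGET = 2 * D_MODEL * 1024      # hypothetical per-block budget (one 1024-wide FFN)

# Hypothetical expert counts for eight encoder blocks; NOT the values in Table 1.
constant = build_schedule([4] * 8, D_MODEL, BUDGET)
skewed = build_schedule([16, 16, 8, 8, 4, 4, 2, 2], D_MODEL, BUDGET)
```

In the skewed schedule the early blocks hold many narrow experts and the late blocks hold a few wide ones, which mirrors the early-heavy allocation described for Case 3.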

We analyzed the computational complexity of the vMoE architecture using a single forward pass, a batch size of one, and a fixed input resolution. We computed floating-point operations (FLOPs) and measured inference speed (FPS) on the deployed hardware platform after a warm-up stage.

The total computational cost of each transformer layer consists of self-attention operations and feed-forward (expert) computations. In the mixture-of-experts design, the feed-forward network is replaced by multiple experts. However, during inference, only the top-k routed experts are activated.

Although more experts are allocated to the early layers to promote feature diversification, the actual inference cost is bounded by the routing parameter. This design enhances representational diversity in shallow layers without proportionally increasing computational complexity.
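The bound imposed by the routing parameter can be made concrete with a rough per-token cost estimate. The formula below is a common approximation (two floating-point operations per multiply-accumulate, activations and biases ignored) and is an assumption of this sketch, as are the function name and the layer sizes; it is not the procedure used to obtain the FLOPs reported in this paper.

```python
def moe_ffn_flops_per_token(d_model, d_hidden, num_experts, top_k):
    """Approximate FLOPs for one token through a sparsely routed MoE FFN."""
    gate = 2 * d_model * num_experts            # linear gate: one logit per expert
    per_expert = 2 * (2 * d_model * d_hidden)   # d_model -> d_hidden -> d_model
    return gate + top_k * per_expert


# Hypothetical layer: 16 experts of width 64 on a 256-dimensional token.
all_experts = moe_ffn_flops_per_token(256, 64, num_experts=16, top_k=16)
sparse = moe_ffn_flops_per_token(256, 64, num_experts=16, top_k=2)
```

In this estimate the number of experts affects only the small gating term, while the dominant expert term scales with `top_k`.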

To evaluate the computational efficiency of the proposed architecture under this setting (a single forward pass with a batch size of 1), we report the number of parameters, the floating-point operations (FLOPs), and the inference speed. The model contains 55.86 M parameters and requires 0.860 GFLOPs per inference at the configured input resolution. The FLOPs were computed over the network’s full forward propagation, including the vMoE layers under top-k expert routing. Inference speed was measured on an Intel i5 notebook using the test dataset and reached approximately 11 frames per second (FPS) in evaluation mode. These results suggest that the early-heavy vMoE architecture increases representational capacity while maintaining practical real-time feasibility for embedded underwater inspection scenarios.
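A throughput measurement with a warm-up stage can be sketched as follows. The helper name, the number of warm-up and timed runs, and the dummy workload are assumptions for illustration; the actual measurement would call the trained network on test images instead.

```python
import time


def measure_fps(infer, sample, warmup=5, runs=20):
    """Average frames per second of `infer(sample)` after a warm-up stage."""
    for _ in range(warmup):          # warm-up calls are excluded from the timing
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed


def dummy_model(x):
    """Hypothetical stand-in for the forward pass of the real network."""
    return sum(v * v for v in x)


fps = measure_fps(dummy_model, list(range(10_000)))
```

Excluding the first calls avoids counting one-time costs such as memory allocation and cache population in the reported rate.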

Figure 11 and Figure 12 show the inference results of the three-dimensional position regression models on the validation set for each case. Figure 11 compares the predicted and ground-truth values along the z-axis, which represents the distance between the ROV and the glider.

The quantitative metrics in the upper-left corner of the plots indicate that Case 3 has the highest coefficient of determination ($R^{2} = 0.56$), followed by Case 2 ($R^{2} = 0.45$) and Case 1 ($R^{2} = 0.22$). Additionally, Case 3 exhibits the lowest mean absolute error (MAE) of 0.07. These results demonstrate that the proposed expert allocation strategy with a biased distribution of experts significantly improves prediction accuracy.

Figure 12 shows the regression error characteristic (REC) curves, which represent the proportion of predictions that fall within a given error tolerance. A curve that rises more steeply toward the upper-left corner indicates greater model robustness. Comparing the error thresholds required to achieve 90% accuracy ($\mathrm{Acc} = 0.9$), Case 3 exhibits the narrowest tolerance at 0.174, thereby outperforming Cases 1 and 2 (0.203 and 0.186, respectively) and indicating more precise prediction capability. Case 3 also rises particularly steeply in the low-error region ($\mathrm{tolerance} < 0.1$), which is favorable for achieving high accuracy at small tolerance levels.

The results demonstrate that increasing the variability of the number of experts results in consistent improvements in localization accuracy. A variable expert configuration with more low-dimensional experts in the early layers and fewer high-capacity experts in the deeper layers outperforms a fixed expert design.

3.5. Ablation Study on Multi-Sensor Fusion of Camera and Sonar for Robust Underwater Glider Perception

An ablation analysis was conducted to investigate the effect of merging color images acquired from the ROV-mounted camera with intensity-map images derived from multi-beam sonar data.

Figure 13 shows a quantitative comparison of the proposed 3D relative position regression model with two different training configurations. It illustrates the effect of incorporating sonar measurements. Figure 13a shows the results when training the model with only camera-derived inputs, i.e., excluding the sonar intensity-map channel from the dataset. In this camera-only configuration, the predicted values are dispersed around the ideal line, especially along the z-axis. This indicates that depth observability is limited from monocular imagery in visually degraded underwater conditions. This result is consistent with the intrinsic ambiguity of monocular depth estimation, which becomes more pronounced when contrast, texture, and visibility are reduced. The relatively low $R^{2}$ values across the axes suggest that the model only explains a small amount of the variance in the ground-truth position when relying solely on visual information.

Figure 13b, in contrast, shows the results obtained when the sonar intensity-map channel is included during training to create an early-fusion, multi-sensor input. Compared to the camera-only case, the scatter distributions are more concentrated around the ideal prediction line and the regression metrics are improved. Including sonar information yields the most significant benefit for the z-axis, where the prediction trend aligns more closely with the ideal line and the reported $R^{2}$ value increases, indicating improved depth consistency. This improvement is expected because sonar provides range-sensitive acoustic responses over a wider radial field, which complements the limited depth cues available from monocular images. Overall, comparing (a) and (b) demonstrates that integrating sonar intensity maps improves the reliability of 3D localization, especially along the depth-related axis.

4. Results

4.1. Validation of 3D Position and Heading Estimation in an Indoor Tank Environment

We evaluated the accuracy of three-dimensional position and glider heading estimation using the developed glider hull detection model on a validation dataset collected in an indoor water tank. For this evaluation, we set the CNN vision encoder’s hierarchical depth to three and used a highly skewed variable expert configuration for the transformer, which was identified as the optimal setting in the ablation studies.

Figure 14 shows the regression results for the x, y, and z-axes, as well as the heading angle. These results were obtained by applying a full model that incorporates a hierarchical convolutional neural network (CNN) vision encoder, a variable mixture-of-experts (vMoE) transformer, and camera–sonar multi-sensor fusion. The scatter plots compare the predicted values with the ground-truth measurements for the three-dimensional relative position components and the glider heading angle.

The proposed model demonstrates strong predictive performance for the translational components across all axes. In particular, the x and y-axis results exhibit high coefficients of determination ($R^{2} = 0.84$ and $R^{2} = 0.88$, respectively), indicating that the model captures most of the variance in lateral relative motion. The z-axis, which corresponds to the distance between the ROV and the glider, is typically the most difficult to estimate using vision alone. However, this axis also shows robust performance ($R^{2} = 0.86$). This result underscores the effectiveness of incorporating sonar intensity-map information, which provides supplementary, range-sensitive cues that notably enhance depth estimation.

Figure 15 shows the regression error characteristic (REC) curve of the proposed model. The REC curve provides a comprehensive evaluation of localization accuracy as a function of error tolerance. It illustrates the proportion of predictions whose error falls within a given tolerance. This offers insight into the model’s accuracy and robustness across different error regimes. A steeper rise toward the upper-left region of the plot indicates greater robustness and precision.

The proposed framework achieves 50% accuracy at an error tolerance of 0.047, 70% at 0.071, and 90% at 0.125. These relatively small tolerance thresholds demonstrate that a large proportion of predictions converge within tight error bounds. The steep increase in the low-tolerance region shows that the model can achieve high precision for a significant number of samples. This is crucial for safe, close-range ROV operations, such as glider approach, inspection, and recovery.

Figure 16 shows the validation results for the localization of an underwater glider hull in an indoor water tank. It compares ground truth and model predictions to evaluate the proposed framework. Figure 16a shows the three-dimensional ground-truth center positions of the glider hull overlaid on the images. Figure 16b shows the three-dimensional center positions of the glider hull estimated by the proposed model under the same experimental conditions. The predicted center coordinates closely match the ground truth annotations, demonstrating accurate localization across different viewpoints and distances. Notably, the model successfully estimates the center of the hull without relying on visual markers, indicating that it captures the glider’s intrinsic geometric and appearance-based features.

These results confirm that combining hierarchical, multi-level visual features; adaptive expert allocation in the transformer encoder; and camera–sonar multi-sensor fusion yields reliable three-dimensional localization, which is suitable for autonomous underwater operations.

4.2. Test of 3D Position Estimation in a Sea Environment

The first test experiment was conducted in a harbor. The glider was deployed to float near the harbor, and the ROV navigated autonomously toward the glider by receiving its transmitted GPS signal. Once the ROV reached the glider, it descended below the surface of the water and estimated the three-dimensional position of the glider’s hull. This information was then used to perform a controlled approach maneuver.

Figure 17 shows the results of three-dimensional glider hull position estimation obtained from real sea experiments. During these experiments, the glider floated freely on the surface of the sea while underwater images were captured in natural marine conditions. These conditions were characterized by strong illumination variations, surface reflections, water turbidity, and dynamic background disturbances, and they present greater challenges for visual perception than controlled indoor environments.

In the figure, the green bounding boxes represent the projected hull regions obtained by projecting the measured physical size of the glider hull onto the image plane based on the estimated three-dimensional position and heading. The red circle markers indicate the projection of predicted 3D center positions of the glider hull relative to the ROV-mounted camera. The close alignment between the projected hull extents and the visible glider structure demonstrates the consistency and accuracy of the estimated three-dimensional position.
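The projection used for such an overlay can be sketched with an ideal pinhole camera model. This is a simplified illustration under several assumptions: the intrinsic parameters and hull dimensions are hypothetical, lens distortion and underwater refraction are ignored, and the hull is treated as a rectangle facing the camera, so the estimated heading is not used.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame (z forward) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def project_box(center, width, height, fx, fy, cx, cy):
    """Image-plane box of a camera-facing rectangle with the given physical size."""
    x, y, z = center
    u_min, v_min = project_point((x - width / 2, y - height / 2, z), fx, fy, cx, cy)
    u_max, v_max = project_point((x + width / 2, y + height / 2, z), fx, fy, cx, cy)
    return (u_min, v_min, u_max, v_max)


# Hypothetical intrinsics (pixels) and hull pose and size (meters).
FX, FY, CX, CY = 800.0, 800.0, 640.0, 360.0
center_px = project_point((0.5, -0.25, 2.0), FX, FY, CX, CY)
box_px = project_box((0.5, -0.25, 2.0), 2.0, 0.5, FX, FY, CX, CY)
```

Because the box size in pixels scales with the inverse of the estimated distance, a depth error shows up directly as a mismatch between the projected box and the visible hull.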

Figure 18 shows the end-to-end autonomous recovery procedure executed by the ROV control program during real-sea experiments. Initially, the ROV and the glider both transmit GPS signals while operating at the sea surface. Then, the ROV uses these GPS measurements to navigate autonomously to the vicinity of the floating glider. This long-range navigation stage allows for efficient convergence toward the target without requiring continuous visual perception.

Once the ROV reaches a predefined proximity threshold, it transitions from surface navigation to underwater operation by descending below the surface of the water. During this phase, the proposed perception framework activates to detect the glider hull and estimate its three-dimensional position relative to the camera mounted on the ROV. The estimated position is updated continuously and fed into the control module. This allows the ROV to execute precise, collision-free approach maneuvers in conditions of limited visibility and dynamic marine environments.

As the ROV closes in on the glider, the system guides the vehicle toward the recovery interface located on the lower section of the glider hull. Using the estimated pose information, the ROV aligns itself to engage the recovery ring and subsequently grasps the glider body using its onboard manipulation mechanism.

In this recovery scenario, the ROV autonomously detects the glider hull. Then, it adjusts its approach velocity and heading based on the estimated three-dimensional pose to avoid colliding with the glider during close-range interaction. With this perception-guided control strategy, the ROV successfully aligns with the glider and grasps the recovery ring on the lower section of the hull. Experimental results indicate an overall recovery success rate of approximately 80%. The remaining failures are primarily due to harsh sea conditions and strong currents, which significantly impact vehicle stability and the precision required to engage the recovery ring. These results demonstrate that environmental conditions play a key role in recovery performance and confirm the effectiveness of the proposed perception and control framework under realistic operational constraints.

Due to operational constraints, the sea trials were conducted under a limited range of environmental conditions. Therefore, the presented results should be interpreted as validation of feasibility rather than as an exhaustive assessment of robustness across extreme turbidity and current regimes.

5. Discussion

This study presents a perception framework for underwater glider hull localization, integrating a hierarchical convolutional neural network (CNN) vision encoder, a variable mixture-of-experts (vMoE) transformer, and camera–sonar multi-sensor fusion. Extensive validation in indoor tank environments and real sea trials shows that the proposed approach offers robust and precise pose estimation, which enables autonomous ROV-assisted glider recovery. However, the experimental results also reveal several important observations and limitations that warrant further discussion.

5.1. Hierarchical Feature Representation and Patch Partitioning Strategy

Ablation studies confirm that hierarchical feature representation is crucial for improving localization accuracy because it captures multi-scale spatial and semantic information. In the current implementation, the input image is partitioned into three configurations: $2 \times 2$, $3 \times 4$, and $5 \times 9$ grids. These partitioning schemes are selected based on the input image’s resolution and the receptive field sizes of the two-dimensional convolution kernels used in the CNN vision encoder. This design balances feature granularity and computational efficiency. Subdividing the image into finer patches could enable more detailed feature extraction and improve robustness in cluttered or low-contrast underwater scenes. However, increased granularity would introduce higher computational costs and require careful adjustment of kernel sizes and feature aggregation strategies. Therefore, exploring adaptive or resolution-aware patch partitioning schemes is an important area of future research.

5.2. Limitations and Future Directions of the vMoE Transformer Architecture

A vMoE transformer with a highly skewed expert allocation demonstrates improved performance compared to fixed or uniformly distributed expert configurations. This highlights the effectiveness of adaptive capacity allocation across encoder blocks. However, the current study only explores a small part of the vMoE design space. Additional ablation studies are necessary to systematically investigate a broader range of hyperparameters, such as alternative expert weights, dimensionalities, and routing strategies. Furthermore, incorporating advanced activation functions, such as SwiGLU [24] or related gated linear units, could enhance representational capacity and training stability within expert networks. A more comprehensive exploration of these design choices could yield additional performance gains and provide deeper insight into optimal MoE configurations for underwater perception tasks.

5.3. Sensor Range Complementarity and Multi-Sensor Fusion Challenges

The experimental results demonstrate that combining a camera and sonar enhances three-dimensional localization performance, particularly along the depth axis. However, real-sea experiments reveal the inherent limitations of the two modalities in terms of effective sensing range. In practical marine environments, vision-based perception typically provides reliable information only within approximately one meter due to turbidity, lighting variations, and scattering effects. In contrast, sonar sensors are effective at longer distances but have lower spatial resolution. This creates an intermediate range in which neither sensor alone provides sufficiently informative measurements. Integrating additional sensing modalities, such as short-range acoustic sensors and inertial and velocity-based sensors, may be necessary to bridge the effective operating ranges of vision and sonar. Such complementary sensing could improve robustness during transitions from long-range approaches to close-proximity interactions.

A comparison of camera-only and camera–sonar configurations confirms that integrating sonar substantially improves estimation of the depth axis, which is the most challenging component of monocular vision. These results suggest that multi-modal fusion is crucial for reliable 3D localization in underwater environments, especially when the visual range is limited.

Although camera–sonar fusion improves localization performance, it only works when the two modalities are accurately synchronized in space and time. In the current system, sensor measurements are aligned based on timestamp matching. However, residual temporal offsets may occur due to sensor latency and communication delays. This issue becomes more critical under strong currents or wave-induced motion because rapid relative movement can amplify the misalignment between camera frames and sonar returns [25,26]. Future work should incorporate motion-compensated alignment using IMU/DVL measurements and temporal interpolation schemes to improve synchronization robustness [27].
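One simple form of the temporal interpolation mentioned above is to resample a sonar-derived quantity at the camera timestamp. The sketch below uses linear interpolation between the two neighboring sonar samples; the function name, the choice of a scalar range value, and the sample timestamps are assumptions for illustration, and motion compensation with IMU/DVL data is not modeled.

```python
from bisect import bisect_left


def interpolate_at(timestamps, values, t):
    """Linearly interpolate `values` (sampled at sorted `timestamps`) at time t."""
    if not timestamps[0] <= t <= timestamps[-1]:
        raise ValueError("t is outside the sampled interval")
    i = bisect_left(timestamps, t)
    if timestamps[i] == t:            # exact match: no interpolation needed
        return values[i]
    t0, t1 = timestamps[i - 1], timestamps[i]
    alpha = (t - t0) / (t1 - t0)
    return values[i - 1] + alpha * (values[i] - values[i - 1])


# Hypothetical sonar range samples (seconds, meters) and a camera frame at t = 0.5 s.
sonar_times = [0.0, 1.0, 2.0]
sonar_ranges = [2.0, 1.5, 1.0]
range_at_frame = interpolate_at(sonar_times, sonar_ranges, 0.5)
```

Compared with pairing each frame with the nearest sonar return, interpolation reduces the residual offset when the relative distance changes between two sonar pings.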

5.4. Practical Deployment Considerations and Future Work

It should be noted that the current field experiments do not encompass the full range of extreme marine conditions. High turbidity and strong currents impose practical operational limitations. Reduced visibility makes camera-based features less reliable, and current-induced drift makes stable, close-range engagement more difficult. While the proposed framework is feasible under a few realistic test conditions, determining quantitative environmental adaptation thresholds (e.g., maximum turbidity levels and current speed limits) is an important topic for future systematic evaluation.

The proposed framework performs well in both indoor and real sea experiments. However, the current field validation is limited to a constrained set of marine conditions. Perception stability and recovery-ring engagement accuracy are sensitive to environmental disturbances such as reduced underwater visibility, wave-induced relative motion, and strong currents. These factors introduce rapid pose variation, intermittent occlusions, and unstable vehicle motion, which can degrade localization consistency and directly affect recovery success. Therefore, additional large-scale evaluations under more diverse and severe ocean conditions are required to further validate robustness and operational reliability.

Another important direction is incorporating temporal consistency across consecutive frames. Since the current framework performs pose estimation on a per-frame basis, the predicted localization results may fluctuate under rapid motion or noisy sensing conditions. To improve stability, we could adopt sequential modeling approaches, such as temporal transformers or recurrent neural networks (RNNs), which exploit motion continuity. Alternatively, we could use lightweight filtering techniques, such as exponential smoothing or Kalman filtering, to suppress estimation jitter and enhance robustness during close-range recovery operations.
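The two lightweight filters named above can be sketched as follows. Both are textbook forms written for illustration: the function names, the smoothing factor, the noise variances, and the constant-position model in the Kalman filter are assumptions, and neither filter is part of the current framework.

```python
def exponential_smoothing(positions, alpha):
    """Smooth a sequence of (x, y, z) estimates; alpha in (0, 1] weights the newest frame."""
    smoothed, state = [], None
    for p in positions:
        if state is None:
            state = tuple(p)
        else:
            state = tuple(alpha * new + (1 - alpha) * old for new, old in zip(p, state))
        smoothed.append(state)
    return smoothed


def kalman_1d(measurements, process_var, meas_var):
    """Scalar Kalman filter with a constant-position model (one axis at a time)."""
    x, p = measurements[0], meas_var   # initialize from the first measurement
    out = [x]
    for z in measurements[1:]:
        p += process_var               # predict: uncertainty grows between frames
        gain = p / (p + meas_var)      # update: blend prediction and measurement
        x = x + gain * (z - x)
        p = (1 - gain) * p
        out.append(x)
    return out


# Hypothetical per-frame estimates with a jittery z component.
frames = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0), (0.0, 0.0, 2.0)]
smooth = exponential_smoothing(frames, alpha=0.5)
filtered_z = kalman_1d([2.0, 4.0, 3.0], process_var=0.0, meas_var=1.0)
```

A smaller `alpha` or a smaller `process_var` suppresses more jitter but makes the estimate slower to follow genuine motion, which matters during the final approach.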

From a practical deployment standpoint, computational efficiency is essential for real-time execution on underwater robotic platforms with limited resources. While the vMoE transformer improves capacity scaling through sparse expert activation, further optimization is needed to minimize runtime overhead. Strategies for achieving this include reducing the number of experts, lowering the top-k routing value, and using lightweight expert architectures. Furthermore, model compression and quantization techniques (e.g., FP16 or INT8 inference) could significantly enhance real-time performance without substantially compromising localization accuracy [28,29].
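To illustrate what INT8 quantization does to a set of weights, the following sketch applies symmetric per-tensor quantization and then reconstructs the values. It is a conceptual example only: the function names and the sample weights are assumptions, and a deployed system would rely on the quantization tooling of its inference framework rather than this code.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 if all zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point values."""
    return [v * scale for v in q]


weights = [-2.54, 0.0, 1.0, 2.54]     # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error of each weight is at most half a quantization step, which is why accuracy often degrades only slightly while the storage per weight drops from 32 bits to 8.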

Major localization errors observed in real marine environments are primarily caused by limited visibility, sensor noise, and the rapid relative motion between the ROV and the drifting glider. Mitigating these errors requires several improvements, such as refining the vMoE transformer architecture [30,31] and integrating additional sensing modalities. Incorporating IMU measurements, for example, can support motion compensation, and Doppler Velocity Log (DVL) sensors can provide accurate relative velocity estimates to improve motion awareness and stabilize pose estimation. Furthermore, short-range acoustic sensors can provide high-confidence distance measurements in close-proximity scenarios, effectively bridging the sensing gap between vision-based perception and long-range sonar. Additionally, expanding training datasets collected under diverse environmental conditions is essential for improving model generalization. Together, these enhancements are expected to significantly improve the robustness of autonomous recovery performance in harsh sea environments.

Benchmarking directly with recent mainstream 3D localization frameworks remains challenging because many state-of-the-art methods rely on sensing modalities such as stereo vision, LiDAR, and structured depth measurements that are either unreliable or unavailable in underwater recovery scenarios. Differences in datasets, evaluation protocols, and sensor configurations also complicate fair comparisons. Therefore, this study employs systematic ablation analysis to isolate the contributions of hierarchical feature representation, vMoE-based capacity allocation, and camera–sonar fusion. Future work should focus on developing unified benchmarking protocols for underwater glider recovery perception.

6. Conclusions

This paper presents a perception framework based on multi-sensor fusion that can detect and three-dimensionally localize underwater glider hulls. The framework employs a hierarchical convolutional neural network (CNN) vision encoder and a variable mixture-of-experts (vMoE) transformer architecture with camera–sonar multi-sensor fusion to address the challenges of underwater perception in visually obstructed and dynamic marine environments.

The hierarchical CNN vision encoder effectively extracts multi-level and multi-scale visual features, enabling robust localization under varying viewpoints and environmental conditions. The vMoE transformer enhances representational capacity and computational efficiency by adaptively allocating experts across encoder blocks. Furthermore, integrating camera images with sonar intensity maps greatly improves three-dimensional localization accuracy, particularly along the depth axis, where vision-only approaches are limited.
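
The channel-wise stacking of camera images and sonar intensity maps described in Figure 6 can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `stack_camera_sonar`, the 8-bit input range, and the assumption that the sonar map has already been resampled to the camera image size are hypothetical.

```python
import numpy as np

def stack_camera_sonar(rgb, sonar):
    """Stack an RGB image (H, W, 3) and a sonar intensity map (H, W) into (H, W, 4).

    Both inputs are assumed to be 8-bit arrays and are scaled to [0, 1].
    """
    rgb = np.asarray(rgb, dtype=np.float32) / 255.0
    sonar = np.asarray(sonar, dtype=np.float32) / 255.0
    if rgb.shape[:2] != sonar.shape:
        raise ValueError("sonar map must be resampled to the camera image size first")
    # Append the sonar intensity as a fourth channel after R, G, and B.
    return np.concatenate([rgb, sonar[..., None]], axis=-1)
```

A four-channel tensor of this kind lets a single encoder consume both modalities, provided its first convolution accepts four input channels.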

Extensive evaluations in indoor water tank environments and real-sea trials validate the effectiveness and generalization capability of the proposed framework. Quantitative results, including regression metrics and regression error characteristic curve analysis, demonstrate that the framework is more accurate and robust than baseline configurations. Additionally, underwater glider recovery experiments achieved an overall success rate of approximately 80%, confirming the practical feasibility of the perception and control pipeline under realistic operational conditions.
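
For readers unfamiliar with regression error characteristic (REC) curves, the sketch below shows the general computation: for each error tolerance, it reports the fraction of samples whose absolute error does not exceed that tolerance. The function name `rec_curve`, the evenly spaced tolerance grid, and the use of a scalar absolute error are illustrative assumptions rather than the evaluation code used in this study.

```python
import numpy as np

def rec_curve(y_true, y_pred, num_points=50):
    """Return (tolerances, accuracy) for a regression error characteristic curve.

    accuracy[i] is the fraction of samples with |y_true - y_pred| <= tolerances[i].
    """
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Sweep the tolerance from zero up to the largest observed error.
    tolerances = np.linspace(0.0, errors.max(), num_points)
    accuracy = np.array([(errors <= t).mean() for t in tolerances])
    return tolerances, accuracy
```

A curve that rises quickly toward 1.0 indicates that most predictions fall within a small tolerance. For three-dimensional positions, the per-sample error could instead be taken as the Euclidean distance between the predicted and ground-truth points.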

Future research will focus on several key areas. First, incorporating temporal modeling techniques, such as sequential transformers or recurrent architectures, may improve robustness by exploiting motion continuity across data frames. Second, expanding the multi-sensor fusion framework to include additional sensing modalities is essential for overcoming range limitations and enhancing system reliability. Finally, extensive sea trials will be conducted across a wider range of environmental conditions and glider configurations to further validate scalability and generalization.

Author Contributions

Conceptualization, J.L.; methodology, J.L. and J.-H.H.; software, J.L. and K.N.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, J.L. and J.S.; data curation, J.L., J.-H.H., K.N. and J.-H.P.; writing—original draft preparation, J.L.; writing—review and editing, J.L. and J.S.; visualization, J.L. and J.-H.P.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Planning & Evaluation Institute of Industrial Technology (KEIT) and conducted by the Ministry of Trade, Industry and Energy (MOTIE) (Robot Industrial Core Technology Development Project, Project Number 20018764).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Winkler, A.; Mienert, J.; Mevel, C.; Dürr, S. The Deep Sea Frontier: A New European Research Initiative. Sci. Drill. 2007, 4, 38–39.
2. Yuh, J. Design and Control of Autonomous Underwater Robots: A Survey. Auton. Robot. 2000, 8, 7–24.
3. Rudnick, D.L.; Davis, R.E.; Eriksen, C.C.; Fratantoni, D.M.; Perry, M.J. Underwater Gliders for Ocean Research. Mar. Technol. Soc. J. 2004, 38, 73–84.
4. Eriksen, C.C.; Osse, T.J.; Light, R.D.; Wen, T.; Lehman, T.W.; Sabin, P.L.; Ballard, J.W.; Chiodi, A.M. Seaglider: A Long-Range Autonomous Underwater Vehicle for Oceanographic Research. IEEE J. Ocean. Eng. 2001, 26, 424–436.
5. Webb, D.C.; Simonetti, P.J.; Jones, C.P. SLOCUM: An Underwater Glider Propelled by Environmental Energy. IEEE J. Ocean. Eng. 2001, 26, 447–452.
6. Ridao, P.; Carreras, M.; Ribas, D.; Sanz, P.J.; Oliver, G. Intervention AUVs: The next Challenge. Annu. Rev. Control 2015, 40, 227–241.
7. Huynh, T.; Tran, M.-T.; Lee, M.; Kim, Y.-B.; Lee, J.; Suh, J.-H. Development of Recovery System for Underwater Glider. J. Mar. Sci. Eng. 2022, 10, 1448.
8. Ni, T.; Sima, C.; Zhang, W.; Wang, J.; Guo, J.; Zhang, L. Vision-Based Underwater Docking Guidance and Positioning: Enhancing Detection with YOLO-D. J. Mar. Sci. Eng. 2025, 13, 102.
9. Xie, K.; Yang, J.; Qiu, K. A Dataset with Multibeam Forward-Looking Sonar for Underwater Object Detection. Sci. Data 2022, 9, 739.
10. Hu, Z.; Cheng, L.; Yu, S.; Xu, P.; Zhang, P.; Tian, R.; Han, J. Underwater Target Detection with High Accuracy and Speed Based on YOLOv10. J. Mar. Sci. Eng. 2025, 13, 135.
11. Wang, C.; Zhong, C. Adaptive Feature Pyramid Networks for Object Detection. IEEE Access 2021, 9, 107024–107032.
12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations 2021, Virtual Event, 3–7 May 2021.
13. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; pp. 213–229.
14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
15. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87.
16. Eigen, D.; Ranzato, M.; Sutskever, I. Learning Factored Representations in a Deep Mixture of Experts. arXiv 2013, arXiv:1312.4314.
17. Bengio, Y.; Léonard, N.; Courville, A. Estimating or Propagating Gradients through Stochastic Neurons for Conditional Computation. arXiv 2013, arXiv:1308.3432.
18. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538.
19. Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2021, arXiv:2101.03961.
20. Liang, H.; Fan, Z.; Sarkar, R.; Jiang, Z.; Chen, T.; Zou, K.; Cheng, Y.; Hao, C.; Wang, Z. M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-Task Learning with Model-Accelerator Co-Design. arXiv 2022, arXiv:2210.14793.
21. Lee, J.; Park, J.-H.; Hwang, J.-H.; Noh, K.; Choi, Y.; Suh, J. Artificial Neural Network for Glider Detection in a Marine Environment by Improving a CNN Vision Encoder. J. Mar. Sci. Eng. 2024, 12, 1106.
22. Luxonis OAK-D Pro PoE Camera. Available online: https://shop.luxonis.com/products/oak-d-pro-w-poe (accessed on 12 January 2026).
23. Blueprint Subsea Multibeam Oculus M750d Sonar. Available online: https://www.blueprintsubsea.com/oculus/oculus-m-series (accessed on 12 January 2026).
24. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202.
25. Pecheux, N.; Creuze, V.; Comby, F.; Tempier, O. Self Calibration of a Sonar–Vision System for Underwater Vehicles: A New Method and a Dataset. Sensors 2023, 23, 1700.
26. Kim, H.-G.; Seo, J.; Kim, S.M. Underwater Optical-Sonar Image Fusion Systems. Sensors 2022, 22, 8445.
27. Zhang, F.; Zhao, S.; Li, L.; Cao, C. Underwater DVL Optimization Network (UDON): A Learning-Based DVL Velocity Optimizing Method for Underwater Navigation. Drones 2025, 9, 56.
28. Dikkala, N.; Ghosh, N.; Meka, R.; Panigrahy, R.; Vyas, N.; Wang, X. On the Benefits of Learning to Route in Mixture-of-Experts Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 9376–9396.
29. Jiang, A.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088.
30. Lu, X.; Zhao, Y.; Qin, B. Improving the Downstream Performance of Mixture-of-Experts Transformers via Weak Vanilla Transformers. Electronics 2025, 14, 4256.
31. Guo, Y.; Tu, Z.; Cheng, Z.; Tang, X.; Lin, T. Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models. arXiv 2024, arXiv:2405.14297.

Figure 1. Example of an ROV equipped with a sonar sensor and multi-camera mounts for acquiring sensor data.

Figure 2. Preprocessing and annotation of underwater glider images: (a) original images with artificial markers, (b) marker-removed images used for training, and (c) visualization of estimated hull position and vehicle heading.

Figure 3. Underwater data acquisition under progressively challenging environmental conditions: (a) static water tank environment, (b) static environment with strong artificial illumination, (c) environment with induced surface waves, and (d) combined disturbances including illumination and wave effects.

Figure 4. Images of an underwater glider collected from the ocean environment.

Figure 5. Visualization of multibeam sonar data, showing the raw polar-coordinate sonar display (left) and the corresponding grayscale intensity map of acoustic reflections over a wide radial range (right).

Figure 6. Channel-wise stacking of RGB camera images and sonar intensity maps to construct unified multi-sensor training samples.

Figure 7. Architecture in which outputs from hierarchical CNN vision encoders applied to partitioned image patches are merged to form a composite feature vector.

Figure 8. Comparison of regression performance for z-axis coordinate prediction across different hierarchical depths of the CNN vision encoder: (a) Hierarchical depth 1, (b) Hierarchical depth 2, and (c) Hierarchical depth 3.

Figure 9. Regression error characteristic (REC) curves evaluating the spatial accuracy of the hierarchical CNN vision encoder at different depths: (a) Hierarchical depth 1, (b) Hierarchical depth 2, and (c) Hierarchical depth 3.

Figure 10. Comparison between a standard transformer encoder and the proposed variable mixture-of-experts (vMoE) transformer encoder.

Figure 11. Comparison of regression performance for z-axis coordinate prediction across different expert allocations: (a) Case 1: fixed number of experts, (b) Case 2: uniformly variable number of experts, and (c) Case 3: highly skewed variable number of experts.

Figure 12. Regression error characteristic (REC) curves evaluating the error distribution across different expert allocations: (a) Case 1: fixed number of experts, (b) Case 2: uniformly variable number of experts, and (c) Case 3: highly skewed variable number of experts.

Figure 13. Predicted versus ground-truth regression results for 3D relative position estimation under different sensor input configurations: (a) camera-only input, (b) camera and sonar fusion input.

Figure 14. Predicted versus ground-truth results for 3D relative position and heading estimation using the proposed framework.

Figure 15. Regression error characteristic (REC) curve for 3D relative position estimation using the proposed framework.

Figure 16. Validation results of underwater glider hull localization in an indoor water tank environment: (a) Ground-truth 3D center positions annotated in the validation set, (b) Corresponding 3D center positions estimated by the proposed model.

Figure 17. Test results of three-dimensional glider hull position estimation in a real sea environment.

Figure 18. The autonomous ROV-assisted glider recovery procedure uses GPS-based navigation and vision-based 3D hull localization.

Table 1. Configurations of expert allocation and routing strategies used in the ablation study.

Ablation Case	Number of Experts Per Encoder Block	Routing Top-k Per Encoder Block
Case 1 (Fixed number of experts)	[8, 8, 8, 8, 8, 8, 8, 8]	[2, 2, 2, 2, 2, 2, 2, 2]
Case 2 (Uniformly variable number of experts)	[16, 16, 8, 8, 4, 4, 4, 4]	[4, 4, 2, 2, 1, 1, 1, 1]
Case 3 (Highly skewed variable number of experts)	[32, 8, 8, 4, 4, 4, 2, 2]	[8, 2, 2, 1, 1, 1, 1, 1]
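
To make the per-block settings in Table 1 concrete, the following sketch encodes the three ablation cases and applies a generic top-k gate of the kind used in sparsely gated mixture-of-experts layers [18]. The expert counts and top-k values are copied from Table 1; the names `ABLATION_CASES` and `top_k_gate`, the random router scores, and the softmax over the selected experts are illustrative assumptions and are not claimed to match the routing implemented in the proposed vMoE transformer.

```python
import numpy as np

# Per-block (number of experts, routing top-k) copied from Table 1.
ABLATION_CASES = {
    "case1_fixed": ([8, 8, 8, 8, 8, 8, 8, 8], [2, 2, 2, 2, 2, 2, 2, 2]),
    "case2_uniform": ([16, 16, 8, 8, 4, 4, 4, 4], [4, 4, 2, 2, 1, 1, 1, 1]),
    "case3_skewed": ([32, 8, 8, 4, 4, 4, 2, 2], [8, 2, 2, 1, 1, 1, 1, 1]),
}

def top_k_gate(logits, k):
    """Select the k highest-scoring experts per token and normalize their weights.

    logits: (tokens, num_experts) router scores (hypothetical router output).
    Returns (indices, weights), each of shape (tokens, k).
    """
    logits = np.asarray(logits, dtype=float)
    idx = np.argsort(-logits, axis=-1)[:, :k]              # best k experts per token
    picked = np.take_along_axis(logits, idx, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(picked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights

# Route five dummy tokens through each encoder block of Case 3.
rng = np.random.default_rng(0)
experts, top_k = ABLATION_CASES["case3_skewed"]
for block, (n, k) in enumerate(zip(experts, top_k)):
    idx, weights = top_k_gate(rng.normal(size=(5, n)), k)
    print(f"block {block}: {n} experts, top-{k}, routing shape {idx.shape}")
```

Note that the expert counts in all three cases sum to 64 across the eight encoder blocks, so the ablation varies how a fixed expert budget is distributed over depth rather than the total number of experts.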

Share and Cite

MDPI and ACS Style

Lee, J.; Park, J.-H.; Hwang, J.-H.; Noh, K.; Suh, J. A Study of the Three-Dimensional Localization of an Underwater Glider Hull Using a Hierarchical Convolutional Neural Network Vision Encoder and a Variable Mixture-of-Experts Transformer. Remote Sens. 2026, 18, 793. https://doi.org/10.3390/rs18050793
