To clarify this work’s novel methodology, we emphasize that the proposed framework targets the markerless, three-dimensional relative localization of an underwater glider hull for perception-guided remotely operated vehicle interaction. This differs from conventional 2D underwater detection and from marker-based docking guidance. Current underwater perception pipelines typically either rely on single-modality vision in low-visibility conditions or employ multimodal fusion that assumes consistent cross-sensor correspondence. In addition, many underwater docking and approach systems use dedicated markers or structured docking interfaces to achieve stable guidance. These assumptions do not hold in practical glider recovery, because the target is an unstructured hull subject to rapid environmental disturbances and appearance degradation. Accordingly, the proposed methodology is guided by two practical considerations: (i) exploiting the complementary acoustic structure from the multibeam sonar without requiring strict pixel-level alignment between modalities, and (ii) efficiently allocating model capacity for real-time robotic operation by emphasizing representational diversity in early stages and semantic refinement in deeper stages.
3.1. Data Acquisition
Sensor data for underwater glider hull detection is collected using a remotely operated vehicle (ROV) equipped with multiple underwater cameras and a sonar sensor. The primary camera mounted on the ROV is an OAK-D Pro PoE camera manufactured by Luxonis in the USA [22]. As shown in Figure 1, two additional GoPro cameras are mounted on either side of the primary camera to enable parallel image acquisition from horizontal viewpoints on the ROV platform. The sonar sensor, which is installed on the lower section of the ROV, is a dual-frequency, multibeam Oculus M750d sonar manufactured by Blueprint Subsea in England [23]. This sensor provides wide-area acoustic perception to complement visual sensing.
The ROV is deployed in both an indoor water tank and an outdoor harbor pier. During indoor tank experiments, an overhead crane suspended from the ceiling of the water tank holds the glider in a floating position. While the glider remains stationary, the ROV approaches it from various angles and distances to acquire visual data.
A recovery ring is mounted under the center of the glider’s hull, and artificial visual markers are attached to the forward and aft sections of the lower hull to facilitate the accurate computation of the relative distance and displacement between the ROV and the glider during data acquisition. However, in practical operational scenarios, underwater gliders are generally not equipped with artificial markers, and the presence of a recovery ring depends on the mission’s specific configuration. To maintain consistency with realistic deployment conditions, image regions corresponding to the artificial markers and the recovery ring are removed from the captured images. This preprocessing strategy prevents the learning model from exploiting artificial visual cues and encourages robust hull detection based on the glider’s intrinsic visual characteristics.
Figure 2 shows the preprocessing procedure applied to a raw input image.
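The removal of marker and recovery-ring regions can be sketched as follows. This is a minimal illustration under stated assumptions: the regions are taken to be available as axis-aligned bounding boxes, and "removal" is taken to mean overwriting those pixels with a constant value. The function name `mask_regions`, the box format, and the fill value are hypothetical, and the actual procedure shown in Figure 2 may differ.

```python
import numpy as np

def mask_regions(image, boxes, fill_value=0):
    """Return a copy of `image` with each (x0, y0, x1, y1) box overwritten."""
    masked = image.copy()
    height, width = masked.shape[:2]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before writing.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        masked[y0:y1, x0:x1] = fill_value
    return masked

# Hypothetical frame and marker boxes, for illustration only.
frame = np.full((480, 640, 3), 128, dtype=np.uint8)
marker_boxes = [(100, 300, 160, 360), (600, 300, 700, 360)]
clean = mask_regions(frame, marker_boxes)
```

The input frame is left untouched, so the raw image remains available for computing the marker-based ground truth.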
Image data acquired in the water tank is collected under highly static conditions in which the glider’s motion is minimal and the illumination is relatively uniform. To simulate environmental disturbances encountered in real marine settings, such as intense sunlight and irregular surface waves, artificial lighting is directed toward the glider from multiple locations within the tank, and wave disturbances are generated by agitating the water’s surface with paddles. This experimental setup yields training and validation data that more closely resemble realistic ocean conditions, which in turn improves the robustness of the deep learning model.
Figure 3 illustrates the application of artificial lighting and wave disturbances.
During harbor experiments, the glider floats freely on the sea surface. When it drifts away from the pier, an operator on a nearby boat approaches and tows it back. In real sea environments, the field of view of the underwater camera is severely limited, and image quality degrades rapidly with distance. At distances exceeding approximately one to two meters, the artificial markers become indistinguishable. As a result, acquiring data of sufficient quality for effective model training in open-water conditions is highly challenging. Consequently, as shown in Figure 4, the data collected in real marine environments are used as a test set after the deep learning model has been trained on water tank data.
The water tank data comprise approximately 400,000 training samples and approximately 60,000 validation samples. The test set acquired in the ocean environment contains approximately 36,000 samples.
3.2. Construction of Multi-Sensor Training Data
In a preliminary evaluation, we investigated multiple training configurations to assess the impact of multi-camera image inputs on model performance. Specifically, we constructed training datasets by combining color images from the primary camera with images from up to four additional cameras mounted horizontally on the ROV. We then briefly trained a baseline model with these stacked images and compared the resulting performance across configurations. The results indicated that including additional camera images only marginally improved learning accuracy. This limited improvement is likely due to the constrained visibility range in underwater environments and the relatively small physical displacement between the cameras mounted on the ROV platform, both of which restrict the amount of complementary visual information captured. Based on these observations, rather than stacking images from multiple cameras to create a single training sample, we combined multi-camera data with data from other sensors to expand and diversify the training dataset.
Figure 5 illustrates the processing and visualization of data obtained from the multibeam sonar sensor mounted on the ROV. The left panel presents the native sonar visualization, in which acoustic measurements are displayed in a polar coordinate system that corresponds to the sonar’s fan-shaped field of view. This representation reflects the spatial distribution of sonar beams and their respective ranges, providing an understanding of the location of obstacles in relation to the sensor.
The right panel shows a transformed representation of the same sonar measurements, in which the intensity of acoustic echoes reflected from underwater objects across a wide radial range is converted into a grayscale image. Pixel intensity corresponds to the strength of the returned sonar signal, so higher intensities indicate stronger reflections. This image-based formulation facilitates integration with a vision-based perception layer.
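The conversion from echo intensity to a grayscale image can be illustrated with a short sketch. It assumes the sonar returns are available as a two-dimensional array indexed by range bin and beam, and it uses min-max scaling to 8 bits; both the array layout and the scaling rule are assumptions made here for illustration, and the function name `sonar_to_grayscale` is hypothetical.

```python
import numpy as np

def sonar_to_grayscale(intensity):
    """Min-max scale a (range_bins, beams) echo-intensity array to uint8."""
    intensity = np.asarray(intensity, dtype=np.float64)
    lo, hi = intensity.min(), intensity.max()
    if hi == lo:
        # A constant map carries no contrast; return an all-black image.
        return np.zeros(intensity.shape, dtype=np.uint8)
    scaled = (intensity - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)

# Toy echo map: stronger returns become brighter pixels.
echo = np.array([[0.0, 2.0], [4.0, 8.0]])
gray = sonar_to_grayscale(echo)
```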
Figure 6 illustrates the procedure for creating multi-sensor training samples by combining visual and acoustic data. The upper-left panel shows a color image captured by the ROV-mounted underwater camera. Although this camera provides high-resolution visual information, its capabilities are limited by underwater visibility conditions. The lower-left panel depicts the intensity map obtained from the multibeam sonar sensor. This map shows the strength of acoustic reflections from underwater objects over a wide radial range.
To leverage the distinct properties of these two sensing modalities, the color image is decomposed into red, green, and blue channels, while the sonar measurement is represented as a single-channel intensity image. As shown on the right side of Figure 6, these four channels are stacked along the channel dimension to create a multi-channel input tensor. This channel-wise stacking strategy allows the learning model to simultaneously process fine-grained visual features from the camera and long-range structural information from the sonar.
In the current experimental setup, sensor synchronization is performed at the software level using the ROV’s onboard shared system clock. Camera images and multibeam sonar intensity maps are recorded using a unified timestamp system, and frame pairing is conducted via nearest-timestamp alignment. Because the platform moves relatively slowly, temporal discrepancies within the synchronization tolerance introduce negligible spatial displacement.
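Nearest-timestamp pairing of this kind can be sketched as below. The sketch assumes both timestamp lists are sorted and expressed in the same unit on the shared clock; the function name `pair_nearest` and the tolerance value in the example are hypothetical, since the synchronization tolerance is not stated here.

```python
from bisect import bisect_left

def pair_nearest(camera_ts, sonar_ts, tolerance):
    """Pair each camera timestamp with the nearest sonar timestamp.

    Both lists must be sorted in ascending order. Pairs whose time gap
    exceeds `tolerance` are dropped. Returns (camera_index, sonar_index)
    tuples.
    """
    pairs = []
    for cam_idx, t in enumerate(camera_ts):
        pos = bisect_left(sonar_ts, t)
        # The nearest sonar frame is one of the two neighbours of `pos`.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sonar_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda i: abs(sonar_ts[i] - t))
        if abs(sonar_ts[best] - t) <= tolerance:
            pairs.append((cam_idx, best))
    return pairs

# Illustrative timestamps in seconds.
camera_ts = [0.00, 0.10, 0.20, 0.30]
sonar_ts = [0.02, 0.13, 0.31, 0.80]
pairs = pair_nearest(camera_ts, sonar_ts, tolerance=0.05)
```

In this example the camera frame at 0.20 s has no sonar frame within the tolerance and is therefore left unpaired.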
Note that the camera and the multibeam sonar observe different spatial domains and employ different projection models. The camera uses perspective projection, while the sonar uses polar acoustic sampling. In this study, the sonar’s intensity data is reconstructed in polar coordinates and resized to match the spatial resolution of the camera image. Then, the reconstructed sonar map is concatenated with the RGB channels to create a four-channel input tensor.
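The construction of the four-channel input can be sketched as follows. Nearest-neighbour interpolation is an assumption made to keep the example dependency-free; the interpolation method actually used for resizing is not specified, and the function names are hypothetical.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array to (out_h, out_w)."""
    in_h, in_w = image.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols[None, :]]

def build_four_channel_input(rgb, sonar_gray):
    """Stack an (H, W, 3) RGB image and a sonar map into (H, W, 4)."""
    h, w = rgb.shape[:2]
    sonar_resized = resize_nearest(sonar_gray, h, w)
    # The sonar map becomes the fourth channel; no geometric reprojection.
    return np.concatenate([rgb, sonar_resized[..., None]], axis=-1)

# Illustrative sizes only.
rgb = np.zeros((240, 320, 3), dtype=np.uint8)
sonar = np.full((128, 256), 200, dtype=np.uint8)
fused = build_four_channel_input(rgb, sonar)
```

The resize only equalizes array shapes so that the channels can be stacked; it does not register sonar pixels to camera pixels.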
However, strict pixel-level geometric alignment between modalities is not assumed. The proposed transformer-based architecture does not interpret the four-channel input as a conventional color image. Instead, each channel is treated as an independent feature source. Through attention mechanisms, the model learns cross-modal relationships, enabling it to capture correlations between sonar structural cues, visual features, and 3D localization outputs without enforcing explicit geometric reprojection between sensing domains.
In practical recovery scenarios, the measurements obtained from cameras and multibeam sonar are affected by platform motion and significant latency. Additionally, the two sensors have different sampling and projection characteristics. Under these conditions, enforcing frame-wise or pixel-wise correspondence can introduce systematic bias when the apparent spatial relationship between the two modalities fluctuates. Therefore, although the sonar intensity map is resized to match the camera resolution when constructing the four-channel input, the proposed early fusion method does not rely on geometric reprojection. Rather, each channel is treated as a heterogeneous feature source, and cross-modal associations are implicitly learned through attention. This enables the model to use sonar-derived structural cues for enhanced depth-sensitive localization without making rigid alignment assumptions.
3.3. Hierarchical CNN-Based Vision Encoder for Multi-Level Feature Representation
In prior studies [15], input images were divided into patches using various horizontal and vertical partitioning schemes, and a CNN-based vision encoder was applied independently to each patch to perform flat feature extraction. This approach captures local visual information, but it is limited in its ability to represent multi-level semantic characteristics within each patch. To address this limitation, the present study introduces a hierarchical convolutional neural network (CNN)-based vision encoder that improves feature representation at the patch level. The hierarchical CNN encoder aggregates multi-level semantics within local regions to improve the quality of the patch embeddings provided to the transformer. Unlike purely flat feature extraction, the resulting patch embeddings preserve both fine-grained visual cues and higher-level context. This is advantageous in underwater environments, where local textures are weak and appearances change frequently.
Figure 7 illustrates how the input image is divided based on a predefined grid configuration. Each resulting patch is then processed by a convolutional neural network (CNN) vision encoder to extract two-dimensional feature representations. These initial feature vectors are successively refined through a hierarchy of CNN encoders with different kernel sizes. The encoders are applied in an overlapping, multi-stage process. Finally, the proposed encoder combines feature representations from all hierarchical levels to create a composite feature vector that captures both fine-grained local details and higher-level semantic information. This hierarchical feature extraction strategy produces more robust and informative visual representations, thereby improving perception tasks such as glider hull localization and pose estimation.
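The grid partitioning step can be illustrated with a short sketch. It assumes the image height and width are divisible by the grid dimensions and that patches are ordered row by row; the function name `split_into_patches` and the grid size in the example are hypothetical.

```python
import numpy as np

def split_into_patches(image, grid_rows, grid_cols):
    """Split an (H, W, C) image into grid_rows * grid_cols equal patches.

    Returns an array of shape (grid_rows * grid_cols, H // grid_rows,
    W // grid_cols, C), ordered row by row.
    """
    h, w, c = image.shape
    if h % grid_rows or w % grid_cols:
        raise ValueError("image size must be divisible by the grid")
    ph, pw = h // grid_rows, w // grid_cols
    # Expose the grid axes, then bring them to the front.
    patches = image.reshape(grid_rows, ph, grid_cols, pw, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_rows * grid_cols, ph, pw, c)

# Small four-channel example so the result is easy to inspect.
image = np.arange(4 * 6 * 4).reshape(4, 6, 4)
patches = split_into_patches(image, grid_rows=2, grid_cols=3)
```

Each resulting patch would then be passed through the hierarchical encoders described above.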
Let the input be an underwater RGB image captured by the ROV-mounted camera. The image is first partitioned into a fixed grid of non-overlapping patches. For each patch, a hierarchical CNN vision encoder is applied to extract multi-level feature representations, with a separate CNN encoder at each hierarchy level that employs its own kernel size and receptive field. The hierarchical features are computed level by level: the first-level encoder operates on the patch, and each subsequent level refines the output of the preceding one. The final patch-level feature vector is obtained by concatenating the features from all hierarchical levels. All patch feature vectors are then stacked to form the input token sequence for the transformer.
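One way to write this formulation explicitly is given below. The notation is illustrative and chosen here rather than taken from the original equations: $I$ is the input image with $C$ channels, $p_i$ is the $i$-th of $N$ patches, $f_\ell$ is the CNN encoder at hierarchy level $\ell$, $L$ is the hierarchical depth, $h_i^{(\ell)}$ is the level-$\ell$ feature of patch $i$, $z_i$ is the patch-level feature vector, and $Z$ is the token sequence. The successive-refinement form of the second line follows the description of the encoders as a multi-stage hierarchy and should be read as an assumption.

```latex
\begin{align}
I &\in \mathbb{R}^{H \times W \times C}, \qquad \{p_i\}_{i=1}^{N} = \mathrm{Partition}(I), \\
h_i^{(1)} &= f_1(p_i), \qquad h_i^{(\ell)} = f_\ell\!\left(h_i^{(\ell-1)}\right), \quad \ell = 2, \dots, L, \\
z_i &= \mathrm{Concat}\!\left(h_i^{(1)}, h_i^{(2)}, \dots, h_i^{(L)}\right), \\
Z &= \left[z_1; z_2; \dots; z_N\right].
\end{align}
```

With $L = 1$ the concatenation reduces to a single feature per patch, which corresponds to flat feature extraction.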
To investigate the effect of hierarchical depth in the proposed CNN vision encoder, we conducted an ablation study by varying the number of hierarchical levels used for feature extraction. Specifically, we evaluated encoder configurations of several depths, where a depth of 1 corresponds to a conventional flat CNN encoder without hierarchical feature aggregation, and larger depths progressively incorporate additional convolutional stages with different kernel sizes and spatial receptive fields. For each depth configuration, the model was trained for 500 epochs with a batch size of 32 on an RTX A5000 GPU manufactured by NVIDIA in the USA.
Figure 8 shows scatter plots of the predicted values against the actual values at each hierarchy depth. Quantitative metrics, namely the mean absolute error (MAE) and the coefficient of determination ($R^2$), indicate a significant improvement in performance with increasing hierarchy depth. The correlation is poor at depth 1, improves substantially at depth 2, and is best at depth 3. These results suggest that deeper hierarchical representations are essential for capturing the spatial dependencies necessary for precise 3D coordinate estimation.
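For reference, the two metrics reported in these plots can be computed as in the sketch below, which uses their standard definitions. The sample values are made up for illustration and are not taken from the experiments.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination; requires non-constant y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```

An $R^2$ close to 1 means the predictions explain most of the variance in the ground truth, whereas the MAE expresses the typical error in the units of the target.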
The regression error characteristic (REC) curves in Figure 9 illustrate the relationship between cumulative accuracy and allowable error tolerance. The error threshold needed to reach 90% accuracy decreases progressively from 0.218 at depth 1 to 0.185 at depth 2 and to 0.162 at depth 3, and more than 50% of samples at depth 3 fall within an error margin of 0.033. These results suggest that deeper hierarchical features reduce fine-grained spatial errors.
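An REC curve and the tolerance needed for a target accuracy can be computed from the absolute errors as sketched below. The function names and the error values are hypothetical and serve only to show the calculation.

```python
import math

def rec_accuracy(errors, tolerance):
    """Fraction of absolute errors that fall within `tolerance`."""
    return sum(e <= tolerance for e in errors) / len(errors)

def tolerance_for_accuracy(errors, target):
    """Smallest tolerance at which at least `target` of samples are covered."""
    ordered = sorted(errors)
    k = max(1, math.ceil(target * len(ordered)))
    return ordered[k - 1]

# Illustrative absolute errors, not experimental data.
abs_errors = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.40]
t90 = tolerance_for_accuracy(abs_errors, 0.9)
t50 = tolerance_for_accuracy(abs_errors, 0.5)
```

Evaluating `rec_accuracy` over a range of tolerances traces the curve itself, and a smaller 90% threshold corresponds to a curve that rises faster.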
The experimental results demonstrate that increasing the hierarchical depth from 1 to 3 improves the accuracy of glider hull localization and heading estimation. This improvement is due to the encoder’s ability to capture both fine-grained local details and higher-level semantic context through multi-level feature aggregation. Shallow configurations tend to focus on local texture and edge information, which is often unreliable in underwater environments due to variations in illumination and turbidity. In contrast, deeper hierarchical encoders provide more robust representations by integrating contextual and structural cues across multiple scales.
However, increasing the depth further only marginally improves performance while incurring additional computational cost and a higher risk of feature redundancy. These results point to a trade-off between representational richness and efficiency. Based on these findings, the study selects a moderate hierarchical depth as the default configuration, offering a favorable balance between perception accuracy and computational complexity.
3.4. Transformer with Variable Mixture-of-Experts for Efficient Inference in 3D Position Estimation
A transformer processes the features that the CNN-based vision encoders extract from image patches of varying sizes and models the relationships among these features. This allows information to be integrated and refined into a more expressive representation of the extracted features.
The variable mixture-of-experts (vMoE) encoder replaces the conventional feed-forward network (FFN) with a dynamically routed mixture of a varying number of experts while preserving the multi-head attention and residual normalization structure of a transformer. A gating mechanism with sparse selection activates a subset of experts at each layer, enabling efficient and scalable representation learning.
Figure 10 illustrates a comparison between a standard transformer encoder and the proposed vMoE transformer encoder. The conventional transformer encoder architecture is depicted on the left side. Each encoder block consists of a multi-head self-attention (MHA) layer followed by a feed-forward network (FFN). Residual connections and layer normalization are applied after both the MHA and FFN sublayers to stabilize training and facilitate gradient flow. In the standard transformer, the FFN is a dense module that processes all input tokens with a fixed set of parameters in each layer.
The proposed vMoE transformer encoder is shown on the right side. It replaces the FFN sublayer with a variable mixture-of-experts module while preserving the overall transformer structure. This module consists of multiple feed-forward networks, or “experts,” whose total number can differ from layer to layer. A gating network controls the selection and combination of the experts. In early encoder layers, a larger number of low-dimensional experts may be employed to capture diverse low-level patterns, whereas deeper layers may utilize fewer but higher-capacity experts to model more abstract semantic information. The final encoder output is produced after stacking all vMoE-adapted encoder blocks.
Underwater imagery often exhibits contrast loss and sparsity. These issues can cause early representations to become less discriminative across viewpoints and distances. The vMoE encoder allocates more experts to the early stages to encourage diverse low-level feature processing. This provides more stable cues for subsequent fusion and regression. Deeper stages can then operate on richer, more consistent representations to refine semantic understanding while maintaining computational efficiency through sparse expert activation.
To improve computational efficiency, a sparse selection mechanism activates only the experts with the highest weights, rather than all experts. This selective activation enables conditional computation, which allows the model to increase its representational capacity without proportionally increasing computation.
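The sparse selection mechanism can be illustrated with a minimal NumPy sketch of top-k routing. It is a generic mixture-of-experts forward pass, not the implementation used here: the softmax gate, the renormalization of the selected weights, the two-layer ReLU experts, and all sizes are assumptions, and the residual connection and layer normalization are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_moe(tokens, gate_w, experts, top_k):
    """Route each token to its top_k experts and mix their outputs.

    tokens:  (n_tokens, d_model)
    gate_w:  (d_model, n_experts) gating weights
    experts: list of (w_in, w_out) pairs, one two-layer ReLU FFN per expert
    """
    gate_probs = softmax(tokens @ gate_w)                  # (n_tokens, n_experts)
    top_idx = np.argsort(-gate_probs, axis=-1)[:, :top_k]  # highest-weight experts
    output = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        weights = gate_probs[t, top_idx[t]]
        weights = weights / weights.sum()  # renormalise over the selected experts
        for w, e in zip(weights, top_idx[t]):
            w_in, w_out = experts[e]
            hidden = np.maximum(tokens[t] @ w_in, 0.0)
            output[t] += w * (hidden @ w_out)
    return output, top_idx

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 4, 6
tokens = rng.normal(size=(5, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
           for _ in range(n_experts)]
out, chosen = sparse_moe(tokens, gate_w, experts, top_k=2)
```

Only two of the six experts are evaluated per token, so the expert computation scales with `top_k` rather than with the number of experts.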
We conducted an ablation study to analyze the impact of expert allocation on model performance and efficiency. For this study, we assigned different numbers of experts to each vMoE transformer encoder block. Specifically, we evaluated configurations in which the number of experts varied across encoder blocks while adjusting the dimensionality of each expert to maintain an approximately constant overall computational resource.
Table 1 summarizes the three expert allocation strategies that were evaluated in the ablation study. According to the rule for varying the number of experts, more experts are assigned to the early encoder blocks that receive output features from the CNN vision encoder. Case 1 uses a constant number of experts and routing parameters throughout all transformer encoder blocks. Case 2 gradually reduces the number of experts and routing capacity toward deeper layers uniformly. Case 3 uses a highly skewed allocation, assigning many experts to the early encoder blocks and progressively fewer to the later blocks. This emphasizes rich feature diversification in the early stages and refined representation learning in the deeper layers. The base model has eight encoder blocks, and each configuration was trained for 200 epochs with a batch size of 32 using an NVIDIA RTX A5000 GPU.
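The idea of trading the number of experts against their dimensionality can be sketched as follows. The sketch assumes the quantity held roughly constant is the product of the number of experts and the hidden width of each expert; the allocations and the budget below are hypothetical and are not the values listed in Table 1.

```python
def expert_hidden_dims(experts_per_block, hidden_budget):
    """Per-expert hidden width so that n_experts * width ~ hidden_budget."""
    return [max(1, hidden_budget // n) for n in experts_per_block]

# Hypothetical allocations for eight encoder blocks (not the values in Table 1).
constant = [4] * 8
skewed = [16, 16, 8, 8, 4, 4, 2, 2]

budget = 2048
constant_dims = expert_hidden_dims(constant, budget)
skewed_dims = expert_hidden_dims(skewed, budget)
```

Under this assumption, the skewed allocation yields many narrow experts in the early blocks and a few wide experts in the deep blocks while keeping the total expert width per block unchanged.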
We analyzed the computational complexity of the vMoE architecture using a single forward pass, a batch size of one, and a fixed input resolution. We computed floating-point operations (FLOPs) and measured inference speed (FPS) on the deployed hardware platform after a warm-up stage.
The total computational cost of each transformer layer consists of the self-attention operations and the feed-forward (expert) computations. In the mixture-of-experts design, the feed-forward network is replaced by multiple experts; however, only the top-k routed experts are activated during inference.
Although more experts are allocated to the early layers to promote feature diversification, the actual inference cost is bounded by the routing parameter. This design enhances representational diversity in shallow layers without proportionally increasing computational complexity.
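A rough operation count makes this bound concrete. The sketch below uses common approximations (two FLOPs per multiply-accumulate, softmax and normalization ignored, two linear layers per expert) and hypothetical sizes; it is an estimate for illustration, not the procedure used to obtain the reported FLOPs.

```python
def attention_flops(n_tokens, d_model):
    """Approximate FLOPs of one self-attention sublayer (2 FLOPs per MAC)."""
    projections = 4 * 2 * n_tokens * d_model * d_model  # Q, K, V, output
    scores = 2 * n_tokens * n_tokens * d_model          # Q @ K^T
    mixing = 2 * n_tokens * n_tokens * d_model          # attention @ V
    return projections + scores + mixing

def moe_flops(n_tokens, d_model, d_expert, n_experts, top_k):
    """Approximate FLOPs of a sparse MoE sublayer with top_k active experts."""
    gating = 2 * n_tokens * d_model * n_experts
    per_expert = 2 * 2 * n_tokens * d_model * d_expert  # two linear layers
    return gating + top_k * per_expert

# Hypothetical sizes: same expert width and top_k, different expert counts.
n_tokens, d_model, d_expert, top_k = 16, 256, 128, 2
few = moe_flops(n_tokens, d_model, d_expert, n_experts=4, top_k=top_k)
many = moe_flops(n_tokens, d_model, d_expert, n_experts=16, top_k=top_k)
attn = attention_flops(n_tokens, d_model)
```

In this estimate, quadrupling the number of experts changes only the small gating term, while the expert term depends on `top_k` and the expert width.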
We analyze the number of parameters, floating-point operations (FLOPs), and inference speed under a single forward pass with a batch size of 1 to evaluate the computational efficiency of the proposed architecture. The model contains 55.86 M parameters and requires 0.860 GFLOPs per inference at the configured input resolution. The FLOPs were computed based on the network’s full forward propagation, including the vMoE layers under top-k expert routing. Inference speed was measured in an Intel i5 notebook environment using the test dataset and achieved approximately 11 frames per second (FPS) in evaluation mode. These results suggest that the early-heavy vMoE architecture increases representational capacity while maintaining practical, real-time feasibility for embedded, underwater inspection scenarios.
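A single-sample throughput measurement with a warm-up stage can be sketched as follows. The `dummy_model` function is a stand-in workload and the warm-up and run counts are arbitrary choices; in practice the callable would be the model's forward pass in evaluation mode.

```python
import time

def measure_fps(infer, sample, warmup=5, runs=20):
    """Average frames per second of `infer(sample)` after warm-up calls."""
    for _ in range(warmup):
        infer(sample)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed

def dummy_model(x):
    # Stand-in workload; replace with the real single-sample forward pass.
    return sum(v * v for v in x)

fps = measure_fps(dummy_model, list(range(10_000)))
```

Averaging over several timed runs after the warm-up reduces the influence of one-off costs such as memory allocation and caching.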
Figure 11 and Figure 12 show the inference results of the three-dimensional position regression models on the validation set for each case.
Figure 11 compares the predicted and ground-truth values along the z-axis, which represents the distance between the ROV and the glider.
The quantitative metrics in the upper-left corner of the plots indicate that Case 3 has the highest coefficient of determination ($R^2$), followed by Case 1 and then Case 2. Additionally, Case 3 exhibits the lowest mean absolute error (MAE), at 0.07. These results demonstrate that the proposed expert allocation strategy with a biased distribution of experts significantly improves prediction accuracy.
Figure 12 shows the regression error characteristic (REC) curves, which represent the proportion of predictions that fall within a given error tolerance. A curve that rises more steeply toward the upper-left corner indicates greater model robustness. Comparing the error thresholds required to reach 90% accuracy, Case 3 has the narrowest tolerance at 0.174, outperforming Cases 1 and 2 (0.203 and 0.186, respectively) and indicating more precise predictions. The curve for Case 3 also rises particularly steeply in the low-error region, which is favorable for achieving high accuracy at small tolerance levels.
The results demonstrate that increasing the variability of the number of experts results in consistent improvements in localization accuracy. A variable expert configuration with more low-dimensional experts in the early layers and fewer high-capacity experts in the deeper layers outperforms a fixed expert design.
3.5. Ablation Study on Multi-Sensor Fusion of Camera and Sonar for Robust Underwater Glider Perception
An ablation analysis was conducted to investigate the effect of merging color images acquired from the ROV-mounted camera with intensity-map images derived from multibeam sonar data.
Figure 13 shows a quantitative comparison of the proposed 3D relative position regression model with two different training configurations. It illustrates the effect of incorporating sonar measurements.
Figure 13a shows the results when the model is trained with only camera-derived inputs, i.e., with the sonar intensity-map channel excluded from the dataset. In this camera-only configuration, the predicted values are dispersed around the ideal line, especially along the z-axis. This indicates that depth observability from monocular imagery is limited in visually degraded underwater conditions. This result is consistent with the intrinsic ambiguity of monocular depth estimation, which becomes more pronounced when contrast, texture, and visibility are reduced. The relatively low $R^2$ values across the axes suggest that the model explains only a small amount of the variance in the ground-truth position when relying solely on visual information.
Figure 13b, in contrast, shows the results obtained when the sonar intensity-map channel is included during training to create an early-fusion, multi-sensor input. Compared to the camera-only case, the scatter distributions are more concentrated around the ideal prediction line and the regression metrics are improved. Including sonar information yields the most significant benefit for the z-axis, where the prediction trend aligns more closely with the ideal line and the reported $R^2$ value increases, indicating improved depth consistency. This improvement is expected because sonar provides range-sensitive acoustic responses over a wider radial field, which complements the limited depth cues available from monocular images. Overall, comparing (a) and (b) demonstrates that integrating sonar intensity maps improves the reliability of 3D localization, especially along the depth-related axis.