Article

TFDF-YOLO: A Position Detection Model for Underwater Wireless Power Transfer Docking

1 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
2 Department of Yantai Research Institute, Harbin Engineering University, Yantai 264000, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(5), 429; https://doi.org/10.3390/jmse14050429
Submission received: 25 January 2026 / Revised: 19 February 2026 / Accepted: 23 February 2026 / Published: 26 February 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater wireless power transfer (UWPT) technology can improve the endurance of unmanned underwater vehicles (UUVs). The stability and efficiency of UWPT depend on the success rate of UUV docking. A novel detection model, TFDF-YOLO, is proposed for dynamic position identification during UUV docking. First, a spatial–frequency decoupling (SFD) module is proposed that uses Fourier-based degradation cues to guide Top-K proxy attention and boost blurred-edge extraction capability. A relevance-difference fusion (RD-Fusion) strategy, enhanced by a global channel attention mechanism, is then developed to realize multi-scale feature recognition. Furthermore, a new adaptive loss function (U-CIoU) is designed to suppress illumination bias and anchor inflation. Results on a multi-source dataset demonstrate that the proposed model achieves 91.5% precision and 92.7% mAP@0.5. This work could enhance the success rate and reliability of UWPT and shows potential for broader underwater applications, including deep-sea docking and multi-AUV cooperative systems.

1. Introduction

Unmanned underwater vehicles (UUVs) are widely used in marine exploration, engineering operations, and security [1]. UUVs have to be frequently salvaged due to limited battery capacity. In recent years, underwater wireless power transfer (UWPT) technology has been developed to enhance the endurance of UUVs [2]. UWPT is categorized into moored and non-moored types according to the UUV docking mode. The reliability and cost-effectiveness of the moored UWPT technology are higher than those of the non-moored UWPT [3]. Accurate detection of the relative position and orientation between a docking station and a UUV is critical for the moored UWPT system, as shown in Figure 1 [4].
Existing docking detection methods include electromagnetic sensing, optical sensing, multi-modal fusion, and visual detection [5]. Electromagnetic and optical sensing are sensitive to magnetic interference, ocean currents, and light scattering [6,7]. Multi-modal fusion improves robustness but increases system complexity and cost [8]. Visual detection offers a cost-effective and flexible solution for underwater localization [9]. The single-perspective UUV vision has a limited view and docking range, reducing docking reliability. This work proposes a bilateral visual detection strategy for rotatable UWPT platforms to improve detection robustness.
Underwater visual detection is divided into model-based methods and deep learning methods. Model-based methods use prior knowledge and matching rules to achieve target detection. The Spanish Underwater Robotics Center employed active optical beacons to match feature points between the docking station and the UUV [10]. Yan et al. [11] proposed a visual positioning algorithm based on an L-shaped optical array to achieve docking at multiple positions. The number of guide lights was often misjudged in the underwater environment, which led to low recognition accuracy. Zhong et al. [12] proposed an adaptively weighted Otsu's thresholding (OTSU) feature extraction method that adjusts sub-region weights according to local image contrast. Model-based methods depend on expert experience, and their accuracy degrades significantly in complex underwater environments. Convolutional Neural Networks (CNNs) and You Only Look Once (YOLO) models have been widely adopted in target recognition [13]. Deep learning detectors are divided into two-stage and one-stage algorithms by framework architecture [14]. Two-stage algorithms generate candidate regions before object classification, such as the Region-based Convolutional Neural Network (RCNN) [15], Fast-RCNN [16], Faster-RCNN [17], MASK-RCNN [18], and Cascade-RCNN [19]. Two-stage algorithms achieve high detection accuracy but have complex structures.
In contrast, one-stage algorithms like a Single-Shot Detector (SSD) [20] and YOLO [21] directly predict image bounding boxes and class probabilities. They are suitable for online underwater target detection. The SSD is not applicable to moving target recognition [22]. YOLO-series algorithms demonstrate substantial potential for underwater target detection [23]. Lei et al. [24] embedded a Swin Transformer into YOLOv5 to address underwater blurring. Global feature recognition was enhanced by the path aggregation network (PANet) multi-scale fusion. This model achieved 87.2% Mean Average Precision (mAP) on the Underwater Robot Professional Competition (URPC) dataset. To improve computing speed, Yan et al. [25] augmented YOLOv7 with the Convolutional Block Attention Module (CBAM) and the Cross-Stage Partial Fast Spatial Pyramid Pooling (SPPFCSPC) module. The model achieved 80.62% mAP on the URPC dataset with 64.21 FPS detection speed. The CBAM attention only amplifies existing salient features. The challenge of detecting degraded, blurry targets underwater persists.
Chang et al. [26] solved the small-target feature loss problem by YOLOv5s. A Space to Depth (SPD) Conv module was proposed to eliminate feature loss during pooling. This model achieved 80.17% mAP, while the missed-detection rate for small targets was 23% because of an equally weighted feature fusion strategy. Zheng et al. [27] proposed an RG-YOLO model and a Gather–Distribute Feature Pyramid Network (GDFPN) with a global gathering and local distribution linkage mechanism. The model attained 85.1% mAP@50, while the small-target recall rate was only 69.3%. Wang et al. [28] proposed a lightweight model to identify multi-scale features. This approach is prone to dimension mismatch in multi-feature fusion tasks. The underwater multi-scale target detection problem remains unresolved. It is crucial to distinguish shallow spatial information of small objects and deep semantic information of large objects.
Anchor box optimization is important for underwater object detection. Han et al. [29] mitigated regression bias in low-light conditions by applying the Minimum Point Distance Intersection over Union (MPDIoU) loss to YOLOv8n. The model achieved 68.8% mAP@50 on the Exclusive Dark dataset. Its anchor boxes tend to shift toward high-illumination areas, and underwater background light is not considered. Gao et al. [30] proposed a PE-Transformer model to mitigate background redundancy, in which anchor boxes were replaced by adaptive points. Zhang et al. [31] improved regression accuracy by refining the aspect ratio penalty in the Complete Intersection over Union (CIoU) loss. The deviation exceeds 10% in turbid water because the fixed weight is not applicable to underwater conditions. An anti-interference anchor regression mechanism is urgently required to address uneven lighting, redundant background light, and water turbidity.
Unlike generic underwater detection tasks, UWPT docking requires simultaneous localization of two key targets: large UUV structures and small luminous guidance markers. These small markers directly determine docking alignment. Standard YOLO models are primarily optimized for spatial intensity discrimination and scale-balanced natural scenes. In UWPT docking, small structured light sources coexist with scattering noise, motion blur, and illumination bias. This leads to feature imbalance between global contours and fine-scale markers, as well as regression drift toward high-intensity background regions. In summary, object detection of UWPT faces the following challenges: (1) Degradation of underwater image edge features. The edge feature degradation of docking station contours is caused by severe light scattering in turbid water and exacerbated by UUV motion-induced blurring. (2) Imbalanced multi-scale features. The imbalance of small and large target features causes high miss-detection rates and poor scale adaptation. UUV motion and occlusions intensify the conflict between local and global features. (3) Anchor box regression bias. Illumination bias causes a drift toward high-light areas due to uncorrected gradients. Background redundancy triggers box over-expansion in blurred regions.
To address these challenges, a novel TFDF-YOLO model is proposed for UWPT docking detection. The contributions are as follows:
  • A spatial–frequency decoupling module (SFD) is proposed by using Fourier-based degradation cues to guide Top-K proxy attention to boost blurred edge extraction capability. Proxy tokens and Top-K sparsity reduce the complexity of global relation modeling.
  • A relevance-difference fusion (RD-Fusion) mechanism is proposed to process imbalanced multi-scale features. A global channel attention mechanism with dynamic weighting is improved to raise the recognition rate of small objects.
  • A new U-CIoU loss function and an illumination-background adaptive weight strategy are specially designed for underwater target recognition.
  • The proposed framework integrates lightweight modules and efficient feature fusion mechanisms for underwater docking detection. Its effectiveness and computational efficiency are evaluated on a multi-source underwater dataset.
The remainder of this paper is organized as follows: Section 2 introduces the proposed model architecture and principles, Section 3 describes the multi-source dataset, Section 4 presents extensive experimental validation, and Section 5 provides the conclusion.

2. Methodology

A novel TFDF-YOLO model is proposed for identifying UUVs and docking stations during UWPT docking. The baseline model is YOLOv11s. YOLOv11 uses C3k2 and C2PSA modules to improve spatial awareness. Depthwise separable convolutions (DS Conv) are integrated into the classification head to reduce model size. The architecture of the YOLOv11 model is shown in Figure 2.
Based on the baseline network, the overall framework of the proposed TFDF-YOLO is illustrated in Figure 3. This section details the YOLOv11s framework, the proposed SFD module, RD-Fusion strategy, and the U-CIoU loss function.

2.1. Spatial–Frequency Decoupling (SFD) Module

The SFD module consists of an edge-enhancement preprocessing layer, a local feature extraction module, a global relation modeling module, and a fusion layer. The edge-enhancement layer separates edges from background regions. The local branch then employs multi-scale depthwise convolutions (DW Conv) to capture edge details. The global branch applies Fourier-based degradation cues as guidance signals to control the grouped sparsity ratios of Top-K proxy attention, and builds a dual-stage “aggregation–broadcast refinement” attention structure to jointly model long-range dependencies and local details. The structure of the SFD module is illustrated in Figure 4.
Underwater docking relies on contour alignment and guide-light localization. Both cues are edge-dominated structures. Edge information must therefore be preserved. For a 2D image f(x, y), the Fourier transform is
F(u, v) = \iint f(x, y) \, e^{-j 2\pi (ux + vy)} \, dx \, dy    (1)
Low-frequency components correspond to smooth background regions. High-frequency components correspond to rapid spatial variations. Edge structures can be described by image gradients:
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)    (2)
Taking the Fourier transform of the derivatives yields:
\mathcal{F}\!\left[ \frac{\partial f}{\partial x} \right] = 2\pi i u \, F(u, v), \qquad \mathcal{F}\!\left[ \frac{\partial f}{\partial y} \right] = 2\pi i v \, F(u, v)    (3)
The derivative scales the spectrum by u and v. Higher frequencies receive larger amplification. Edge structures therefore correspond to dominant high-frequency responses. Underwater blur can be modeled as
f_b = f * k \;\;\Rightarrow\;\; F_b(u, v) = F(u, v) \cdot K(u, v)    (4)
The kernel K(u, v) acts as a low-pass filter. High-frequency components are attenuated more strongly. Since edges are high-frequency structures, blur suppresses edge energy first. The high-frequency energy ratio reflects structural edge preservation and provides a physically interpretable degradation indicator.
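As a concrete illustration of this indicator, the following sketch (an illustration only, not the authors' implementation; the box-blur kernel and the toy input are assumptions) computes the high-frequency energy ratio of a patch with a radial cutoff at half the Nyquist frequency and shows that blurring lowers the ratio:

```python
import torch
import torch.nn.functional as F

def high_freq_energy_ratio(patch: torch.Tensor, tau_ratio: float = 0.5) -> torch.Tensor:
    """Ratio of spectral energy above a radial threshold; patch has shape (C, H, W).

    tau_ratio is the radial cutoff as a fraction of the Nyquist frequency,
    matching the tau = 0.5 * Nyquist setting described in the text.
    """
    C, H, W = patch.shape
    amp = torch.fft.fft2(patch).abs()                 # 2D amplitude spectrum per channel
    u = torch.fft.fftfreq(H).view(H, 1)               # normalized frequencies (cycles/pixel)
    v = torch.fft.fftfreq(W).view(1, W)
    radius = torch.sqrt(u ** 2 + v ** 2)              # radial frequency
    high = radius > tau_ratio * 0.5                   # Nyquist frequency = 0.5 cycles/pixel
    eps = 1e-6
    return amp[:, high].sum() / (amp.sum() + eps)

# Toy check: blurring should lower the ratio, since blur acts as a low-pass filter.
x = torch.rand(1, 64, 64)
kernel = torch.ones(1, 1, 7, 7) / 49.0                # simple box blur (assumed kernel)
x_blur = F.conv2d(x.unsqueeze(0), kernel, padding=3).squeeze(0)
print(high_freq_energy_ratio(x), high_freq_energy_ratio(x_blur))
```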

2.1.1. Edge-Enhancement Preprocessing Layer

The smoothing–differencing fusion can amplify gradient differences between target edges and background intensities. It provides the input for feature extraction. Pooling operations include max pooling, average pooling, and soft variants [32]. Average pooling is applied to the input feature map X_in ∈ R^{H×W×C}. The element-wise difference between the input and the pooled output captures grayscale variations in edge regions. The original input and the differential edge features X_ed are then fused as in Equation (5):
X_{pre} = X_{in} + \left( X_{in} - \mathrm{AvgPool}(X_{in}) \right)    (5)
where X_pre represents the output features after edge enhancement. This layer is a non-parametric preprocessing operation and does not introduce additional trainable variables. It alleviates edge blurring caused by water turbidity.
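A minimal sketch of this preprocessing step is given below, assuming a 3 × 3 pooling window (the window size is not specified above):

```python
import torch
import torch.nn.functional as F

def edge_enhance(x_in: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Non-parametric edge enhancement, Eq. (5): X_pre = X_in + (X_in - AvgPool(X_in)).

    The pooling window size is an assumption; no trainable parameters are introduced.
    """
    pooled = F.avg_pool2d(x_in, kernel_size, stride=1, padding=kernel_size // 2)
    x_ed = x_in - pooled          # differential edge features
    return x_in + x_ed            # fuse original input with the edge residual

x = torch.rand(2, 64, 80, 80)     # (B, C, H, W)
print(edge_enhance(x).shape)      # torch.Size([2, 64, 80, 80])
```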

2.1.2. Multi-Scale Spatial Feature Extraction Module

Multi-branch depthwise convolution is applied to capture local details across varied receptive fields. The input X_pre is split by channel ratio into three components of different scales: X_c1 ∈ R^{H×W×C/2}, X_c2 ∈ R^{H×W×C/4}, and X_c3 ∈ R^{H×W×C/4}. The 3 × 3 and 5 × 5 depthwise convolutions are used to expand the receptive fields of X_c2 and X_c3. Channel dimensions are unified by pointwise convolution. Multi-scale feature extraction and concatenation are shown in Equation (6):
X_{cat} = \mathrm{Concat}\left( \mathrm{DWConv}_{3\times3}(X_{c2}), \; \mathrm{DWConv}_{5\times5}(X_{c3}), \; X_{c1} \right)    (6)
Channel compression and spatial output are shown in Equation (7):
X_{loc} = \mathrm{PWConv}(X_{cat})    (7)
where X_loc ∈ R^{B×C×H×W} represents the final spatial local features.
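A possible realization of this local branch is sketched below; the channel split ratio follows Equations (6) and (7), while normalization and activation layers are omitted as assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleLocal(nn.Module):
    """Sketch of the local branch (Eqs. (6)-(7)): channel split + DW convs + PW fusion."""
    def __init__(self, channels: int):
        super().__init__()
        c1, c2 = channels // 2, channels // 4
        c3 = channels - c1 - c2
        self.dw3 = nn.Conv2d(c2, c2, 3, padding=1, groups=c2)   # 3x3 depthwise conv
        self.dw5 = nn.Conv2d(c3, c3, 5, padding=2, groups=c3)   # 5x5 depthwise conv
        self.pw = nn.Conv2d(channels, channels, 1)              # pointwise fusion
        self.split = (c1, c2, c3)

    def forward(self, x_pre: torch.Tensor) -> torch.Tensor:
        x_c1, x_c2, x_c3 = torch.split(x_pre, self.split, dim=1)
        x_cat = torch.cat([self.dw3(x_c2), self.dw5(x_c3), x_c1], dim=1)  # Eq. (6)
        return self.pw(x_cat)                                             # Eq. (7) -> X_loc

x_loc = MultiScaleLocal(64)(torch.rand(2, 64, 80, 80))
print(x_loc.shape)                # torch.Size([2, 64, 80, 80])
```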

2.1.3. Frequency-Guided Sparsity for Dual-Stage Spatial Attention

In underwater imaging, degradation is often spatially non-uniform: clear and blurred regions coexist within the same frame. Using a fixed Top-K value for all regions may either discard weak yet informative responses in severely degraded areas or preserve redundant background responses in clear areas. To address this issue, we propose a frequency-guided sparsity strategy in which Fourier-based degradation cues guide the sparsity of spatial attention.
Given the feature map X_pre ∈ R^{B×C×H×W}, we generate query, key, and value embeddings and flatten them into spatial tokens:
Q, K, V = \mathrm{Reshape}\left( \mathrm{PWConv}(X_{pre}) \right), \qquad Q, K, V \in R^{B \times N \times d}, \; N = HW    (8)
1.
Proxy token construction
To reduce the computational burden of dense attention and obtain compact representatives, we construct n ≪ N proxy tokens by pooling Q:
A = \mathrm{AvgPool}\left( Q; \, \mathrm{kernel\_size} = (3, 1) \right), \qquad A \in R^{B \times n \times d}, \; n \ll N = HW    (9)
Let Ω_j denote the spatial region (pooling window) corresponding to the j-th proxy token A_j.
2.
Degradation indicator and sparsity ratio assignment
For each proxy region Ω_j, we extract an aligned local patch P_j = X_pre[:, :, Ω_j] ∈ R^{B×C×p_h×p_w} and compute its 2D Fourier amplitude spectrum. Since blur behaves as a low-pass effect that suppresses high-frequency components, we define a degradation indicator by the ratio of high-frequency energy:
\mu_j = \frac{\sum_{c} \sum_{(u,v) \in H} \left| F(P_j)_c(u,v) \right|}{\sum_{c} \sum_{(u,v)} \left| F(P_j)_c(u,v) \right| + \epsilon}, \qquad \mu \in R^{B \times n}    (10)
Here, F denotes the 2D FFT, H is a predefined high-frequency region, and ϵ is a small constant for numerical stability.
The high-frequency region H is defined as
H = \left\{ (u, v) \;\middle|\; \sqrt{u^2 + v^2} > \tau \right\}    (11)
where τ is a radial frequency threshold. It is set to 0.5 times the Nyquist frequency in this work. This radial definition avoids orientation bias. It ensures consistent frequency partition across resolutions.
To ensure scale invariance, the blur score μ_j is normalized within each image. Let μ = {μ_1, μ_2, …, μ_n} denote all proxy scores of one sample. We sort μ in ascending order. The first and second tertiles are computed as Q_{1/3}(μ) and Q_{2/3}(μ), where Q_p denotes the p-th quantile operator. Each proxy region is assigned to one of three degradation levels:
g_j = \begin{cases} \text{blur}, & \mu_j \le Q_{1/3}(\mu) \\ \text{mid}, & Q_{1/3}(\mu) < \mu_j < Q_{2/3}(\mu) \\ \text{clear}, & \mu_j \ge Q_{2/3}(\mu) \end{cases}    (12)
This ranking avoids absolute thresholds. The grouping adapts to each image and guides the subsequent Top-K selection.
Attention is computed over N = H × W spatial tokens. The Top-K mask is applied along this spatial-token dimension. For each proxy region j, the Top-K value is defined as
k_j = \mathrm{clip}\left( \rho_{g_j} \cdot N, \; k_{min}, \; k_{max} \right)    (13)
where ρ_{g_j} ∈ {ρ_large, ρ_mid, ρ_small} and ρ_large > ρ_mid > ρ_small.
The sparsity ratio is therefore determined by the degradation level. Blurred regions receive larger ρ. Clear regions receive smaller ρ.
This design ensures that the sparsity behavior is consistent across different feature-map resolutions and retains more candidates for severely degraded regions.
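A minimal sketch of this degradation-guided assignment (Equations (12) and (13)) is given below; the default ratios follow the best grouped setting (ρ_small, ρ_mid, ρ_large) = (1/4, 1/3, 1/2) reported in Section 4, while the clip bounds are illustrative assumptions:

```python
import torch

def grouped_topk(mu: torch.Tensor, n_tokens: int,
                 rho=(0.5, 1 / 3, 0.25), k_min: int = 16, k_max: int = None) -> torch.Tensor:
    """Assign a per-proxy Top-K value from degradation scores mu of shape (B, n).

    rho = (rho_large, rho_mid, rho_small): blurred regions receive the large ratio,
    clear regions the small one. The clip bounds k_min / k_max are assumptions.
    """
    if k_max is None:
        k_max = n_tokens
    q1 = torch.quantile(mu, 1 / 3, dim=1, keepdim=True)   # first tertile per image
    q2 = torch.quantile(mu, 2 / 3, dim=1, keepdim=True)   # second tertile per image
    rho_large, rho_mid, rho_small = rho
    ratio = torch.full_like(mu, rho_mid)                   # default: "mid" regions
    ratio = torch.where(mu <= q1, torch.full_like(mu, rho_large), ratio)   # blurred
    ratio = torch.where(mu >= q2, torch.full_like(mu, rho_small), ratio)   # clear
    k = (ratio * n_tokens).round().clamp(k_min, k_max).long()             # Eq. (13)
    return k                                                              # shape (B, n)

mu = torch.rand(2, 49)             # 49 proxy tokens per image (toy degradation scores)
print(grouped_topk(mu, n_tokens=400)[0, :8])
```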
3.
Dual-stage attention: proxy aggregation and broadcast refinement
We adopt a dual-stage attention mechanism. Stage I performs proxy-based aggregation, and Stage II broadcasts the aggregated information back to spatial tokens for refinement.
Stage I: Proxy aggregation.
Using proxies A as queries, K as keys, and V as values, we compute proxy-to-spatial attention and apply the degradation-guided Top-K mask along the spatial-token dimension:
X_I = \mathrm{SoftMax}\left( \mathrm{Mask}_{TopK}\left( \frac{A K^{\top}}{\sqrt{d}}, \; k_j \right) \right) V, \qquad X_I \in R^{B \times n \times d}    (14)
This stage selectively aggregates a sparse set of spatial responses into each proxy, allowing severely degraded regions to retain more candidates.
Stage II: Broadcast refinement.
We then broadcast the aggregated proxy information back to all spatial tokens for refinement. Using Q as queries, A as keys, and the proxy features X_I as values, we obtain:
X_{att} = \mathrm{SoftMax}\left( \frac{Q A^{\top}}{\sqrt{d}} \right) X_I, \qquad X_{att} \in R^{B \times N \times d}    (15)
X_att is reshaped to R^{B×d×H×W} and projected back to the original channel dimension via a 1 × 1 pointwise convolution, producing the output X_glo ∈ R^{B×C×H×W}.
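The two stages can be sketched as follows; the rank-based Top-K masking and the tensor layout are assumptions, since the exact masking implementation is not detailed here:

```python
import torch

def dual_stage_attention(Q, K, V, A, k_j):
    """Sketch of the two attention stages (Eqs. (14)-(15)); layouts are assumed.

    Q, K, V: (B, N, d) spatial tokens;  A: (B, n, d) proxy tokens;  k_j: (B, n) Top-K values.
    """
    B, N, d = Q.shape
    scale = d ** 0.5

    # Stage I: proxy aggregation with a degradation-guided Top-K mask.
    logits = A @ K.transpose(1, 2) / scale                             # (B, n, N) scores
    ranks = logits.argsort(dim=-1, descending=True).argsort(dim=-1)    # rank of each token
    mask = ranks < k_j.unsqueeze(-1)                                   # keep k_j largest per proxy
    logits = logits.masked_fill(~mask, float("-inf"))
    X_I = torch.softmax(logits, dim=-1) @ V                            # (B, n, d) proxies

    # Stage II: broadcast refinement back to all spatial tokens.
    X_att = torch.softmax(Q @ A.transpose(1, 2) / scale, dim=-1) @ X_I  # (B, N, d)
    return X_att

B, N, n, d = 2, 400, 49, 64
Q, K, V = (torch.rand(B, N, d) for _ in range(3))
A = torch.rand(B, n, d)
k_j = torch.randint(16, 200, (B, n))
print(dual_stage_attention(Q, K, V, A, k_j).shape)    # torch.Size([2, 400, 64])
```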

2.1.4. Local–Global Feature Fusion Layer

X_loc and X_glo are concatenated and fused by pointwise convolution, which enables cross-channel interaction and reduces the dimensionality. The output X_sfd preserves local details and global correlations:
X_{sfd} = \mathrm{PWConv}\left( \mathrm{Concat}(X_{loc}, X_{glo}) \right)    (16)
Local features during multi-scale target matching are obtained by differential receptive fields. Global features suppress background interference. The local–global feature fusion layer realizes feature representation of degraded underwater targets.
Let the number of spatial tokens be N = HW. Standard attention has complexity O(N²). Proxy-based aggregation reduces the token count to n ≪ N. Stage I has complexity O(nN) and Stage II has complexity O(nN), so the total complexity is O(nN). When n ≪ N, the computational cost is significantly reduced.

2.2. Relevance-Difference Fusion (RD-Fusion) Strategy

The docking task of UWPT requires the localization of small-scale guide lights, large-scale docking stations, and UUVs. Turbid water conditions would cause shallow feature loss of small targets. A relevance-difference fusion (RD-Fusion) strategy is thus proposed. Small-target details are recovered by the calculation of differences in deep and shallow features. Fusion weights of detailed and semantic features are dynamically adjusted. The output feature maps integrate the global positioning of large targets and the precise localization of small targets.
The RD-Fusion strategy is illustrated in Figure 5. It consists of global channel attention generation, difference feature weight generation, and dynamic weighted fusion.

2.2.1. Global Channel Attention Generation

Shallow features F_1 ∈ R^{C×H×W} contain rich spatial details such as edges and textures. Deep features F_2 ∈ R^{C×H×W} represent semantic information after upsampling. These features are fused as in Equation (17):
F_{con} = \mathrm{Concat}(F_1, F_2)    (17)
where F_con ∈ R^{2C×H×W}. This operation integrates spatial details and semantic context.
Global Average Pooling (GAP) aggregates spatial information across each channel:
\mathrm{GAP}(F_{con})_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{con}(c, i, j), \qquad c = 1, 2, \ldots, 2C    (18)
where GAP(F_con) ∈ R^{2C×1×1}.
The pooled feature is processed by pointwise convolution to model channel interactions. The Sigmoid function generates the channel attention weights:
[h_i, l_i] = \mathrm{Sigmoid}\left( \mathrm{PWConv}\left( \mathrm{GAP}(F_{con}) \right) \right)    (19)
These weights emphasize informative channels and suppress weak responses.

2.2.2. Difference Feature Weight Generation

The feature difference between shallow and deep representations is computed as in Equation (20):
F_{diff} = F_1 - F_2, \qquad F_{diff} \in R^{C \times H \times W}    (20)
This operation highlights complementary information between spatial details and semantic features. A depthwise separable convolution is applied to generate a difference weight matrix, as shown in Equation (21):
\alpha = \mathrm{Sigmoid}\left( \mathrm{DSConv}_{3\times3}(F_{diff}) \right)    (21)
where α ∈ R^{C×H×W}. Larger values of α indicate stronger feature discrepancy.

2.2.3. Dynamic Weighted Fusion

Channel attention and difference weights are combined to perform dynamic fusion:
F_{out} = \mathrm{PWConv}\left( F_1 \otimes \alpha \otimes h_i + F_2 \otimes (1 - \alpha) \otimes l_i \right)    (22)
where ⊗ denotes element-wise multiplication.
The fusion mechanism enhances distinctive spatial details from F 1 while preserving the semantic representation from F 2 . The design improves scale adaptability and reduces feature imbalance without increasing computational complexity.
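A compact sketch of the full RD-Fusion computation (Equations (17)–(22)) is given below; the layer hyperparameters are assumptions and normalization layers are omitted:

```python
import torch
import torch.nn as nn

class RDFusion(nn.Module):
    """Sketch of the RD-Fusion strategy (Eqs. (17)-(22)); layer details are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn_pw = nn.Conv2d(2 * channels, 2 * channels, 1)   # channel interaction
        # depthwise-separable convolution producing the difference weight alpha
        self.ds_dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.ds_pw = nn.Conv2d(channels, channels, 1)
        self.out_pw = nn.Conv2d(channels, channels, 1)
        self.c = channels

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Global channel attention (Eqs. (17)-(19))
        f_con = torch.cat([f1, f2], dim=1)
        gap = f_con.mean(dim=(2, 3), keepdim=True)                # GAP over H and W
        weights = torch.sigmoid(self.attn_pw(gap))
        h_i, l_i = weights[:, : self.c], weights[:, self.c:]      # shallow / deep weights
        # Difference feature weight (Eqs. (20)-(21))
        alpha = torch.sigmoid(self.ds_pw(self.ds_dw(f1 - f2)))
        # Dynamic weighted fusion (Eq. (22))
        return self.out_pw(f1 * alpha * h_i + f2 * (1 - alpha) * l_i)

f1 = torch.rand(2, 128, 40, 40)    # shallow features
f2 = torch.rand(2, 128, 40, 40)    # upsampled deep features
print(RDFusion(128)(f1, f2).shape)
```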

2.3. U-CIoU (Underwater-CIoU) Loss Function

Underwater docking detection is affected by non-uniform illumination and background scattering. In shallow waters, strong light reflection introduces spatial brightness bias. This bias alters feature responses and affects anchor box regression. The CIoU loss does not consider illumination imbalance. In practice, this may cause anchor box inflation or regression drift. To improve stability, an underwater-optimized CIoU (U-CIoU) loss is proposed. It introduces an illumination-background adaptive weight and a scale variation regularization term.

2.3.1. CIoU Loss Function

The CIoU loss is formulated as shown in Equation (23):
L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v    (23)
where IoU represents the intersection over union between the predicted and ground-truth boxes, ρ²(b, b^{gt}) denotes the squared Euclidean distance between box centers, c indicates the diagonal length of the smallest enclosing rectangle, α is a trade-off parameter, and v measures aspect ratio consistency.

2.3.2. Illumination-Induced Gradient Drift

Underwater image intensity can be modeled as
f(x, y) = s(x, y) + l(x, y)    (24)
where s(x, y) represents structural information and l(x, y) denotes illumination bias. Bounding-box regression follows gradient descent:
\theta_{t+1} = \theta_t - \eta \nabla L    (25)
High-illumination regions produce stronger feature activations. During backpropagation:
\nabla L \propto f(x, y)    (26)
Thus, gradients tend to be amplified in bright regions. The regression direction may drift toward illumination peaks rather than true geometric centers. This effect reduces localization stability. To compensate for this bias, an adaptive weight is introduced.

2.3.3. U-CIoU Loss Function

The enhanced U-CIoU loss is defined as shown in Equation (27):
L_{U\text{-}CIoU} = w_i \left( 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \right) + \lambda_{reg}    (27)
1.
Adaptive weight term wi
Loss weights are dynamically adjusted according to illumination intensity and background complexity:
w_i = \mathrm{Sigmoid}(C_i) \cdot \left( 1 - \mathrm{Sigmoid}(I_i) \right)    (28)
where I_i is the mean intensity within the predicted box and C_i is the local entropy measuring background complexity. Low illumination and high background complexity increase the weight. The Sigmoid function bounds the value and ensures stable gradients. This mechanism redistributes gradient energy to difficult regions and corrects illumination-driven drift.
2.
Anchor Regularization Term λreg
A penalty on the rate of size change is introduced to suppress anchor box inflation:
\lambda_{reg} = \sum_{i=1}^{N} \left( \frac{\left| w_i^{t} - w_i^{t-1} \right|}{w_i^{t-1}} + \frac{\left| h_i^{t} - h_i^{t-1} \right|}{h_i^{t-1}} \right)    (29)
where w_i^{t} and h_i^{t} denote the width and height of the i-th anchor box at iteration t, and w_i^{t-1} and h_i^{t-1} represent the corresponding dimensions from the previous iteration. This term penalizes the relative scale variation between consecutive iterations. Anchor inflation originates from unconstrained scale updates, w_i^{t+1} = w_i^{t} - \eta \frac{\partial L}{\partial w}; large oscillations may accumulate and enlarge bounding boxes. The proposed term enforces smooth scale evolution.
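The two new terms can be sketched as follows; the product form of w_i reconstructed in Equation (28), the entropy binning, and the absolute values in λ_reg are assumptions rather than details stated above:

```python
import torch

def adaptive_weight(image: torch.Tensor, box: torch.Tensor, bins: int = 32) -> torch.Tensor:
    """Illumination-background adaptive weight w_i (Eq. (28), reconstructed form).

    image: grayscale image in [0, 1], shape (H, W);  box: (x1, y1, x2, y2) in pixels.
    The histogram-based entropy estimate is an assumption.
    """
    x1, y1, x2, y2 = [int(v) for v in box]
    patch = image[y1:y2, x1:x2]
    I_i = patch.mean()                                    # mean intensity inside the box
    hist = torch.histc(patch, bins=bins, min=0.0, max=1.0)
    p = hist / (hist.sum() + 1e-6)
    C_i = -(p * torch.log(p + 1e-6)).sum()                # local entropy (background complexity)
    return torch.sigmoid(C_i) * (1.0 - torch.sigmoid(I_i))

def anchor_regularization(wh_t: torch.Tensor, wh_prev: torch.Tensor) -> torch.Tensor:
    """Scale-change penalty lambda_reg (Eq. (29)): relative width/height variation
    between consecutive iterations; absolute values are an assumption."""
    rel = (wh_t - wh_prev).abs() / (wh_prev + 1e-6)       # (N, 2) relative change
    return rel.sum()

img = torch.rand(480, 640)
w = adaptive_weight(img, torch.tensor([100, 120, 220, 260]))
reg = anchor_regularization(torch.rand(8, 2) + 0.5, torch.rand(8, 2) + 0.5)
print(float(w), float(reg))
```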

2.3.4. Parameter Sensitivity Discussion

The proposed U-CIoU loss does not introduce additional manually tuned hyperparameters beyond those in the standard CIoU formulation. The adaptive weight w i is computed directly from local intensity and entropy statistics. It contains no scaling coefficient and is fully data-driven. The anchor regularization term λ reg is defined by the relative variation in anchor dimensions across iterations. No extra balancing factor is introduced. Its magnitude is naturally coupled with the regression dynamics. Therefore, the optimization structure remains unchanged compared with CIoU. No additional hyperparameter tuning is required.
The performance improvement arises from gradient redistribution under uneven illumination and stabilization of scale updates, rather than from parameter adjustment. Hence, the sensitivity level is comparable to CIoU while providing enhanced robustness in underwater environments.

3. UWPT Platform and Dataset Introduction

This section describes the UWPT platform and a multi-source dataset.

3.1. UWPT Platform

The UWPT platform in this project consists of a conical guidance cover, a sleeve structure, and a base module. The platform integrates wireless charging units, power conversion devices, communication modules, underwater cameras, and a motor-driven lifting mechanism. This integrated configuration supports both energy transfer and perception-assisted docking operations.
An underwater camera is mounted on the conical guidance cover to capture real-time visual information of the approaching UUV. The system estimates the relative position and orientation of the UUV and provides feedback for adaptive docking control. The lifting mechanism adjusts the docking angle through rotational motion. This design compensates for misalignment caused by ocean currents or UUV motion. The active adjustment mechanism improves docking tolerance and stability during power transfer.
Multiple guide lights are arranged around the conical cover to enhance perception reliability. These lights serve as structured visual cues for the UUV camera. They provide stable feature references under low-visibility underwater conditions. The bilateral visual configuration integrates platform-side and UUV-side perception. This configuration provides hardware support for the proposed detection framework.
In practical deployment, UWPT systems operate under environmental and hardware constraints. Charging tasks are typically conducted in shallow and mid-depth waters. The effective visual detection range in turbid coastal environments is often limited to several meters, depending on water quality. Light scattering reduces image contrast and weakens edge features. Accurate detection of docking contours and guide lights is therefore critical for stable alignment.
The platform is equipped with embedded computing units. Real-time inference is required during dynamic docking. High-complexity models increase latency and energy consumption. This may affect system stability. The proposed TFDF-YOLO framework is designed under these constraints. The lightweight structure reduces computational burden while maintaining detection robustness under illumination variation and multi-scale conditions.

3.2. Multi-Source Dataset

At present, there is no publicly available dataset that matches the structural characteristics and operational requirements of the UWPT docking scenario. Existing underwater detection datasets focus on generic marine objects and do not contain structured docking stations or luminous guidance markers. Therefore, a task-oriented dataset was constructed for this study. This study employs a multi-source dataset which consists of experimental data, simulated data, and publicly available data, totaling 8240 images. Figure 6 shows the platform and experimental devices. Partial areas are pixelated in Figure 6.

3.2.1. Experimental Data

Experimental data are collected from underwater charging trials of the UWPT platform in Qingdao, China. As shown in Figure 6a–c, the data contain 3800 samples with a pixel resolution of 1920 × 1080. Background light pollution is also incorporated during experiments.

3.2.2. Simulated Data

To enhance the diversity of the dataset, scaled prototype models (1:15) are used in a small testing tank with 3.5% salinity. Four simulation experiments are conducted with artificial lighting and controlled water turbidity using suspended particle densities (0.2, 0.4, 0.6, and 0.8 g/L). Each group contributes 550 sample images, totaling 2200 images at 1920 × 1080 resolution.

3.2.3. Public Data

The public Underwater Docking Identification Dataset (UDID) [33] is incorporated to evaluate model robustness and scalability. As shown in Figure 6h, UDID provides diverse docking station geometries and scales. All docking station images from UDID are re-annotated with guide light categories. This subset contains 2240 samples with 1920 × 1080 resolution.

3.2.4. Typical Scenarios in Multi-Source Data

Multiple representative scenarios are designed for the dataset according to sea trial experience. Figure 7 shows target occlusion, water turbidity, motion blur, color distortion, and low-contrast scenarios.

3.3. Data Analysis

This work focuses on multi-scale target recognition of underwater docking, including large-scale targets (UUVs, docking stations) and small-scale targets (guide lights). A scatter plot shows the pixel distributions of the target bounding box. The horizontal axis shows bounding box widths while the vertical axis displays heights (unit: pixels), as shown in Figure 8.
The bounding boxes are continuously distributed within the range of 0–500 pixels. Small-scale targets (0–100 pixels) mainly correspond to distant UUVs and guide lights. These objects occupy limited pixels and are sensitive to scattering and motion blur. Medium-scale targets (100–300 pixels) include mid-range UUV bodies and docking structures. Large-scale targets (300–500 pixels) represent close-range docking scenes such as UUV heads and dock entrances.
The distribution indicates significant scale variation in the dataset. The model must preserve fine-grained spatial details for small objects. It must also maintain a strong semantic representation for large objects. This multi-scale imbalance increases the difficulty of underwater docking detection and motivates the design of scale-aware feature fusion mechanisms.
The dataset comprises 8240 RGB images with a unified resolution of 1920 × 1080 pixels. All images are annotated using the LabelImg tool. Three target categories are annotated: UUVs, docking stations, and lights. The multi-source dataset is randomly divided into training, validation, and test sets in an 8:1:1 ratio, as shown in Table 1.
To ensure reliable evaluation, the dataset split follows three principles: distribution consistency, source balance, and scene independence.
Stratified random sampling is applied to maintain consistent category proportions across subsets. This avoids evaluation bias caused by class imbalance. The dataset includes images from sea trials, pool experiments, and public sources, accounting for 46%, 27%, and 27% of the total data. The proportion of each source is controlled in every subset. This prevents the model from overfitting to a specific environment.
Scene independence is strictly enforced. Sea and pool images are extracted from continuous video sequences. Adjacent frames are highly similar. Therefore, the split is performed at the sequence level instead of the image level. Images from the same sequence are assigned to the same subset. No sequence appears in both training and test sets. This design prevents near-duplicate leakage and ensures that test results reflect generalization to unseen scenes.
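A sketch of such a source-balanced, sequence-level split is given below; the frame-record field names and the shuffling scheme are illustrative assumptions:

```python
import random
from collections import defaultdict

def sequence_level_split(frames, train=0.8, val=0.1, seed=0):
    """Assign whole video sequences to train/val/test so near-duplicate frames
    never cross subsets; each data source is split separately to keep its proportion.

    frames: list of dicts like {"path": ..., "sequence": ..., "source": ...} (assumed fields).
    """
    by_source = defaultdict(lambda: defaultdict(list))
    for f in frames:
        by_source[f["source"]][f["sequence"]].append(f)    # group frames by sequence

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for source, sequences in by_source.items():            # balance each source separately
        seq_ids = sorted(sequences)
        rng.shuffle(seq_ids)
        n = len(seq_ids)
        n_train, n_val = int(n * train), int(n * val)
        for i, sid in enumerate(seq_ids):
            subset = "train" if i < n_train else "val" if i < n_train + n_val else "test"
            splits[subset].extend(sequences[sid])
    return splits
```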

4. Result Analysis

This section introduces the evaluation metrics and experimental results. Three comparative experiments are designed to validate the effectiveness of the proposed model, including model comparison, optimal setting selection, and ablation experiments.

4.1. Experimental Setup

4.1.1. Implementation Details

All experiments were implemented using Python 3.8 and PyTorch 2.2. The hardware platform consisted of an Intel i7-12700F CPU and an NVIDIA RTX 4060 GPU.
The input resolution was fixed to 640 × 640 with a batch size of 16. All models were trained for 300 epochs using the Stochastic Gradient Descent (SGD) optimizer. The initial learning rate was 0.01, the final learning rate was 0.001, the momentum was 0.937, and the weight decay was set to 0.0005. To ensure a fair comparison and reproducibility, all competing models were trained under identical settings, including training epochs, optimizer configuration, input size, and anchor configuration.
During training, standard data augmentation strategies were applied, including Mosaic augmentation, random horizontal flipping, random scaling, and HSV perturbation. No additional dataset-specific enhancement was introduced.
During inference, the confidence threshold was set to 0.7, and the Non-Maximum Suppression (NMS) threshold was set to 0.45 for all methods.
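The optimizer settings above can be reproduced with a plain PyTorch sketch such as the following; the linear decay from 0.01 to 0.001 is an assumption, since the exact schedule shape is not stated:

```python
import torch

# Minimal sketch of the reported optimizer settings (SGD, lr 0.01 -> 0.001,
# momentum 0.937, weight decay 5e-4, 300 epochs). `model` is a placeholder;
# the linear decay is an assumed schedule shape.
model = torch.nn.Conv2d(3, 16, 3)           # stand-in for the detection network
epochs, lr0, lrf = 300, 0.01, 0.001

optimizer = torch.optim.SGD(model.parameters(), lr=lr0,
                            momentum=0.937, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda e: 1.0 - (1.0 - lrf / lr0) * e / max(epochs - 1, 1))

for epoch in range(epochs):
    # ... forward pass, loss computation, optimizer.step() go here ...
    scheduler.step()                         # lr decays linearly from 0.01 to 0.001
```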

4.1.2. Evaluation Metrics

Mean Average Precision (mAP) is used to evaluate object detection performance. It is derived from precision and recall. Precision measures the proportion of correctly predicted targets among all detections:
P = \frac{TP}{TP + FP}
where TP denotes true positives and FP denotes false positives. Recall measures the proportion of correctly detected targets among all ground-truth targets:
R = \frac{TP}{TP + FN}
where FN represents false negatives.
mAP@0.5 refers to the average precision under an IoU threshold of 0.5. mAP@0.5:0.95 computes the mean precision across IoU thresholds from 0.5 to 0.95. Larger mAP values indicate better detection accuracy.
AP = \int_{0}^{1} P(R) \, dR
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i
where AP_i denotes the average precision of class i, and n is the total number of classes.
Model complexity is evaluated by the number of parameters and the number of Giga Floating-point Operations (GFLOPs). One GFLOP corresponds to 10⁹ floating-point operations.
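For reference, the following sketch computes per-class AP by integrating the precision–recall curve as in the equations above; the all-point interpolation and the prior IoU-based matching of detections to ground truth are common conventions assumed here, not necessarily the exact evaluation code used:

```python
import numpy as np

def average_precision(scores, labels, num_gt):
    """AP for one class by integrating the precision-recall curve.

    scores: detection confidences; labels: 1 if the detection matches a ground-truth
    box at the chosen IoU threshold (e.g. 0.5), else 0; num_gt: number of GT boxes.
    """
    order = np.argsort(-np.asarray(scores))               # sort by descending confidence
    tp = np.asarray(labels, dtype=float)[order]
    fp = 1.0 - tp
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    recall = np.cumsum(tp) / max(num_gt, 1)
    # Integrate precision over recall using the precision envelope.
    prec_env = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall, [recall[-1] if len(recall) else 0.0]))
    prec_env = np.concatenate(([prec_env[0] if len(prec_env) else 0.0], prec_env, [0.0]))
    return float(np.sum(np.diff(recall) * prec_env[1:]))

ap = average_precision([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 1], num_gt=4)
print(round(ap, 3))   # 0.688 for this toy example
```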

4.2. Comparative Analysis

4.2.1. Model Analysis

Three categories of mainstream detection models (two-stage, Transformer-based, and one-stage) are analyzed, including Faster R-CNN, RT-DETR-R18 [34], SSD, and multiple YOLO series versions.
1.
Image Quality Analysis
A comparative visualization is presented in Figure 9. From left to right and top to bottom, the four scenarios are water turbidity, target occlusion, motion blur, and low contrast. Rows 2–9 show results of TFDF-YOLO, YOLOv11s, YOLOv10s, YOLOv9s, YOLOv8s, YOLOv5s, RT-DETR-R18, SSD, and Faster R-CNN. Missed detections are marked by yellow dashed circles and false detections by red boxes.
(a) Water Turbidity
In particle-dense water, Faster R-CNN misses small guidance lights and produces incomplete UUV contours. SSD reduces large-object misses but shows drift in small-light localization. RT-DETR-R18 detects major structures yet generates false positives in bright particle regions. YOLOv5s and YOLOv8s exhibit particle-induced false detections. YOLOv9s and YOLOv10s reduce false alarms but still miss some small lights. YOLOv11s improves stability, though interference remains. TFDF-YOLO suppresses particle-related false detections and maintains consistent light localization.
(b) Target Occlusion
Under partial occlusion, Faster R-CNN and SSD produce missed or misaligned small-light detections. RT-DETR-R18 preserves global structure but loses fine light details. YOLOv5s–YOLOv9s detect large objects reliably but occasionally miss occluded lights. YOLOv10s and YOLOv11s improve continuity, yet a slight localization offset remains. TFDF-YOLO maintains compact and stable detection of partially visible guidance lights.
(c) Motion Blur
Motion blur causes box drift and missed lights in Faster R-CNN and SSD. RT-DETR-R18 preserves coarse structure but misses some blurred lights. YOLOv5s and YOLOv8s show instability in small-light localization. YOLOv9s and YOLOv10s reduce misses but exhibit mild box enlargement. YOLOv11s improves robustness, but a slight deviation persists. TFDF-YOLO maintains accurate bounding boxes under blur.
(d) Low Contrast
Under low contrast, Faster R-CNN and SSD show missed detections and inaccurate box placement. RT-DETR-R18 detects large structures but misses dim lights. YOLOv5s–YOLOv10s present unstable small-target localization, with occasional box shift. YOLOv11s improves stability but still shows slight enlargement. TFDF-YOLO consistently detects small guidance lights with tight bounding boxes.
Two-stage detectors exhibit higher small-target miss rates. RT-DETR-R18 enhances global localization but remains sensitive to interference. YOLO variants progressively improve robustness, yet small-target drift and occasional false detections persist under degradation. The proposed TFDF-YOLO not only consistently identifies large objects but also accurately localizes small guidance lights in all scenarios. It achieves no missed or false detections by tightly fitting bounding boxes. The proposed TFDF-YOLO model performs best in complex underwater environments.
2.
Recognition accuracy analysis
Table 2 shows six performance metrics of each model: precision (P), recall (R), mAP@0.5, mAP@0.5:0.95, parameters, and GFLOPs. Each model was trained three times under identical settings using different random seeds (0, 1, and 2). Results are reported as mean ± standard deviation. Values in parentheses denote the 95% confidence intervals.
TFDF-YOLO obtains the highest values across all accuracy metrics. The mAP@0.5 increases from 85.9 ± 0.2% to 92.7 ± 0.2%. The mAP@0.5:0.95 increases from 61.1 ± 0.2% to 67.5 ± 0.2%. The corresponding confidence intervals do not overlap. The improvement is statistically significant at the 0.05 level. Faster R-CNN shows lower detection accuracy and larger variance. RT-DETR-R18 improves mAP@0.5 to 85.1%, but the computational cost reaches 58.8 GFLOPs. This increases inference burden in real-time docking scenarios. Within the YOLO series, YOLOv5s has the smallest parameter size (7.1 M) but lower accuracy. YOLOv8s increases accuracy with higher GFLOPs. YOLOv11s achieves 85.9% mAP@0.5 with 10.1 M parameters and 21.5 GFLOPs. It is used as the baseline model. TFDF-YOLO has 10.5 M parameters and 21.1 GFLOPs. The computational complexity remains similar to the baseline. Precision reaches 91.5%, and mAP@0.5 reaches 92.7%. The standard deviation remains at ±0.2 across metrics. Optimization remains stable during training. The effectiveness of the proposed model is verified.
Figure 10 presents a normalized radar chart of the performance of different methods. It indicates that TFDF-YOLO is capable of the object detection task of UWPT docking by the proposed feature extraction and fusion modules.
To provide a targeted evaluation, per-class AP results are reported in Table 3. The metrics include AP-UUV, AP-Docking, AP-Light, and mAP. All values are reported as mean ± standard deviation over three independent runs (seeds 0, 1, 2). Values in parentheses denote the 95% confidence intervals.
For large-scale UUV targets, Faster R-CNN and SSD show lower accuracy due to limited feature adaptation to underwater blur. RT-DETR-R18 improves AP-UUV to 86.3%. The YOLO series achieves stable performance between 82% and 87%. YOLOv11s reaches 87.2%. TFDF-YOLO achieves 93.5%. For docking station detection, similar trends are observed. Faster R-CNN and SSD remain below 72%. RT-DETR-R18 reaches 84.7%. The YOLO variants range from 80.7% to 85.6%. TFDF-YOLO achieves 91.8%. The improvement remains consistent across repeated runs. For small guidance lights, performance differences become more pronounced. Small targets are sensitive to blur and contrast attenuation. Faster R-CNN and SSD remain below 67%. RT-DETR-R18 reaches 80.5%. YOLO variants range from 76.9% to 81.3%. TFDF-YOLO achieves 88.6%. The absolute gain over YOLOv11s is 7.3%. The SFD module strengthens edge features. The RD-Fusion module improves multi-scale representation. The U-CIoU mechanism stabilizes regression under illumination variation. These components jointly improve localization stability and small-target detection performance. The overall mAP increases from 85.9% to 92.7%. The effectiveness of the proposed model is verified.

4.2.2. Optimal Setting Analysis

1.
Ablation on kernel configuration settings.
To justify the convolutional parameter selection, we compare different kernel combinations while keeping stride = 1 and channel numbers unchanged for fair evaluation, as shown in Table 4. Stride is fixed at 1 to preserve spatial resolution and avoid internal downsampling, which is critical for contour alignment and small-target detection in underwater docking.
Single-kernel results show that the 3 × 3 convolution achieves the best individual performance. Larger kernels (e.g., 7 × 7) lead to feature over-smoothing and reduced recall. Among multi-scale configurations, the 1–3–5 combination achieves the highest accuracy, with 89.4% mAP@0.5 and 65.8% mAP@0.5:0.95. This setting balances local detail extraction and moderate receptive field expansion without excessive parameter growth. Considering detection performance and structural stability, the 1–3–5 configuration with stride = 1 was selected as the optimal setting.
2.
Ablation on single sparsity ratio and grouped sparsity ratios.
We first study the single sparsity ratio ρ for Top-K masking. Table 5 shows that an overly small ratio ρ = 1 / 4  causes a clear performance drop. This result indicates that excessive sparsification removes informative edge cues in degraded regions. Increasing ρ improves the results. The best single setting is ρ = 1 / 3 . Larger ratios ( ρ = 1 / 2 , 2/3, and 1.0) do not bring further gains. They tend to preserve redundant background responses and weaken the precision–recall balance.
We then evaluate grouped sparsity ratios (ρ_small, ρ_mid, ρ_large) assigned by degradation ranking. As shown in Table 6, all grouped settings outperform the single-ρ baseline. This observation supports the necessity of spatially adaptive sparsification under underwater docking degradation. Specifically, (1/6, 1/3, 2/3) mainly improves recall with limited mAP gains, and (1/4, 1/3, 2/3) further improves accuracy. The proposed (1/4, 1/3, 1/2) setting achieves the best overall performance at 21.8 GFLOPs. The result suggests that an overly large ρ_large may introduce noisy or redundant features, while a moderate grouped setting yields a better trade-off.
These gains come from the proposed frequency-guided sparsity mechanism. Fourier-based degradation cues control the proxy-wise Top-K values. Stage-I aggregation preserves more candidates in severely degraded regions. Stage-II refinement broadcasts the aggregated information back to spatial tokens. The model strengthens blurred structural edges and suppresses background interference during UWPT docking.
3.
Ablation on Illumination Intensity Computation
To evaluate the influence of illumination estimation, different computation strategies for I i  are compared.
As summarized in Table 7, the results show that using the mean pixel intensity within the predicted bounding box achieves the best performance, reaching 92.7% mAP@0.5 and 67.5% mAP@0.5:0.95. Global image averaging leads to noticeable performance degradation, indicating that local illumination cues are more informative for regression weighting. The median and histogram-peak strategies are slightly inferior to the mean-based method. The mean intensity provides stable gradient guidance while preserving sensitivity to brightness variation within the target region. The bounding-box mean intensity is adopted in the final U-CIoU formulation.
4.
Feature extraction module
Different extraction modules are integrated into YOLOv11 to verify the effectiveness of the SFD module, such as Ghost Dynamic Conv [35], Faster-Net [36], Star-Net [37], Dynamic Snake Convolution (DySnakeConv) [38], Wavelet Transform Convolution (WTConv) [39], and the proposed SFD module. Based on previous experimental results, YOLOv11s serves as the baseline reference. Comparison results are shown in Table 8.
The C3k2-GhostDynamicConv module is the most lightweight, with 8.9 M parameters. C3k2-DySnakeConv improves recall through snake-style convolution. C3k2-Star enhances cross-level interaction while increasing the parameter count. Without optimized edge extraction, these variants yield a minimal mAP@0.5 improvement of 0.4%. The proposed SFD module demonstrates significant advantages. Recall reaches 83.8%, a 4.7% increase, and mAP@0.5:0.95 attains 65.8%, a 4.7% gain. Its parameter count remains at 10.7 M with 21.8 GFLOPs. The results indicate that lost target details are recovered by the SFD module. Fourier-based degradation cues adaptively control proxy-wise Top-K masking, and the dual-stage spatial attention enhances degraded edges while suppressing redundant background responses.
The convergence curve of the SFD module is shown in Figure 11. The proposed SFD module achieves high precision and recall rapidly. The SFD module converges faster than the other variants during the first 50 epochs. It also sustains the highest and most stable performance. The proposed SFD module is capable of object detection of UWPT docking.
As shown in Figure 12, the variants show inaccurate or missing detections with unstable or drifting boxes under low-contrast conditions. In contrast, the proposed SFD module strengthens edge and fine-scale features, which improves the detectability of blurred targets and achieves more accurate and stable localization.
5.
Feature fusion strategy
Different feature fusion strategies, such as SlimNeck [40], Bi-directional Feature Pyramid Network (Bi-FPN) [41], Attentional Scale Sequence Fusion (ASF) [42], Generalized Feature Pyramid Network (G-FPN) [43], and the proposed RD-Fusion, are all integrated into the YOLOv11s model for a fair comparison. Results are summarized in Table 9.
Bi-FPN achieves the lowest parameter count of 9.7 M, but its mAP@0.5 is only 86%. ASF attains high recall and mAP@0.5 for small objects at 21.3 GFLOPs. The precision of G-FPN is slightly higher than that of RD-Fusion. However, the proposed RD-Fusion achieves the best values in all other indicators, which indicates that subtle target–background distinctions are captured by the proposed strategy. Its computational cost is only 20.7 GFLOPs.
The convergence curve of RD-Fusion is shown in Figure 13. RD-Fusion converges faster than other methods. The overall performance and real-time capability of RD-Fusion are the best, while G-FPN shows marginally higher precision at certain stages. The proposed RD-Fusion strategy is suitable for the real-time task of UWPT docking.
As shown in Figure 14, the variants produce false positives or miss small targets when facing complex underwater backgrounds and mixed-scale objects. The proposed RD-Fusion enhances multi-scale feature integration and achieves more consistent and reliable detections.
6.
Loss function
To validate the effectiveness of the proposed U-CIoU loss function, CIoU, Distance Intersection over Union (DIoU) [44], Generalized Intersection over Union (GIoU) [45], and Shape-IoU [46] functions are implemented in a YOLOv11s model. The comparative results are presented in Table 10.
DIoU shows the lowest precision because blurred target edges in underwater environments amplify center-point errors. GIoU handles non-overlapping boxes by minimum bounding rectangles. It achieves slightly higher precision and recall than CIoU. The recall of shape-IoU is increased to 80.6% by a shape prior mechanism. The proposed U-CIoU function reaches 86.6% mAP@0.5, which is clearly superior to other functions. U-CIoU without any shape priors has high robustness, especially in complex underwater environments.
As shown in Figure 15, faster convergence and a lower final loss are achieved by U-CIoU under identical training epochs. The loss of U-CIoU decreases rapidly during the initial phase (0–50 epochs), which means that the bounding box is optimized efficiently. Results show that the proposed U-CIoU performs best.
As shown in Figure 16, U-CIoU alleviates box over-expansion and suppresses illumination-driven localization bias. It improves geometric consistency of regression, producing tighter and more stable bounding boxes.

4.2.3. Ablation Experiment

An ablation experiment is discussed in this section to evaluate the performance of the proposed SFD Module, RD-Fusion, and U-CIoU function. YOLOv11s is the baseline model. The results are presented in Table 11.
The proposed SFD module solves the insufficient feature extraction problem in blurry underwater environments by spatial–frequency fusion. Results of lines 1, 2, 5, and 6 indicate that background interference is successfully suppressed by an integrated dynamic proxy mechanism. Results of lines 1, 3, 5, and 7 show that the proposed RD-Fusion effectively captures target-background distinctions and reduces feature redundancy. In addition, anchor box inflation and localization bias in underwater environments are solved by the proposed U-CIoU function. Results of lines 8 and 9 demonstrate that the detection accuracy of the proposed TFDF-YOLO model is greatly improved. TFDF-YOLO is able to identify underwater objects accurately in UWPT tasks.
As shown in Figure 17, each module provides a limited improvement in performance. Optimal performance is achieved by integrated multi-module configurations. It demonstrates that the proposed TFDF-YOLO model has the complementary advantages of all improved modules.
Figure 18 shows that the recognition result of TFDF-YOLO is concentrated in effective areas (UUV, docking station, guidance lights). Contour details are clearly captured, and interference of suspended particles and background clutter is effectively suppressed. The proposed TFDF-YOLO model can greatly improve the success rate of UWPT docking.

5. Conclusions

This work focuses on the visual localization of UUV UWPT docking. The proposed TFDF-YOLO model solves edge degradation, imbalanced multi-scale feature, and bounding-box regression bias problems in a complex underwater environment. A new SFD module is introduced into YOLOv11s to balance the global convergence capability and computational speed. It effectively restores degraded edges by frequency-guided sparsification of Top-K proxy attention and multi-scale spatial convolution. The RD-Fusion strategy is then proposed to recover small-object details by semantic differences. Positional and semantic attributes of different targets are also distinguished by dynamic weighting and global channel attention. The U-CIoU loss function can suppress lighting interference and box over-expansion. Results in multiple scenarios show that the proposed TFDF-YOLO achieves 91.5% precision with only 10.5 M parameters and 21.1 GFLOPs. TFDF-YOLO can identify underwater objects accurately in UUV UWPT docking tasks. This technology is essential for developing underwater wireless charging.
Advanced spatial–frequency modeling strategies and adaptive feature fusion mechanisms could be used in future work to enhance robustness under extreme underwater conditions, such as severe turbidity, dynamic illumination variations, and motion disturbances. In addition, multimodal fusion approaches and sophisticated soft pooling strategies, such as intuitionistic fuzzy pooling, may improve measurement accuracy under uncertainty conditions. The proposed model can be further used in broader underwater scenarios, including deep-sea docking operations, long-term autonomous charging networks, underwater infrastructure inspection, and multi-AUV cooperative systems.

Author Contributions

Conceptualization, H.Y.; methodology, Y.C.; software, Y.C.; validation, H.Y. and Y.C.; formal analysis, Y.C. and W.S.; investigation, H.Y. and Y.C.; resources, H.Y.; data curation, Y.C. and W.S.; writing—original draft preparation, H.Y. and Y.C.; writing—review and editing, H.Y. and Y.C.; visualization, Y.C. and W.S.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the Hainan Provincial Key Research and Development Program (Grant No. ZDYF2024GXJS004).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. No conflicts of interest exist in the submission of this manuscript, and the manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described was original research that has not been published previously, and is not under consideration for publication elsewhere, in whole or in part. All the authors listed have approved the manuscript that is enclosed.

Abbreviations

The following abbreviations are used in this manuscript:
Variables
X_in: Input feature map to the SFD module
X_pre: Preprocessed feature map after edge enhancement
X_loc: Local spatial feature map after multi-scale DW convolutions
X_att: Refined spatial tokens after Stage-II refinement
X_glo: Global interaction feature reshaped from X_att
X_sfd: Final output feature of the SFD module after local–global fusion
F_1 / F_2: Shallow and deep input features
F_con: Concatenated cross-layer features
F_diff: Difference feature between shallow and deep layers
F_out: Fused output feature from RD-Fusion
L_U-CIoU: Underwater-optimized CIoU loss function
Q, K, V: Query/Key/Value token embeddings
Ω_j: Pooling window corresponding to the j-th proxy token
P_j: Local patch aligned with Ω_j for degradation estimation
μ_j: Fourier-based degradation indicator of proxy region Ω_j
ρ: Sparsity ratio
w_i: Illumination-background adaptive weight
λ_reg: Anchor regularization term
α: Dynamic weight matrix for difference features in RD-Fusion
h_i, l_i: Channel attention weights for shallow and deep features
Acronyms
UWPT: Underwater Wireless Power Transfer
UUVs: Unmanned Underwater Vehicles
YOLO: You Only Look Once
SFD module: Spatial–Frequency Decoupling Module
RD-Fusion: Relevance-Difference Fusion
U-CIoU: Underwater-CIoU
DW Conv: Depthwise Convolution
PW Conv: Pointwise Convolution
GAP: Global Average Pooling
DS Conv: Depthwise Separable Convolution
SGD: Stochastic Gradient Descent
P: Precision
R: Recall
mAP: Mean Average Precision
GFLOPs: Giga Floating-point Operations
TPs/FPs/FNs: True Positives, False Positives, False Negatives
PANet: Path Aggregation Network
CBAM: Convolutional Block Attention Module
URPC: Underwater Robot Professional Competition
UDID: Underwater Docking Identification Dataset

References

1. Wang, L.; Zhu, D.; Pang, W.; Liu, X. A survey of underwater search for multi-target using multi-AUV: Task allocation, path planning and formation control. Ocean Eng. 2023, 278, 114393.
2. Zhang, B.; Jiang, C.; Yang, F.; Chen, C.; Lu, Y.; Zhou, J. An anti-rotation wireless power transfer system with a flexible magnetic coupler for autonomous underwater vehicles. IEEE Trans. Power Electron. 2025, 40, 2593–2603.
3. Wang, D.; Zhang, J.; Cui, S.; Bie, Z.; Chen, F.; Zhu, C. The state-of-the-art of underwater wireless power transfer: A comprehensive review and new perspectives. Renew. Sustain. Energy Rev. 2024, 189, 113910.
4. Lin, M.; Lin, R.; Yang, C.; Li, D.; Zhang, Z.; Zhao, Y.; Ding, W. Docking to an underwater suspended charging station: Systematic design and experimental tests. Ocean Eng. 2022, 249, 110766.
5. Liu, J.; Yu, F.; He, B.; Guedes Soares, C. A review of underwater docking and charging technology for autonomous vehicles. Ocean Eng. 2024, 297, 117154.
6. Lin, R.; Zhao, Y.; Li, D.; Lin, M.; Yang, C. Underwater electromagnetic guidance based on the magnetic dipole model applied in AUV terminal docking. J. Mar. Sci. Eng. 2022, 10, 995.
7. Yang, Q.; Liu, H.; Hong, L.; Yu, X.; Chen, J.; Chen, B. Anti-disturbance control strategy in capture stage for AUV dynamic-base docking with optical-guided constraints. Ocean Eng. 2024, 311, 118946.
8. Watt, G.; Roy, A.; Currie, J.; Gillis, C.; Giesbrecht, J.; Heard, G.; Birsan, M.; Seto, M.; Carretero, J.; Dubay, R.; et al. A concept for docking a UUV with a slowly moving submarine under waves. IEEE J. Ocean. Eng. 2016, 41, 471–498.
9. Li, Y.; Sun, K.; Han, Z.; Lang, J. Deep learning-based docking scheme for autonomous underwater vehicles with an omnidirectional rotating optical beacon. Drones 2024, 8, 697.
10. Palomeras, N.; Vallicrosa, G.; Mallios, A.; Bosch, J.; Vidal, E.; Hurtós, N.; Carreras, M.; Ridao, P. AUV homing and docking for remote operations. Ocean Eng. 2018, 154, 106–120.
11. Yan, Z.; Gong, P.; Zhang, W.; Li, Z.; Teng, Y. Autonomous underwater vehicle vision-guided docking experiments based on L-shaped light array. IEEE Access 2019, 7, 72567–72576.
12. Zhong, L.; Li, D.; Lin, M.; Lin, R.; Yang, C. A fast binocular localisation method for AUV docking. Sensors 2019, 19, 1735.
13. Gomes, D.; Saif, A.; Nandi, D. Robust underwater object detection with autonomous underwater vehicle: A comprehensive study. In Proceedings of the International Conference on Computer Advancements (ICCA 2020), Dhaka, Bangladesh, 10–12 January 2020; Article 17.
14. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 13–16 December 2015; pp. 1440–1448.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
19. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
22. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on YOLOv4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706.
23. Lu, D.; Yi, J.; Wang, J. Enhanced YOLOv7 for improved underwater target detection. J. Mar. Sci. Eng. 2024, 12, 1127.
24. Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310.
25. Yan, J.; Zhou, Z.; Zhou, D.; Su, B.; Xuanyuan, Z.; Tang, J.; Lai, Y.; Chen, J.; Liang, W. Underwater object detection algorithm based on attention mechanism and cross-stage partial fast spatial pyramidal pooling. Front. Mar. Sci. 2022, 9, 1056300.
26. Chang, Y.; Li, D.; Gao, Y.; Su, Y.; Jia, X. An improved YOLO model for UAV fuzzy small target image detection. Appl. Sci. 2023, 13, 5409.
27. Zheng, Z.; Yu, W. RG-YOLO: Multi-scale feature learning for underwater target detection. Multimed. Syst. 2025, 31, 26.
28. Wang, Z.; Ruan, Z.; Chen, C. DyFish-DETR: Underwater fish image recognition based on detection transformer. J. Mar. Sci. Eng. 2024, 12, 864.
29. Han, Z.; Yue, Z.; Liu, L. 3L-YOLO: A lightweight low-light object detection algorithm. Appl. Sci. 2025, 15, 90.
30. Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U. PE-Transformer: Path enhanced transformer for improving underwater object detection. Expert Syst. Appl. 2024, 246, 123253.
31. Zhang, Y.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IoU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
32. Rajafillah, C.; El Moutaouakil, K.; Patriciu, A.-M.; Yahyaouy, A.; Riffi, J. INT-FUP: Intuitionistic Fuzzy Pooling. Mathematics 2024, 12, 1740.
33. Liu, S.; Ozay, M.; Okatani, T.; Xu, H.; Sun, K.; Lin, Y. Detection and pose estimation for short-range vision-based underwater docking. IEEE Access 2019, 7, 2720–2749.
34. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
35. Han, K.; Wang, Y.; Guo, J.; Wu, E. ParameterNet: Parameters are all you need for large-scale visual pretraining of mobile networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024; pp. 15751–15761.
36. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.; Chan, S. Run, don't walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
37. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703.
38. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 1–6 October 2023; pp. 6070–6079.
39. Finder, S.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV 2024), Milan, Italy, 29 September–4 October 2024; Volume 15112, pp. 363–380.
40. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
41. Tan, M.; Pang, R.; Le, Q. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
42. Kang, M.; Ting, C.; Ting, F.; Phan, R. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057.
43. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A report on real-time object detection design. arXiv 2022.
44. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
45. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666.
46. Zhang, H.; Zhang, S. Shape-IoU: More accurate metric considering bounding box shape and scale. arXiv 2024, arXiv:2312.17663.
Figure 1. Moored UWPT docking.
Figure 2. The overall architecture of YOLOv11s.
Figure 3. The overall architecture of the proposed TFDF-YOLO.
Figure 4. Spatial–frequency decoupling (SFD) module structure.
Figure 5. RD-Fusion strategy structure.
Figure 6. Multi-source dataset sources. (a) Docking station framework. (b) UUV prototype. (c,d) Water-entry deployment. (e) Sea trial site. (f) Surface docking. (g) Underwater docking. (h) Wireless charging. (i) Scaled UUV model. (j) UDID.
Figure 7. Representative dataset samples. (a) Normal. (b) Target occlusion. (c) Water turbidity. (d) Motion blur. (e) Low contrast.
Figure 8. Width and height of bounding boxes for datasets.
Figure 9. Detection results of different models.
Figure 10. Normalized radar chart of model performance metrics.
Figure 11. Convergence curve of different modules. (a) Curve of precision. (b) Curve of recall. (c) Curve of mAP@0.5. (d) Curve of mAP@0.5:0.95.
Figure 12. Recognition results. (a) Input images. (b) Results of other variants. (c) Results of SFD module.
Figure 13. Convergence curve of different methods. (a) Curve of precision. (b) Curve of recall. (c) Curve of mAP@0.5. (d) Curve of mAP@0.5:0.95.
Figure 14. Detection results of guide lights. (a) Input images. (b) Results of other variants. (c) Results of RD-Fusion.
Figure 15. Box loss for different IoU loss functions.
Figure 16. Detection results of different loss functions. (a) Input images. (b) Results of other variants. (c) Results of U-CIoU.
Figure 17. Performance of improved modules.
Figure 18. Underwater recognition result heatmap.
Table 1. Dataset split and annotation statistics.
Subset | Image Count | UUV (Instances) | Dock (Instances) | Light (Instances)
Training set | 6592 | 1479 | 2451 | 4636
Valid set | 824 | 295 | 490 | 963
Test set | 824 | 247 | 445 | 927
Table 2. Recognition accuracy results.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
Faster-RCNN | 77.8 ± 0.5 (77.2–78.4) | 72.5 ± 0.6 (71.8–73.2) | 69.1 ± 0.5 (68.5–69.7) | 56.3 ± 0.5 (55.7–56.9) | 40.7 | 211.4
SSD | 79.6 ± 0.4 (79.1–80.1) | 73.3 ± 0.4 (72.8–73.8) | 70.3 ± 0.5 (69.7–70.9) | 57.2 ± 0.4 (56.7–57.7) | 25.6 | 36.2
RT-DETR-R18 | 84.2 ± 0.4 (83.7–84.7) | 78.6 ± 0.4 (78.1–79.1) | 85.1 ± 0.4 (84.6–85.6) | 60.2 ± 0.4 (59.7–60.7) | 39.4 | 58.8
YOLOv5s | 82.3 ± 0.3 (81.9–82.7) | 76.2 ± 0.3 (75.8–76.6) | 82.9 ± 0.4 (82.4–83.4) | 58.7 ± 0.4 (58.2–59.2) | 7.1 | 16.7
YOLOv8s | 83.6 ± 0.2 (83.3–83.9) | 77.8 ± 0.2 (77.5–78.1) | 84.2 ± 0.3 (83.8–84.6) | 60.3 ± 0.3 (59.9–60.7) | 10.6 | 28.5
YOLOv9s | 80.7 ± 0.4 (80.2–81.2) | 75.2 ± 0.4 (74.7–75.7) | 81.3 ± 0.4 (80.8–81.8) | 59.4 ± 0.4 (58.9–59.9) | 6.4 | 22.6
YOLOv10s | 83.1 ± 0.3 (82.7–83.5) | 76.4 ± 0.3 (76.0–76.8) | 82.3 ± 0.3 (81.9–82.7) | 59.3 ± 0.3 (58.9–59.7) | 8.4 | 24.5
YOLOv11s | 84.9 ± 0.2 (84.7–85.1) | 79.1 ± 0.2 (78.9–79.3) | 85.9 ± 0.2 (85.7–86.1) | 61.1 ± 0.2 (60.9–61.3) | 10.1 | 21.5
TFDF-YOLO | 91.5 ± 0.2 (91.3–91.7) | 87.2 ± 0.2 (87.0–87.4) | 92.7 ± 0.2 (92.5–92.9) | 67.5 ± 0.2 (67.3–67.7) | 10.5 | 21.1
Table 3. Recognition accuracy results (per-class AP).
Model | AP-UUV (%) | AP-Docking (%) | AP-Light (%) | mAP (%)
Faster-RCNN | 72.5 ± 0.5 (71.9–73.1) | 70.3 ± 0.5 (69.7–70.9) | 64.5 ± 0.5 (63.9–65.1) | 69.1 ± 0.5 (68.5–69.7)
SSD | 74.2 ± 0.4 (73.7–74.7) | 71.8 ± 0.4 (71.3–72.3) | 66.7 ± 0.4 (66.2–67.2) | 70.3 ± 0.5 (69.7–70.9)
RT-DETR-R18 | 86.3 ± 0.4 (85.8–86.8) | 84.7 ± 0.4 (84.2–85.2) | 80.5 ± 0.4 (80.0–80.7) | 85.1 ± 0.4 (84.6–85.6)
YOLOv5s | 83.6 ± 0.3 (83.2–84.0) | 81.9 ± 0.3 (81.5–82.3) | 78.2 ± 0.3 (77.8–78.6) | 82.9 ± 0.4 (82.4–83.4)
YOLOv8s | 85.1 ± 0.2 (84.8–85.4) | 83.5 ± 0.2 (83.2–83.8) | 79.6 ± 0.2 (79.3–79.9) | 84.2 ± 0.3 (83.8–84.6)
YOLOv9s | 82.4 ± 0.4 (81.9–82.9) | 80.7 ± 0.4 (80.2–81.2) | 76.9 ± 0.4 (76.4–77.4) | 81.3 ± 0.4 (80.8–81.8)
YOLOv10s | 84.5 ± 0.3 (84.1–84.9) | 82.8 ± 0.3 (82.4–83.2) | 79.1 ± 0.3 (78.7–79.5) | 82.3 ± 0.3 (81.9–82.7)
YOLOv11s | 87.2 ± 0.2 (87.0–87.4) | 85.6 ± 0.2 (85.4–85.8) | 81.3 ± 0.2 (81.1–81.5) | 85.9 ± 0.2 (85.7–86.1)
TFDF-YOLO | 93.5 ± 0.2 (93.3–93.7) | 91.8 ± 0.2 (91.6–92.0) | 88.6 ± 0.2 (88.4–88.8) | 92.7 ± 0.2 (92.5–92.9)
Table 4. Ablation on kernel size selection.
Kernel Size | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
Baseline | 84.9 | 79.1 | 85.9 | 61.1 | 10.1 | 21.5
3 | 85.1 | 81.5 | 86.8 | 63.5 | 10.2 | 20.8
5 | 84.6 | 79.8 | 85.7 | 62.3 | 10.5 | 21.6
7 | 83.2 | 77.1 | 82.7 | 61.7 | 10.7 | 22.5
3, 5, 7 | 83.6 | 79.4 | 86.3 | 62.8 | 11.4 | 24.1
1, 3, 7 | 86.5 | 82.9 | 87.9 | 64.1 | 10.9 | 22.3
1, 5, 7 | 84.2 | 80.1 | 85.7 | 62.3 | 11.2 | 23.5
1, 3, 5 | 87.8 | 83.8 | 89.4 | 65.8 | 10.7 | 21.8
Table 5. Ablation on single sparsity ratio for Top-K in SFD.
Setting (Single ρ) | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
Baseline | 84.9 | 79.1 | 85.9 | 61.1 | 10.1 | 21.5
ρ = 1/6 | 82.7 | 77.7 | 84.1 | 58.4 | 10.2 | 21.5
ρ = 1/4 | 83.6 | 78.5 | 84.9 | 59.8 | 10.4 | 21.6
ρ = 1/3 | 85.9 | 82.5 | 87.5 | 64.3 | 10.5 | 21.8
ρ = 1/2 | 85.5 | 81.8 | 87.3 | 64.5 | 10.7 | 22.1
ρ = 2/3 | 85.1 | 81.2 | 86.6 | 64.1 | 10.7 | 22.3
ρ = 1.0 | 84.7 | 80.5 | 86.3 | 63.9 | 10.9 | 22.3
Table 6. Ablation on grouped sparsity ratios for Top-K in SFD.
Setting (Grouped ρ: ρ_large, ρ_mid, ρ_small) | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
ρ = 1/3 | 85.9 | 82.5 | 87.5 | 64.3 | 10.5 | 21.8
(1/6, 1/3, 2/3) | 86.8 | 83.6 | 87.9 | 64.7 | 10.6 | 21.9
(1/4, 1/3, 2/3) | 87.4 | 84.0 | 88.5 | 65.2 | 10.8 | 22.1
(1/4, 1/3, 1/2) | 87.8 | 83.8 | 89.4 | 65.8 | 10.7 | 21.8
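
To make the grouped setting in Table 6 concrete, the short sketch below translates the best configuration (ρ_large, ρ_mid, ρ_small) = (1/4, 1/3, 1/2) into per-scale Top-K budgets, i.e., the number of proxy tokens retained for attention at each detection scale. The proxy-token counts are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical numbers of proxy tokens on the three detection scales
# (the counts are assumptions for illustration, not values from the paper).
proxy_tokens = {"large": 400, "mid": 100, "small": 25}

# Best grouped sparsity ratios reported in Table 6.
rho = {"large": 1 / 4, "mid": 1 / 3, "small": 1 / 2}

# Top-K budget per scale: retain ceil(rho * N) proxy tokens for attention.
top_k = {scale: max(1, math.ceil(rho[scale] * n)) for scale, n in proxy_tokens.items()}
print(top_k)  # {'large': 100, 'mid': 34, 'small': 13}
```

The single-ρ setting of Table 5 corresponds to applying the same ratio to all three scales.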
Table 7. Ablation on illumination intensity computation strategies.
Strategies | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%)
Global image averaging intensity | 88.3 | 83.5 | 89.1 | 64.2
Bounding-box median intensity | 90.7 | 86.1 | 91.6 | 66.8
Histogram-peak intensity | 91.0 | 86.5 | 92.0 | 66.9
Bounding-box mean intensity | 91.5 | 87.2 | 92.7 | 67.5
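
The four rows of Table 7 differ only in which grayscale statistic is used as the illumination cue behind the adaptive weight w_i in U-CIoU. The NumPy sketch below computes all four candidate statistics for one image and one bounding box; the box format, histogram binning, and example image are illustrative assumptions, and the mapping from the chosen statistic to w_i is not reproduced here.

```python
import numpy as np


def intensity_statistics(gray: np.ndarray, box: tuple) -> dict:
    """Candidate illumination cues compared in Table 7.
    gray: H x W image with values in [0, 255]; box: (x1, y1, x2, y2) in pixels
    (the box format and the 32-bin histogram are illustrative assumptions)."""
    x1, y1, x2, y2 = box
    roi = gray[y1:y2, x1:x2].astype(np.float64)

    hist, edges = np.histogram(gray, bins=32, range=(0, 255))
    peak = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])

    return {
        "global_mean": float(gray.mean()),    # global image averaging intensity
        "box_median": float(np.median(roi)),  # bounding-box median intensity
        "hist_peak": float(peak),             # histogram-peak intensity
        "box_mean": float(roi.mean()),        # bounding-box mean intensity (best row)
    }


# Example: a dark frame containing one bright beacon-like region.
frame = np.full((120, 160), 40, dtype=np.uint8)
frame[40:80, 60:100] = 200
print(intensity_statistics(frame, (60, 40, 100, 80)))
```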
Table 8. Different feature extraction modules.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
C3k2 | 84.9 | 79.1 | 85.9 | 61.1 | 10.1 | 21.5
C3k2-GhostDynamicConv | 84.1 | 78.6 | 85.7 | 60.7 | 8.9 | 21.7
C3k2-Faster | 85.5 | 80.8 | 84.8 | 63.3 | 9.3 | 23.2
C3k2-DySnakeConv | 86.3 | 82.7 | 86.9 | 63.5 | 10.4 | 22.1
C3k2-Star | 86.7 | 81.8 | 86.3 | 64.2 | 11.2 | 22.4
C3k2-WTConv | 87.2 | 82.2 | 87.6 | 64.9 | 9.6 | 20.1
SFD Module (Ours) | 87.8 | 83.8 | 89.4 | 65.8 | 10.7 | 21.8
Table 9. Different feature fusion strategies.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
PA-FPN | 84.9 | 79.1 | 85.9 | 61.1 | 10.1 | 21.5
SlimNeck | 85.3 | 79.3 | 83.9 | 59.3 | 9.8 | 20.0
Bi-FPN | 86.5 | 81.1 | 86.0 | 62.7 | 9.7 | 20.9
ASF | 86.9 | 82.2 | 87.5 | 62.9 | 10.0 | 21.3
G-FPN | 87.5 | 84.8 | 87.9 | 63.5 | 10.3 | 21.1
RD-Fusion (Ours) | 87.1 | 85.5 | 89.1 | 64.7 | 9.8 | 20.7
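
As a companion to the RD-Fusion entry in Table 9, the following minimal PyTorch sketch shows one plausible way the RD-Fusion variables listed in the Abbreviations (F_1, F_2, F_concat, F_diff, α, h_i, l_i, F_out) could interact. The 1×1 alignment convolution, the GAP-driven channel gate, and the sigmoid gate for α are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RDFusionSketch(nn.Module):
    """Illustration-only sketch of the RD-Fusion variables; not the paper's design."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.align = nn.Conv2d(channels, channels, 1)   # align the deep feature F_2
        # GAP-driven channel attention producing h_i (shallow) and l_i (deep) weights.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Dynamic weight alpha for the difference feature F_diff.
        self.alpha = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # F_1: shallow feature; F_2: deep feature, upsampled to F_1's resolution.
        f2 = self.align(F.interpolate(f2, size=f1.shape[-2:], mode="nearest"))

        f_concat = torch.cat([f1, f2], dim=1)           # F_concat
        f_diff = f1 - f2                                # F_diff

        h, l = self.gate(f_concat).chunk(2, dim=1)      # h_i, l_i channel weights
        alpha = self.alpha(f_diff)                      # dynamic difference weight

        return self.out(h * f1 + l * f2 + alpha * f_diff)  # F_out
```

For example, `RDFusionSketch(64)(torch.randn(1, 64, 40, 40), torch.randn(1, 64, 20, 20))` returns a (1, 64, 40, 40) tensor.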
Table 10. Results of different loss functions.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%)
CIoU | 84.9 | 79.1 | 85.9 | 61.1
DIoU | 83.7 | 78.3 | 85.5 | 59.3
GIoU | 85.3 | 79.6 | 85.7 | 61.3
Shape-IoU | 85.5 | 80.6 | 86.1 | 61.6
U-CIoU (Ours) | 85.7 | 80.8 | 86.6 | 61.5
Table 11. Ablation experiment results (✓ = component enabled, – = not used).
SFD | RD-F | U-CIoU | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Param/M | GFLOPs
– | – | – | 84.9 | 79.1 | 85.9 | 61.1 | 10.1 | 21.5
✓ | – | – | 87.8 | 83.8 | 89.4 | 65.8 | 10.7 | 21.8
– | ✓ | – | 87.1 | 85.5 | 89.1 | 64.7 | 9.8 | 20.7
– | – | ✓ | 85.7 | 80.8 | 86.6 | 61.5 | 10.1 | 21.5
✓ | ✓ | – | 90.2 | 86.4 | 91.4 | 67.3 | 10.5 | 21.1
✓ | – | ✓ | 88.9 | 86.0 | 90.1 | 66.9 | 10.7 | 21.8
– | ✓ | ✓ | 88.0 | 86.9 | 89.5 | 66.6 | 9.8 | 20.7
✓ | ✓ | ✓ | 91.5 (↑6.6) | 87.2 (↑8.1) | 92.7 (↑6.8) | 67.5 (↑6.4) | 10.5 (↑0.4) | 21.1 (↓0.4)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
