1. Introduction
Underwater target detection and segmentation in SSS imagery are critical for marine exploration, wreck recovery, and pipeline inspection. However, SSS data exhibit intrinsic challenges such as speckle noise, geometric distortions, and low signal-to-noise ratios (SNRs) due to acoustic scattering and seabed heterogeneity [1]. Traditional methods rely on handcrafted features (e.g., texture descriptors [2]) or SVM classifiers [3], but their performance degrades under complex underwater conditions. Recent advances in deep learning have improved accuracy, yet the following three critical gaps persist: (1) limited adaptability to SSS-specific noise patterns, (2) inefficient multi-scale feature fusion for distorted targets, and (3) excessive computational costs.
Convolutional neural networks (CNNs) have dominated SSS target detection and segmentation. U-Net variants [4] achieved pixel-level segmentation of seabed textures by leveraging skip connections [5], while Mask R-CNN [6] enabled joint detection and segmentation through region-based optimization [7]. Topological Data Analysis (TDA) [8] enhances CNN interpretability by capturing the topological features of data spaces, although its high computational complexity requires careful consideration, with current optimizations focusing on GPU-accelerated parallel computing. Standard convolutions exhibit limitations in modeling SSS noise distributions, resulting in false positives within low-SNR regions [9]. To enhance robustness, the Convolutional Block Attention Module (CBAM) [10] has been incorporated into the Feature Pyramid Network (FPN) framework [11], though its fixed-weight filters require further optimization for dynamic acoustic environments [12]. The Sparse Attention U-Net [13] introduces a dynamic sparse attention mechanism that effectively suppresses background noise by focusing on target regions, establishing a novel paradigm for weakly supervised sonar segmentation; however, its generalizability depends strongly on training data quality.
Recent advances in imaging primarily address the following three thematic challenges: (1) Lightweight architectures—Zhang et al. [14] combined Ghost modules with a Bidirectional Feature Pyramid Network (BiFPN) to improve small-object segmentation in botanical imaging, while Wang et al. [15] and Li et al. [16] developed task-specific lightweight networks for medical CT and chip pad detection, respectively, demonstrating parameter reduction without accuracy loss. (2) Noise suppression—Zheng et al. [17] and Weng et al. [18] employed spatial-channel attention mechanisms to enhance blurred target detection in sonar imagery, with Chen et al.'s AquaYOLO [19] achieving high mAP in turbid waters through adaptive feature selection. (3) Edge deployment—current research emphasizes hardware-efficient designs compatible with mobile GPUs. While these advancements provide foundational insights, critical SSS-specific gaps remain, outlined as follows:
Dynamic noise adaptation: Fixed-weight filters in existing attention mechanisms show limited efficacy against SSS-specific coherent noise patterns.
Geometric edge distortion tolerance: Multi-scale fusion approaches inadequately address SSS artifacts caused by non-linear sensor motions.
Parameter efficiency trade-offs: Architectural complexity in current methods increases computational costs versus baseline models, challenging real-time SSS processing.
Real-time processing on UUVs demands model compression. MobileNet [20] and EfficientDet [21] reduced parameters via depthwise convolutions, yet they sacrificed segmentation precision [22]. YOLO-based approaches [23] balanced speed and accuracy but lacked dedicated modules for SSS geometric distortions [24]. YOLOv4-Tiny [25] introduces channel pruning and 8-bit quantization to achieve real-time detection at 45 FPS, verifying the feasibility of model compression for UUV deployment, although accuracy and speed still need to be balanced. Dynamic Neural Architecture UUV (DNA-UUV) [26] adjusts model depth and width in real time according to available hardware resources, reducing energy consumption by 40%; it provides flexible computing solutions for heterogeneous UUV platforms, but its architecture-switching mechanism requires further optimization. The two-stage model [27] achieved end-to-end optimization of small-sample sonar segmentation for the first time: target shadow features are first used to locate the initial region, which is then refined with a level set algorithm, while optical image data are transferred to enhance small-sample performance; however, its computational efficiency needs improvement. The lightweight U-Net combined with heterogeneous filters [28] achieves 25 FPS real-time segmentation on a Field Programmable Gate Array (FPGA)-embedded platform with energy consumption below 5 W, providing a low-power solution for UUV deployment. However, in strong noise environments its segmentation mIoU decreases by about 10%, so its generalization ability in complex environments must be strengthened.
Knowledge distillation [29] and quantization [30] further optimized efficiency, but most methods ignored the interplay between noise suppression and multi-task learning. Spline-based networks [31] and Kolmogorov–Arnold representations [32] recently gained traction for interpretable feature learning. For example, B-spline CNNs [33] achieved noise-adaptive filtering in medical imaging, while deformable kernels [34] improved geometric invariance. However, these works focused on optical or synthetic aperture sonar (SAS) data [35], leaving SSS-specific adaptations unexplored. In addition, the meta-learning-based MAML framework [36] requires only 10 annotated images to adapt to new acoustic devices, reaching an mIoU of 75.2%. It addresses small-sample sonar segmentation, but its domain adaptation module needs strengthening, and its generalization across device-specific noise distributions is unstable (±8% mIoU fluctuation).
To bridge these gaps, we propose CKAN-YOLOv8, a lightweight multi-task network integrating KANConv into the YOLOv8 framework. Our key innovations include the following:
KANConv blocks: Replacing standard convolutions with learnable B-spline activations to dynamically suppress SSS noise while preserving edge details.
KANConv-PAN: A deformable feature pyramid network using spline-parameterized kernels to correct geometric edge distortions and fuse multi-scale targets.
Dual-task head: Combining CIoU Loss for detection with Dice Loss for segmentation to refine boundary-sensitive masks.
The remainder of this paper is organized as follows: Section 2 reviews related work on sonar data and YOLOv8, Section 3 details the architecture of CKAN-YOLOv8, Section 4 presents experimental results, and Section 5 provides a conclusion.
3. Proposed Method
To address the intertwined challenges of low SNRs, geometric distortions in SSS imagery, and computational constraints on UUVs, this paper proposes CKAN-YOLOv8—a lightweight multi-task network. The core innovations are hierarchically structured as follows:
(1) Dynamic noise suppression: The KANConv module replaces standard convolutions with learnable B-spline basis functions, enabling input-adaptive nonlinear activations that suppress speckle noise while preserving target edge details.
(2) Geometric distortion correction: The KANConv-PANet deformable feature pyramid dynamically adjusts multi-scale fusion weights via spline parameterization, mitigating feature misalignment caused by target deformation and scale variations.
(3) Multi-task synergy: A dual-task loss function (CIoU + Dice) jointly optimizes detection box localization accuracy and segmentation mask boundary continuity, resolving target blurring and background adhesion in SSS imagery.
Through the hierarchical collaboration of these innovations, CKAN-YOLOv8 achieves balanced improvements in noise robustness, geometric consistency, and real-time inference efficiency under a lightweight framework, offering an end-to-end solution for intelligent perception in complex underwater environments.
The methodological pipeline of this study is illustrated in Figure 3 and comprises the following three key phases:
Data preprocessing: Raw SSS images undergo noise suppression and resolution standardization to generate inputs compatible with deep learning models (a minimal preprocessing sketch follows this list).
Multi-task model processing: The CKAN-YOLOv8 framework sequentially employs dynamic feature extraction via C2f-KANConv (KC2f) modules, multi-scale feature fusion through KANConv-PANet, and joint optimization with a cascaded loss function, enhancing segmentation and detection capabilities for SSS images.
Result validation: Model performance is evaluated using quantitative metrics (AP@0.5, IoU) and real-time inference speed (FPS), supplemented by visual analysis to verify geometric consistency between segmentation boundaries and detection boxes, thereby ensuring algorithmic accuracy and reliability.
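As a minimal sketch of the preprocessing phase, assuming a median filter for speckle-like noise and a fixed 640 × 640 input resolution (both illustrative choices, not the paper's reported settings):

```python
import cv2
import numpy as np

def preprocess_sss(image_path, size=640):
    """Illustrative SSS preprocessing: noise suppression + resolution standardization."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.medianBlur(img, 5)                                   # suppress speckle-like noise
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
    return img.astype(np.float32) / 255.0                          # normalize to [0, 1]
```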
3.1. Structure of the CKAN-YOLOv8 Model
CKAN-YOLOv8 is a deeply optimized framework based on YOLOv8, designed to address noise suppression, geometric edge optimization, and lightweight deployment. Its architecture comprises four key stages, as illustrated in Figure 4.
The architecture implements a task-driven positive sample matching strategy, dynamically weighting classification confidence and localization accuracy during anchor assignment. For loss optimization, it combines the following:
Classification: Binary Cross-Entropy (BCE) for object/non-object differentiation.
Localization: Distribution Focal Loss (DFL) for probability distribution-aware regression.
Bounding box refinement: CIoU metric to address aspect ratio discrepancies.
Edge segmentation: Dice Loss enhances sensitivity to sparse edge pixels and segmentation accuracy by emphasizing overlap optimization between predicted and true edges.
Enhanced by its modular design and adaptive training protocols, YOLOv8 achieves real-time detection and segmentation with computational efficiency. Extending this framework, CKAN-YOLOv8 integrates KANConv into the backbone network and a mask branch into the detection head, leveraging the highest-resolution feature map from the feature pyramid network as input to a prototype generator (Protonet) that produces primitive mask templates. The enhanced prediction head jointly outputs bounding box coordinates, class probabilities, and mask coefficients, which dynamically weight the prototypes through matrix multiplication after non-maximum suppression (NMS). Instance-specific masks are synthesized via coordinate-aligned cropping based on the predicted boxes and threshold-based binarization, with Dice Loss integrated into multi-task optimization to refine boundary-sensitive segmentation.
Coordinate-aligned cropping preserves geometric proportions by extracting target regions within predicted bounding box coordinates via bilinear interpolation, eliminating boundary misalignment caused by conventional affine transformations (e.g., rotation/scaling). This process is essential to (1) ensure spatial alignment between segmentation masks and detection boxes, preventing feature drift in multi-task learning, and (2) maintain aspect ratios to minimize resampling distortion, enhancing pixel-level precision for boundary-sensitive tasks. Technically, fixed-size Region of Interest (ROI) grids are generated from predicted boxes, with geometric continuity preserved through differentiable bilinear sampling before threshold-based binarization produces instance masks, jointly optimized by Dice Loss for boundary-aware segmentation.
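As a rough illustration of the prototype-weighting step just described, the sketch below linearly combines prototypes with per-instance mask coefficients, crops to each predicted box, and binarizes the result; for brevity it crops by zeroing pixels outside the box rather than using the ROI-grid bilinear sampling described above, and all names and shapes are assumptions.

```python
import torch

def synthesize_masks(protos, coeffs, boxes, thresh=0.5):
    """YOLACT-style mask assembly (sketch).
    protos: (P, H, W) prototype masks from the Protonet branch.
    coeffs: (N, P) mask coefficients for the N detections surviving NMS.
    boxes:  (N, 4) predicted boxes in normalized (x1, y1, x2, y2) coordinates."""
    P, H, W = protos.shape
    # Linear combination of prototypes, one mask per detection.
    masks = torch.sigmoid(torch.einsum("np,phw->nhw", coeffs, protos))
    # Simplified coordinate-aligned cropping: zero out everything outside each box.
    ys = torch.linspace(0, 1, H).view(1, H, 1)
    xs = torch.linspace(0, 1, W).view(1, 1, W)
    x1, y1, x2, y2 = [boxes[:, i].view(-1, 1, 1) for i in range(4)]
    inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
    return (masks * inside) > thresh  # (N, H, W) boolean instance masks
```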
Figure 4 illustrates the CKAN-YOLOv8 architecture.
3.2. KAN Convolutions
Kolmogorov–Arnold Networks (KANs) [37], a novel deep learning architecture, are grounded in the Kolmogorov–Arnold representation theorem. This theorem asserts that any multivariate continuous function can be expressed as a finite composition of univariate functions. Unlike traditional multilayer perceptrons (MLPs) that fix nonlinear activations on nodes, KANs innovatively position learnable activation functions—parameterized via B-splines—along network edges (weights). This design enables adaptive nonlinear transformations tailored to input patterns. By leveraging this structural paradigm, KANs exhibit enhanced capabilities in modeling intricate relationships, achieving state-of-the-art performance in tasks such as time-series forecasting, graph-structured data analysis, and convolutional feature learning. The architectural details of KANs are visualized in Figure 5.
Its mathematical form is as follows:

$$ f(x_1, \ldots, x_n) = \sum_{i=1}^{2n+1} \Phi_i\!\left( \sum_{j=1}^{n} \phi_{i,j}(x_j) \right) $$

In this formulation, $x_j$ denotes the $j$-th dimension of the input vector, while $\phi_{i,j}$ represents a learnable univariate function applied to the $j$-th input along the $i$-th computational path. The function $\Phi_i$ at the output layer aggregates these intermediate results into final predictions through another learnable univariate transformation. By hierarchically stacking multiple KAN layers, the network constructs deep architectures via adaptive nonlinear compositions. This design employs a divide-and-conquer strategy: high-dimensional functions are decomposed into combinations of low-dimensional univariate components, effectively addressing the gradient vanishing problem inherent in traditional MLPs caused by the curse of dimensionality. Simultaneously, the spline-based approximation framework ensures parametric efficiency and preserves interpretability through localized function interactions. The stacked network can be written as

$$ \mathrm{KAN}(\mathbf{x}) = \left( \Phi_{L-1} \circ \cdots \circ \Phi_1 \circ \Phi_0 \right)(\mathbf{x}) $$

where $\Phi_l$ is the function matrix of the $l$-th layer, and each element is a learnable univariate activation function.
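For intuition, a minimal sketch of a single KAN layer is given below, assuming PyTorch and degree-1 (piecewise-linear) B-spline basis functions on a fixed uniform grid; published KAN implementations use higher-order splines with adaptive grids and an added base activation, so this is an illustrative simplification rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SplineEdge(nn.Module):
    """One learnable univariate function phi(x): a linear combination of
    degree-1 B-spline (hat) basis functions on a uniform knot grid."""
    def __init__(self, num_knots=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.step = (x_max - x_min) / (num_knots - 1)
        # Control points initialized so that phi(x) is approximately the identity.
        self.coeffs = nn.Parameter(self.knots.clone())

    def forward(self, x):
        # Hat-function basis: B_k(x) = max(0, 1 - |x - t_k| / step)
        basis = torch.clamp(1 - (x.unsqueeze(-1) - self.knots).abs() / self.step, min=0)
        return basis @ self.coeffs  # phi(x) = sum_k c_k * B_k(x)

class KANLayer(nn.Module):
    """y_i = sum_j phi_{i,j}(x_j): every edge (j -> i) carries its own
    learnable univariate function instead of a scalar weight."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.edges = nn.ModuleList(
            nn.ModuleList(SplineEdge() for _ in range(in_dim)) for _ in range(out_dim)
        )

    def forward(self, x):  # x: (batch, in_dim)
        rows = [sum(phi(x[:, j]) for j, phi in enumerate(row)) for row in self.edges]
        return torch.stack(rows, dim=1)  # (batch, out_dim)

# Toy usage: a two-layer KAN mapping 4 inputs to 1 output.
net = nn.Sequential(KANLayer(4, 8), KANLayer(8, 1))
out = net(torch.randn(16, 4))  # shape (16, 1)
```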
Traditional CNNs rely on fixed-weight linear filters and static nonlinear activations (e.g., ReLU), which struggle to dynamically adapt to the complex noise distributions and geometric edge distortions in SSS imagery. To address this, we integrate KANConv (as depicted in Figure 6) into the model, which replaces linear convolution with learnable nonlinear basis functions inspired by KAN theory, enabling input-adaptive feature extraction.
The core design is as follows:
For an input feature map $X$, KANConv applies learnable B-spline basis functions to nonlinearly map local receptive fields:

$$ Y(i, j) = \sum_{c=1}^{C} \sum_{p=1}^{H} \sum_{q=1}^{W} \phi_{c,(p,q)}\big( X(c,\, i + p,\, j + q) \big) $$

where $C$, $H$, and $W$ denote the input channels, kernel height, and width, and $\phi_{c,(p,q)}$ is the combination of learnable nonlinear basis functions for the $c$-th input channel and convolution kernel position $(p, q)$. The B-spline basis functions are dynamically adjusted via gradient descent during training to replace linear CNN kernels, and the basis-function coefficients are optimized by backpropagation so that the model adapts to the data distribution. Because the training gradients of these univariate functions are more stable, KANConv also alleviates the gradient vanishing problem of traditional CNNs.
Implementation details. Basis function initialization: Uniformly distributed B-spline control points initialize the activation functions as near-identity mappings, ensuring behavior consistent with standard convolutions in early training. Dynamic fine-tuning: Control points are optimized via backpropagation to adapt basis functions to noise patterns.
Key advantages. Dynamic noise suppression: The smoothness and local support properties of B-spline basis functions allow for the adaptive filtering of sonar speckle noise (high-frequency interference) while preserving target edges (low-frequency structures). Parameter efficiency: Compared to standard convolutions, KANConv reduces parameters through sparse B-spline parameterization (storing only control points). Gradient stability: The continuous differentiability of B-splines (existence of first- and second-order derivatives) mitigates the abrupt gradient changes caused by piecewise functions like ReLU, improving training stability.
In the CKAN-YOLOv8 framework for SSS image processing, the B-spline coefficient optimization mechanism achieves dynamic balance between speckle suppression and edge preservation through end-to-end deep learning. Our approach reconstructs B-spline kernels as differentiable modules integrated into the feature extraction backbone. Guided by backpropagated gradients from the detection loss, the coefficients adaptively evolve as follows: they reduce sensitivity in noisy regions to suppress high-frequency artifacts while amplifying contrast near edges to preserve geometric edge fidelity. This self-adjusting capability allows the network to implicitly learn noise-edge discrimination rules directly from the data, overcoming both the edge-blurring drawbacks of conventional methods and the environmental rigidity of static parameter designs.
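To make this concrete, the following PyTorch sketch shows one way such a KANConv layer could be realized, assuming degree-1 B-spline (hat) basis functions on a fixed uniform grid, near-identity initialization of the control points, stride 1, and same-size padding; the class name, knot count, and grid range are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KANConv2d(nn.Module):
    """Sketch of a KAN-style convolution: each (input channel, kernel position)
    edge carries a learnable spline activation whose responses are summed over
    the receptive field, replacing the fixed linear kernel of a standard conv."""
    def __init__(self, in_ch, out_ch, k=3, padding=1, num_knots=8, x_range=2.0):
        super().__init__()
        self.k, self.padding = k, padding
        grid = torch.linspace(-x_range, x_range, num_knots)
        self.register_buffer("knots", grid)
        self.step = 2 * x_range / (num_knots - 1)
        # One set of spline control points per (output channel, edge);
        # initialized so each edge function starts close to the identity.
        self.coeffs = nn.Parameter(grid.repeat(out_ch, in_ch * k * k, 1))

    def forward(self, x):  # x: (B, C_in, H, W), stride 1, "same" padding
        B, _, H, W = x.shape
        patches = F.unfold(x, self.k, padding=self.padding).transpose(1, 2)  # (B, L, E)
        # Degree-1 B-spline (hat) basis: B_k(v) = max(0, 1 - |v - t_k| / step)
        basis = torch.clamp(1 - (patches.unsqueeze(-1) - self.knots).abs() / self.step, min=0)
        # phi_{o,e}(v_e) = sum_k coeffs[o, e, k] * B_k(v_e), then sum over edges e.
        out = torch.einsum("blek,oek->blo", basis, self.coeffs)
        return out.transpose(1, 2).reshape(B, -1, H, W)

# Example: KANConv2d(3, 16)(torch.randn(2, 3, 64, 64)).shape -> (2, 16, 64, 64)
```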
3.3. Cross-Stage Partial Fusion with Two KAN Convolutions
The traditional Cross-Stage Partial (CSP) module separates feature streams and concatenates multi-branch outputs, but this approach introduces channel dimension expansion and significant computational overhead. To address the redundant computation and gradient fragmentation in conventional CSP modules, YOLOv8 introduces the C2f module [38]. This architecture optimizes cross-stage multi-scale feature interaction efficiency through a dual-branch lightweight design and cross-stage gradient enhancement mechanisms.
The core architecture of C2f comprises the following:
Dual-convolution lightweight branches: Compress the multi-branch convolutions of traditional CSP modules into two parallel branches performing 1 × 1 convolution (channel reduction) and 3 × 3 depthwise separable convolution (spatial feature extraction), formulated as

$$ Y_{\mathrm{branch}} = \mathrm{Concat}\big( \mathrm{Conv}_{1\times 1}(X),\ \mathrm{DWConv}_{3\times 3}(X) \big) $$

Cross-stage gradient enhancement: Mitigates gradient vanishing via residual connections that aggregate the original inputs with the dual-branch outputs, expressed as

$$ Y_{\mathrm{out}} = X + Y_{\mathrm{branch}} $$
This design reduces parameters while preserving cross-stage feature interaction continuity.
As shown in Figure 7, the original CBS module (composed of a convolutional layer, batch normalization, and SiLU activation) is redesigned into KCBS as follows: the conventional 3 × 3 convolution is replaced with KANConv while retaining batch normalization (BN) for training stability, strictly preserving input/output channel dimensions, stride, and padding parameters to prevent feature map size alterations (an illustrative code sketch of KCBS and KC2f follows the list below).
Key enhancements to the original bottleneck structure include the following:
Deep feature extraction enhancement: Replacing the two sequential CBS modules with KCBS modules to improve deep feature representation through KANConv’s dynamic nonlinear mappings.
Multi-scale fusion optimization: Retaining 1 × 1 CBS for dimensionality reduction to prevent information loss, while substituting the original bottleneck with Bottleneck_KANConv and adopting KCBS for channel fusion, enabling adaptive nonlinear feature integration.
Structural compatibility assurance: Maintaining identical stacking counts and Split-Concat topology to ensure seamless compatibility with other YOLOv8 components.
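A minimal PyTorch sketch of the KCBS, Bottleneck_KANConv, and KC2f structure described above is shown below. It assumes the hypothetical KANConv2d class sketched in Section 3.2 is available, keeps SiLU alongside BN (the text only states that BN is retained), and uses illustrative channel arithmetic rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# KANConv2d: the spline-based convolution sketched in Section 3.2 (assumed available here).

class KCBS(nn.Module):
    """CBS block with the 3x3 convolution swapped for KANConv; BN (and SiLU) retained."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = KANConv2d(c_in, c_out, k=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class BottleneckKANConv(nn.Module):
    """Bottleneck with two stacked KCBS blocks and a residual shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.block = nn.Sequential(KCBS(c, c), KCBS(c, c))
        self.shortcut = shortcut

    def forward(self, x):
        y = self.block(x)
        return x + y if self.shortcut else y

class KC2f(nn.Module):
    """C2f-KANConv: 1x1 CBS for channel reduction, n KAN bottlenecks,
    Split-Concat aggregation, and KCBS-based channel fusion."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_out // 2
        self.reduce = nn.Sequential(nn.Conv2d(c_in, 2 * c_mid, 1),
                                    nn.BatchNorm2d(2 * c_mid), nn.SiLU())
        self.blocks = nn.ModuleList(BottleneckKANConv(c_mid) for _ in range(n))
        self.fuse = KCBS((2 + n) * c_mid, c_out)

    def forward(self, x):
        y = list(self.reduce(x).chunk(2, dim=1))   # split into two streams
        for m in self.blocks:
            y.append(m(y[-1]))                      # cascade KAN bottlenecks
        return self.fuse(torch.cat(y, dim=1))       # cross-stage fusion
```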
3.4. KANConv-PANet
The conventional CSB module in PANet relies on fixed-weight convolutional kernels, which lack dynamic adaptability to multi-scale target response intensities. In scenarios with significant geometric edge distortions, fixed kernels fail to adaptively fuse cross-resolution features, resulting in degraded object detection accuracy. Additionally, the linear superposition nature of traditional convolutions struggles to model complex nonlinear feature relationships, while PANet's multi-layer feature fusion critically depends on nonlinear representation capabilities. KANConv overcomes these limitations through the following three key innovations:
Higher-order nonlinear mapping: Implements learnable edge-weight functions for higher-order nonlinear spatial transformations, enhancing cross-layer feature interaction efficacy;
Parameter efficiency optimization: While CSP modules reduce parameter redundancy via channel splitting, their parallel branches still employ duplicated fixed kernels, leading to suboptimal parameter utilization; KANConv's sparse spline parameterization avoids this duplication;
Embedded deployment compatibility: Addresses GPU memory bottlenecks caused by multi-scale feature concatenation during PANet’s upsampling phase, reducing bandwidth requirements via sparse storage mechanisms, thereby meeting resource constraints for real-time inference on embedded hardware.
Figure 8 illustrates the KANConv-PAN architecture and its core component KC2f (equivalent to the C2f-KANConv module). KANConv replaces CSB modules in PANet, balancing accuracy and speed through dynamic nonlinear modeling; its co-optimized parameter efficiency and noise robustness prove critical for multi-scale sonar imaging tasks. By replacing fixed kernel parameters in traditional convolutions with B-spline activation functions, KANConv enables the dynamic shape adjustment of activation functions based on input features. This capability is critical for PANet’s multi-scale feature fusion; the adaptive correction of geometric edge distortions and scale variations in sonar imagery significantly enhances cross-resolution feature alignment accuracy.
The technical advantages of KANConv-PANet are as follows:
Local texture sensitivity: When fusing high-level semantic features (small-target detection) and low-level detail features (boundary localization), KANConv’s differentiable B-spline basis functions exhibit higher sensitivity to local texture variations compared to conventional convolutions.
Memory efficiency: The sparse matrix storage of B-spline basis functions reduces memory footprint versus dense parameter matrices in traditional convolutions, effectively alleviating GPU memory pressure during feature concatenation in PANet’s upsampling stages.
Multi-scenario generalization: Through dynamic nonlinear modeling and parameter efficiency optimization, KANConv-PAN achieves an accuracy–speed balance in multi-scale target scenarios.
KANConv-PAN establishes a new paradigm for lightweight real-time detection systems: B-spline-based adaptive kernels effectively suppress acoustic scattering artifacts and target boundary ambiguities, while sparse tensor decomposition in feature fusion pathways maintains computational efficiency, laying a technical foundation for real-time perception in complex scenarios.
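To illustrate how the replacement slots into PANet, the sketch below shows a single top-down fusion stage with the fusion block swapped for the hypothetical KC2f module from Section 3.3; channel sizes, the interpolation mode, and the class name are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# KC2f: the C2f-KANConv block sketched in Section 3.3 (assumed available here).

class KANPANTopDown(nn.Module):
    """One top-down fusion step of a PAN-style neck with KANConv-based fusion."""
    def __init__(self, c_high, c_low, c_out):
        super().__init__()
        self.fuse = KC2f(c_high + c_low, c_out)

    def forward(self, p_high, p_low):
        # Upsample the semantically strong, low-resolution map and concatenate it
        # with the higher-resolution map before spline-based (KC2f) fusion.
        up = F.interpolate(p_high, size=p_low.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([up, p_low], dim=1))
```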
3.5. Loss Function
In sonar image classification tasks, YOLOv8's decoupled head accelerates model convergence by assigning prediction cells through IoU calculations between feature map cells and ground truth bounding boxes. However, the spatial misalignment between optimal cells for classification and regression tasks often leads to degraded task synergy. To address this, YOLOv8 integrates Task Alignment Learning (TAL) [39], which dynamically adjusts positive/negative sample allocation strategies to enhance spatial consistency between classification and regression tasks. The loss function design comprises the following:
Regression branch: Combines Distribution Focal Loss (DFL) with CIoU loss (Equations (12) and (15)) to optimize bounding box probability distribution and geometric alignment accuracy.
Classification branch: Employs Binary Cross-Entropy (BCE, Equation (16)) to strengthen binary discrimination between the foreground and background.
The CIoU loss and its components are defined by Equations (12)–(14):

$$ L_{\mathrm{CIoU}} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v $$

$$ v = \frac{4}{\pi^2}\left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 $$

$$ \alpha = \frac{v}{(1 - IoU) + v} $$

In Equations (12)–(14): $IoU$ is the intersection over union between the predicted and ground truth boxes, measuring the overlap ratio; $\rho(b, b^{gt})$ is the Euclidean distance between the centers of the predicted box $b$ and the ground truth box $b^{gt}$; $c$ is the diagonal length of the smallest enclosing rectangle covering both boxes; $\alpha$ is the weight coefficient balancing aspect ratio consistency and the IoU contribution; $v$ is the aspect ratio consistency term, quantifying the similarity between the predicted and ground truth aspect ratios; $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box; and $w$ and $h$ are the width and height of the predicted box.
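A minimal PyTorch implementation of the CIoU loss in Equations (12)–(14), assuming boxes in (x1, y1, x2, y2) format, could look as follows (a sketch, not the training code used in the paper):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for batches of boxes in (x1, y1, x2, y2) format."""
    # Intersection and union areas.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Center distance rho^2 and enclosing-box diagonal c^2.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)
    enclose_lt = torch.min(pred[:, :2], target[:, :2])
    enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enclose_rb - enclose_lt) ** 2).sum(dim=1) + eps
    # Aspect-ratio consistency term v and its weight alpha.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```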
The Distribution Focal Loss is given by Equation (15):

$$ L_{\mathrm{DFL}} = -\big( (y_{i+1} - y)\log S_i + (y - y_i)\log S_{i+1} \big) $$

In Equation (15): $S_i$ and $S_{i+1}$ are the probability distribution values at adjacent positions, modeling the discrete probability distribution of target center points; $y$ is the continuous coordinate value of the ground truth center; and $y_i$ and $y_{i+1}$ are the discretized grid coordinates closest to $y$.
The classification loss is given by Equation (16):

$$ L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big] $$

In Equation (16): $N$ is the total number of samples; $y_i$ is the ground truth label (0 or 1) for the $i$-th sample; and $p_i$ is the predicted probability of the $i$-th sample belonging to the positive class.
Inspired by YOLACT's segmentation design [40], this approach achieves end-to-end instance segmentation through the dual-branch parallel prediction of prototype masks and instance-level mask coefficients as follows:
Prototype mask generation: Full-image prototype masks are generated on the largest-scale feature maps, preserving high-resolution spatial details.
Mask coefficient prediction: Three-scale feature maps simultaneously output detection boxes, classification scores, and instance-specific mask coefficients.
Mask synthesis: Instance-specific masks are synthesized via linear combination of prototype masks and coefficients, eliminating computational redundancy from traditional two-stage RoIPool while maintaining output resolution and improving segmentation accuracy.
Tailored for sonar imagery characteristics, the segmentation head employs Dice Loss to optimize region overlap as follows:
$$ L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_i g_i + \varepsilon}{\sum_{i} p_i + \sum_{i} g_i + \varepsilon} $$

where $p_i$ is the predicted mask probability of the $i$-th pixel (range [0, 1]); $g_i$ is the ground truth mask label of the $i$-th pixel (0 or 1); and $\varepsilon$ is a smoothing coefficient to prevent division by zero.
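A corresponding sketch of the Dice Loss, assuming per-instance mask probabilities and binary targets of shape (N, H, W):

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Dice loss; eps is the smoothing coefficient preventing division by zero."""
    pred = pred.flatten(1)      # (N, H*W) predicted probabilities in [0, 1]
    target = target.flatten(1)  # (N, H*W) ground truth labels in {0, 1}
    inter = (pred * target).sum(dim=1)
    denom = pred.sum(dim=1) + target.sum(dim=1)
    return 1 - (2 * inter + eps) / (denom + eps)
```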
To achieve the synergistic optimization of detection and segmentation tasks, we design a composite loss function with the following weighted components:
$$ L_{\mathrm{total}} = \lambda_1 L_{\mathrm{CIoU}} + \lambda_2 L_{\mathrm{DFL}} + \lambda_3 L_{\mathrm{BCE}} + \lambda_4 L_{\mathrm{Dice}} $$

where $L_{\mathrm{CIoU}}$ is the bounding box regression loss; $L_{\mathrm{DFL}}$ is the Distribution Focal Loss; $L_{\mathrm{BCE}}$ is the Binary Cross-Entropy loss; and $L_{\mathrm{Dice}}$ is the segmentation overlap loss.
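A sketch of how the four terms might be combined is given below; the weight values are illustrative placeholders rather than the paper's tuned coefficients, and ciou_loss/dice_loss refer to the hypothetical helpers sketched earlier.

```python
import torch.nn.functional as F

def multitask_loss(l_ciou, l_dfl, cls_logits, cls_targets, mask_probs, mask_targets,
                   weights=(7.5, 1.5, 0.5, 1.0)):
    """Weighted sum of the box, DFL, classification, and segmentation terms."""
    l_bce = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    l_dice = dice_loss(mask_probs, mask_targets).mean()
    w1, w2, w3, w4 = weights
    return w1 * l_ciou.mean() + w2 * l_dfl.mean() + w3 * l_bce + w4 * l_dice
```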
Through dynamic weight allocation ($\lambda_1$–$\lambda_4$), this function enables the following:
Detection enhancement: CIoU loss for the precise geometric regression of bounding boxes; DFL loss for optimizing positional probability distributions.
Segmentation refinement: Dice Loss focusing on mask boundary continuity; BCE loss improving foreground/background discrimination.
This loss function design achieves multi-task synergy through weighted coefficients that balance precise detection box regression, localization distribution optimization, classification accuracy enhancement, and segmentation boundary continuity. The framework significantly enhances the network’s capability in multi-scale target detection and edge-sensitive segmentation within complex sonar scenarios, establishing a critical technical foundation for the intelligent interpretation of underwater acoustic imaging.