1. Introduction
The rapid expansion of marine economies and increasingly sophisticated maritime security challenges have intensified demands for advanced sea area monitoring and vessel identification technologies. With the accelerated development of marine wireless sensor networks and edge computing technologies, the construction of efficient real-time maritime wireless surveillance networks has emerged as a critical direction, imposing heightened requirements on the computational efficiency and detection accuracy of ship detection algorithms. Ship target detection serves as a cornerstone technology for marine surveillance systems and demonstrates substantial practical value across maritime traffic management, illegal fishing monitoring, search and rescue operations, marine security maintenance, and ocean resource protection [1]. Nevertheless, marine environments present formidable obstacles, including varying illumination conditions, sea surface reflections, wave interference, foggy weather, intricate backgrounds, and dense vessel distributions. Notably, marine environments encompass abundant symmetrical characteristics: ship targets exhibit distinct structural symmetry, while ocean waves display periodic symmetric patterns, which provide crucial prior information for target detection. However, conventional detection algorithms frequently overlook these symmetrical characteristics and perform inadequately when processing complex interactions between symmetric targets and asymmetric backgrounds, failing to fully exploit the discriminative features within symmetrical information while struggling to meet the real-time requirements of marine wireless network environments.
To precisely characterize the complexity of maritime environments, we define complex environments as scenarios exhibiting three concurrent challenge dimensions that traditional detection methods struggle to address simultaneously. First, the complexity of environmental interference includes atmospheric disturbances such as fog, precipitation, and dynamic changes in illumination, which create time-varying visual occlusions and alterations in surface reflectance. Second, geometric complexity arises from the spatial interaction between vessel structural symmetry and irregular background patterns, in which symmetric ship features become entangled with wave textures, coastal infrastructure, and surface reflections, creating ambiguous feature spaces that confound conventional detection algorithms. Third, scale complexity manifests through dramatic target-size variations within single surveillance frames, ranging from small fishing vessels occupying only a few pixels to large cargo ships spanning significant image portions, each requiring different feature extraction strategies. Consequently, developing efficient ship detection algorithms capable of thoroughly leveraging symmetrical features holds profound significance for the construction of high-performance maritime wireless surveillance networks and in enhancing maritime situational awareness [2].
Traditional ship detection methodologies primarily rely on manually designed feature extractors, achieving detection through the development of ship-specific feature descriptors [3]. Early approaches utilizing HOG, SIFT, and LBP texture features demonstrated reasonable performance in simplified scenarios. Recent advances in multi-feature extraction have shown promising results in remote sensing applications, where researchers have successfully combined texture features extracted via a Gray-Level Co-occurrence Matrix (GLCM) with distance-based similarity measures to construct comprehensive feature representations for dimensionality reduction tasks [4]. However, as image characteristics have become increasingly diverse, the design of handcrafted features has grown progressively complex. Particularly for high-resolution optical remote sensing images with intricate content, conventional detection methods encounter numerous difficulties, including complex marine environments, dense target distributions, and rotational variations, rendering traditional approaches inadequate for the complex scenarios encountered in actual marine environments [5].
Recent years have witnessed breakthrough advances in deep learning in the computer vision domain. Deep learning-based ship detection approaches can be broadly categorized into two classes: two-stage and single-stage detection methods. Two-stage frameworks, such as R-CNN, Fast R-CNN, and Faster R-CNN, have demonstrated superior detection accuracy in marine monitoring applications [6]. Zhang et al. [7] found that Faster R-CNN achieves enhanced precision and recall for ship detection in marine environments, albeit at slower computational speed. Conversely, single-stage methods, including SSD and the YOLO series, have gained widespread attention in practical applications due to their real-time capabilities and computational efficiency [8].
Object detection, one of the fundamental tasks in computer vision, has reached a high level of maturity. In remote sensing imagery and video surveillance, object detection methods have demonstrated excellent real-time performance and precision, offering strong support for ship detection applications [9]. As deep learning has progressed, object detection algorithms have achieved substantial improvements in detection accuracy, speed, and stability, laying a firm foundation for ship detection in intricate scenarios [10]. Consequently, this study adopts an object detection approach, capitalizing on the strengths of deep learning to tackle the difficulties of ship detection in complex maritime settings.
Despite the outstanding performance of YOLO-series algorithms in object detection, ship detection in complex marine environments continues to face three core challenges. First, factors such as sea surface reflections, wave fluctuations, and adverse weather severely degrade image quality, blurring or distorting ship features; detection performance deteriorates particularly under low-light conditions. Second, ship sizes range from several meters to hundreds of meters, and ships frequently appear in dense distributions within port areas, causing YOLO11 to exhibit noticeable missed detections and false positives when handling such extreme scale variations and dense targets. Finally, background elements such as coastal buildings and marine buoys share similar shapes with ships, easily producing false detections, while existing attention mechanisms struggle to effectively distinguish ships from such interference objects.
To address these challenges, we propose YOLO-StarLS, a wavelet transform and multi-scale feature extraction algorithm designed for ship detection in complex environments. Our approach fundamentally reconceptualizes traditional object detection pipelines by strategically integrating frequency-domain analysis with spatial attention mechanisms specifically tailored for maritime surveillance scenarios. The main contributions of this work are outlined as follows:
(1) To address the inefficiency of traditional networks in multi-scale feature extraction, we introduce the Wavelet-based Multi-scale Feature Extraction Network (WMFEN). This network utilizes the inherent frequency–spatial collaborative properties of wavelet decomposition to simultaneously preserve fine-grained vessel details and capture global structural patterns. Unlike conventional backbone architectures that suffer from information loss during downsampling operations, WMFEN employs adaptive Haar wavelet transforms to decompose input features into multiple frequency components. This approach ensures the preservation of critical high-frequency information, which is essential for detecting small-scale maritime targets, while maintaining computational efficiency.
(2) To address the challenge of distinguishing ships from interfering objects in complex backgrounds, we propose the Cross-axis Spatial Attention Refinement (CSAR) mechanism. This mechanism addresses the geometric asymmetry characteristic of ship targets by employing a dual-pathway attention strategy that processes horizontal and vertical spatial dependencies independently. By integrating the star structure topology with cross-axis attention computations, CSAR significantly enhances the model’s spatial perception capabilities for ship targets, effectively mitigating false positives caused by wave patterns, coastal infrastructure, and other maritime interference objects.
(3) To enhance the detail preservation capabilities of detection heads, we design the Efficient Detail-Preserving Detection (EDPD) head. This detection head surpasses the limitations of traditional detection architectures, which depend on separate convolutional branches for multi-scale processing. EDPD employs a shared convolutional structure in conjunction with differential convolution operations, enabling enhanced extraction of ship target edge and texture features through the coordinated fusion of differential convolution and shared convolution structures. This design significantly reduces computational overhead while improving detection precision for vessels with complex geometric configurations.
The remainder of this paper proceeds as follows.
Section 2 reviews related work on object detection algorithms and lightweight model architectures.
Section 3 presents our proposed methodology, detailing the WMFEN backbone, CSAR mechanism, and EDPD head, along with their mathematical formulations.
Section 4 provides a comprehensive experimental evaluation that includes ablation studies and comparative analysis against state-of-the-art methods.
Section 5 concludes with our findings and future research directions.
3. Methodology
The methodology presented in this work was developed to address the fundamental limitations of existing object detection frameworks when applied to maritime surveillance scenarios. Our approach focuses on the development of a comprehensive detection pipeline that systematically tackles the multi-faceted challenges inherent in ship detection tasks. We designed YOLO-StarLS as an integrated framework that combines frequency-domain feature analysis with spatial attention mechanisms specifically engineered to handle the complex interference patterns and scale variations encountered in marine environments. The methodology was structured around three interconnected innovations: a wavelet-based backbone network for enhanced multi-scale feature extraction, a cross-axis attention mechanism for improved spatial discrimination, and an efficient detection head optimized for detail preservation. Each component was designed to work synergistically, creating a robust detection system capable of maintaining high accuracy while meeting the computational constraints typical of real-world maritime applications.
3.1. Network Overview
In this study, we propose the YOLO-StarLS detection framework for ship target detection in complex marine environments. As shown in Figure 1, YOLO-StarLS consists of three core components: the WMFEN, leveraging the frequency-domain symmetry of wavelet transforms and the radial symmetry of star-shaped structures; the CSAR module, employing cross-axis attention mechanisms to strengthen perceptual capabilities for symmetric ship structures; and the EDPD head, adopting a differential convolution design for precise capture of ship-edge symmetry features. This design enables effective processing of symmetric interference from sea surface reflections and symmetric confusion in dense target scenarios, while a lightweight implementation meets the real-time requirements of marine wireless networks. Next, we provide detailed descriptions of the three key innovations.
3.2. Wavelet-Based Multi-Scale Feature Extraction Network (WMFEN)
Traditional YOLO-series object detection algorithms employ backbone networks that primarily rely on simply stacked convolutional layers and residual connections. However, conventional convolutions are inefficient for multi-scale feature extraction, struggling to simultaneously capture both the fine structures and overall contours of ships. Additionally, traditional downsampling operations often result in high-frequency information loss, severely degrading features of small vessels. Furthermore, simplistic feature fusion methods cannot effectively address target interference issues in complex maritime backgrounds. To tackle these challenges, we propose the WMFEN backbone architecture, as illustrated in Figure 2. This architecture effectively resolves the problems of insufficient detection accuracy, blurred vessel edges, and similar-target confusion in complex sea conditions through the introduction of adaptive wavelet transforms and star-shaped feature extraction modules, providing a solid feature foundation for high-precision ship detection.
The core innovations of the WMFEN backbone lie in its unique frequency–spatial collaborative feature extraction mechanism and efficient star-shaped connection topology, as shown in Figure 3. Unlike the linear stacking approach of traditional networks, each stage of this architecture first leverages the symmetry properties of wavelet transforms to decompose and reduce features in the frequency domain through the Adaptive Haar Wavelet Downsampler (AHWD) module, preserving symmetry-discriminative information from different frequency bands. Subsequently, spatial-domain feature refinement is performed through multiple Feature Transformation Networks with Radial Connections (FTNRCs), establishing more flexible feature transformation pathways. Particularly for densely distributed, multi-scale ship targets, WMFEN provides more robust symmetry-aware feature representations, effectively mitigating the negative impacts of background interference and target occlusion.
The AHWD module is the core component in the WMFEN architecture responsible for frequency-domain feature extraction and dimensionality reduction. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, AHWD first applies the Haar wavelet transform to perform two-dimensional discrete wavelet decomposition, yielding a low-frequency approximation component and three directional high-frequency detail components:

$$X_{LL},\; X_{LH},\; X_{HL},\; X_{HH} = \mathrm{DWT}_{\mathrm{Haar}}(X),$$

where $X_{LL}$ represents the low-frequency approximation component preserving the main structure and global morphological information of ship targets; $X_{LH}$, $X_{HL}$, and $X_{HH}$ represent high-frequency detail components in the horizontal, vertical, and diagonal directions, respectively, encoding discriminative features such as edges, textures, and corner points of ships. Subsequently, AHWD concatenates these four components along the channel dimension and applies a 1 × 1 convolution for adaptive feature fusion:

$$Y = f_{1 \times 1}\big(\mathrm{Concat}(X_{LL}, X_{LH}, X_{HL}, X_{HH})\big) \in \mathbb{R}^{C_{out} \times \frac{H}{2} \times \frac{W}{2}},$$

where $\mathrm{Concat}(\cdot)$ denotes the channel-wise concatenation operation, $f_{1 \times 1}(\cdot)$ represents the parameterized 1 × 1 convolution mapping function, and $C_{out}$ is the number of output channels. Through this wavelet-based frequency-domain decomposition and fusion mechanism, the AHWD module can preserve rich multi-scale features while reducing spatial dimensions, providing a more comprehensive information foundation for subsequent feature refinement.
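To make the decomposition concrete, the following PyTorch sketch implements one level of the Haar DWT as a strided depthwise convolution, followed by the 1 × 1 fusion. It is a minimal illustration of the AHWD idea under our own assumptions (class name, kernel realization, bias-free fusion layer), not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AHWD(nn.Module):
    """Minimal sketch of the Adaptive Haar Wavelet Downsampler (AHWD).
    The strided depthwise realization of the Haar DWT and the bias-free
    1x1 fusion layer are illustrative assumptions, not reference code."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Orthonormal 2x2 Haar analysis kernels: LL, LH, HL, HH sub-bands.
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])   # horizontal details
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])   # vertical details
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])   # diagonal details
        self.register_buffer("haar", torch.stack([ll, lh, hl, hh]).unsqueeze(1))
        # 1x1 convolution adaptively fuses the concatenated sub-bands.
        self.fuse = nn.Conv2d(4 * in_channels, out_channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = x.shape[1]
        # Apply the four Haar filters per channel with stride 2: spatial
        # size halves while high-frequency content moves into channels.
        weight = self.haar.repeat(c, 1, 1, 1)               # (4c, 1, 2, 2)
        bands = F.conv2d(x, weight, stride=2, groups=c)     # (b, 4c, h/2, w/2)
        return self.fuse(bands)
```

Applied to a tensor of shape (1, 64, 128, 128), the module returns a (1, out_channels, 64, 64) map in which the three high-frequency sub-bands remain visible to the fusion convolution, rather than being discarded as in strided convolution or pooling.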
The FTNRC serves as the basic building block of WMFEN, employing an innovative star-shaped connection topology and adaptive feature transformation mechanism. As illustrated in Figure 2, the FTNRC processing flow first models symmetry-aware spatial context through depthwise separable convolutions on the input features:

$$X' = \mathrm{BN}\big(W_{dw} \circledast X\big),$$

where $W_{dw}$ represents the convolution kernel for the corresponding channel, $\circledast$ denotes the depthwise convolution operation, and $\mathrm{BN}(\cdot)$ represents batch normalization. Subsequently, FTNRC introduces a dual-path feature transformation and adaptive gating mechanism, implementing dynamic adjustment of feature channel importance:

$$Z = \sigma\big(f_1(X')\big) \odot f_2(X'),$$

where $f_1(\cdot)$ and $f_2(\cdot)$ are feature mapping functions implemented by 1 × 1 convolutions, $\sigma(\cdot)$ represents the ReLU6 activation function, and $\odot$ denotes the Hadamard product. Through this gating mechanism, the network can adaptively enhance the feature representation of key regions of ships while suppressing interference from background and redundant information. Finally, FTNRC integrates the original features with the transformed features through residual connections, forming a richer representation:

$$Y = X + W'_{dw} \circledast g(Z),$$

where $g(\cdot)$ represents a channel dimensionality reduction projection function and $W'_{dw}$ represents the second depthwise separable convolution.
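A minimal sketch of the FTNRC block follows, wiring together the depthwise context step, the ReLU6-gated dual path, and the reduction-plus-residual output described above; the 7 × 7 depthwise kernel and 4× channel expansion are assumed values, not taken from the paper.

```python
import torch
import torch.nn as nn

class FTNRC(nn.Module):
    """Minimal sketch of the Feature Transformation Network with Radial
    Connections (FTNRC); the 7x7 depthwise kernel and 4x expansion are
    assumptions, not values stated in the paper."""

    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        # Depthwise convolution models per-channel spatial context.
        self.dw1 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        # Dual-path 1x1 mappings f1, f2 feeding the ReLU6 gate.
        self.f1 = nn.Conv2d(channels, hidden, 1)
        self.f2 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU6()
        # Channel-reduction projection g and the second depthwise convolution.
        self.g = nn.Conv2d(hidden, channels, 1)
        self.dw2 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.bn(self.dw1(x))
        y = self.act(self.f1(y)) * self.f2(y)   # adaptive gating (Hadamard)
        return x + self.dw2(self.g(y))          # residual integration
```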
The proposed WMFEN backbone architecture, through the organic combination of AHWD modules and FTNRC, achieves three significant innovations: First, wavelet transform-based multi-scale frequency-domain analysis enhances the network’s capability to extract multi-level features of ship targets. Second, the star-shaped connection topology and adaptive gating mechanism optimize feature transformation pathways, improving model robustness to complex sea conditions and lighting variations. Finally, the lightweight design and efficient computation strategy of the overall architecture significantly reduce parameter count and computational complexity, meeting the efficiency requirements of real-time systems while maintaining high detection accuracy.
3.3. Cross-Axis Spatial Attention Refinement (CSAR)
Traditional C3k2 modules serve as core components in YOLO-series networks and demonstrate excellent performance in general object detection tasks. However, they exhibit significant limitations in specialized scenarios such as ship target detection. First, the standard convolution structure of C3k2 cannot effectively exploit the symmetric geometric structures of ship targets and is inefficient at capturing their long-range spatial dependencies, resulting in incomplete feature capture for large vessels and elongated hulls. Second, the lack of effective feature enhancement mechanisms makes it difficult to accurately separate foreground from background under complex sea surface interference. To address these issues, we propose CSAR, which effectively improves detection accuracy and robustness for ship targets in complex marine environments by integrating the symmetric feature transformation capabilities of star structures with the directional feature enhancement capabilities of the Cross-Axis Attention Module (CAAM).
The CSAR module retains the branching architecture concept of CSP networks while significantly improving their feature extraction capabilities. As shown in Figure 4, the module first splits input features into two pathways through a 1 × 1 convolution, then cascades multiple Multi-directional Attention Enhancement Units (MAEUs) in the main pathway for deep feature extraction, while the bypass pathway preserves original features to prevent information loss. Finally, feature fusion integrates multi-pathway information. The mathematical expression of the entire module can be summarized as follows:

$$Y = f_{fuse}\Big(\mathrm{Concat}\big(X_1,\; (M_n \circ M_{n-1} \circ \cdots \circ M_1)(X_2)\big)\Big),$$

where $X_1$ and $X_2$ represent the two pathways after initial feature splitting, $M_i$ denotes the $i$-th MAEU module, and $f_{fuse}(\cdot)$ indicates the final feature fusion layer. This structural design ensures information flow diversity while enhancing feature representation through the nonlinear transformation capabilities of MAEU modules, making it particularly suitable for capturing complex structural features of ship targets. During inference, CSAR significantly improves computational efficiency through parallel computation and feature reuse while maintaining powerful feature extraction capabilities.
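The two-pathway layout can be sketched as follows; the sketch assumes the MAEU class shown after the next paragraphs, and the even channel split and unit count are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CSAR(nn.Module):
    """Minimal sketch of the CSAR two-pathway (CSP-style) layout. Assumes
    the MAEU class sketched later in this section; the even channel split
    and the unit count are illustrative."""

    def __init__(self, in_channels: int, out_channels: int, n_units: int = 2):
        super().__init__()
        mid = out_channels // 2
        # 1x1 convolution produces pathways X1 (bypass) and X2 (main).
        self.split = nn.Conv2d(in_channels, 2 * mid, 1)
        # Cascaded MAEUs refine the main pathway.
        self.units = nn.Sequential(*[MAEU(mid) for _ in range(n_units)])
        # Final fusion layer f_fuse integrates both pathways.
        self.fuse = nn.Conv2d(2 * mid, out_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.split(x).chunk(2, dim=1)
        return self.fuse(torch.cat([x1, self.units(x2)], dim=1))
```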
MAEU serves as the core computational unit of CSAR, integrating spatial feature enhancement and attention mechanisms. The module first applies a $k \times k$ depthwise separable convolution to expand the receptive field, then generates complementary feature mappings through two parallel 1 × 1 convolutions and employs a gating mechanism for feature activation. This process can be expressed as follows:

$$Z = \big(W_1 \ast \mathrm{DWConv}(X)\big) \odot \big(W_2 \ast \mathrm{DWConv}(X)\big),$$

where $\mathrm{DWConv}(\cdot)$ represents the $k \times k$ depthwise convolution operation, $W_1$ and $W_2$ denote two parallel 1 × 1 convolution weight matrices, and $\odot$ indicates the Hadamard product. This gating mechanism enables the module to adaptively suppress irrelevant features and enhance key features, which proves particularly important for the suppression of complex backgrounds in ship detection.

Subsequently, after features are enhanced through the CAAM attention module, they undergo a series of transformations to obtain the final output:

$$Y = X + \mathrm{DropPath}\Big(\mathrm{DWConv}'\big(f_r(\mathrm{CAAM}(Z))\big)\Big),$$

where $\mathrm{CAAM}(\cdot)$ represents the cross-axis attention operation, $f_r(\cdot)$ denotes the feature dimensionality reduction 1 × 1 convolution, $\mathrm{DWConv}'(\cdot)$ indicates the second depthwise convolution, and $\mathrm{DropPath}$ serves as the regularization operation. This design enhances feature expressiveness while maintaining computational efficiency and gradient flow through residual connections and multi-level feature transformations, making it particularly suitable for capturing multi-scale features of ship targets.
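Putting the two equations together, a minimal MAEU sketch looks as follows; it assumes the CAAM class sketched in the next paragraphs, a 7 × 7 depthwise kernel, and a 2× expansion, with `nn.Dropout` standing in for DropPath, which core PyTorch does not provide.

```python
import torch
import torch.nn as nn

class MAEU(nn.Module):
    """Minimal sketch of a Multi-directional Attention Enhancement Unit.
    Assumes the CAAM class sketched below; kernel size, expansion ratio,
    and the Dropout stand-in for DropPath are our assumptions."""

    def __init__(self, channels: int, expand: int = 2, drop: float = 0.0):
        super().__init__()
        hidden = channels * expand
        # Depthwise convolution expands the receptive field cheaply.
        self.dw1 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        # Two parallel 1x1 convolutions W1, W2 for the Hadamard gate.
        self.w1 = nn.Conv2d(channels, hidden, 1)
        self.w2 = nn.Conv2d(channels, hidden, 1)
        self.attn = CAAM(hidden)                      # cross-axis attention
        self.reduce = nn.Conv2d(hidden, channels, 1)  # dimensionality reduction
        self.dw2 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.drop = nn.Dropout(drop)                  # DropPath stand-in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.dw1(x)
        y = self.w1(y) * self.w2(y)                   # gated feature activation
        y = self.dw2(self.reduce(self.attn(y)))
        return x + self.drop(y)                       # residual connection
```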
The CAAM constitutes an efficient attention mechanism designed specifically for ship target aspect-ratio characteristics, focusing on capturing long-range dependencies in horizontal and vertical directions. The CAAM first compresses input features through average pooling, then extracts directional features along horizontal and vertical axes through a series of convolution operations. The entire process can be expressed as follows:

$$Y = X \otimes \sigma\Big(f_2\big(\mathrm{DW}_v\big(\mathrm{DW}_h\big(f_1(\mathrm{AvgPool}(X))\big)\big)\big)\Big),$$

where $\mathrm{AvgPool}(\cdot)$ represents the average pooling operation; $f_1(\cdot)$ and $f_2(\cdot)$ denote the front and rear convolutions for feature transformation, respectively; $\mathrm{DW}_h(\cdot)$ and $\mathrm{DW}_v(\cdot)$ represent depthwise separable convolutions in the horizontal ($1 \times k$) and vertical ($k \times 1$) directions, respectively; $\sigma(\cdot)$ indicates the Sigmoid activation function; and $\otimes$ denotes broadcast multiplication. This design enables the model to simultaneously focus on horizontal symmetric structures and vertical symmetric features of ships, significantly improving symmetry recognition capabilities for ships at different angles and poses. The CAAM effectively captures directional features of ship targets under low computational complexity, providing crucial spatial contextual information for detection.
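A minimal sketch of CAAM follows; the 2 × 2 pooling window, the 11-tap strip kernels, and nearest-neighbor upsampling back to the input resolution are our illustrative choices, since the paper does not state these values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAAM(nn.Module):
    """Minimal sketch of the Cross-Axis Attention Module. Pooling size,
    strip-kernel length, and the upsampling mode are assumptions."""

    def __init__(self, channels: int, strip: int = 11, pool: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(pool)                # compress spatially
        self.front = nn.Conv2d(channels, channels, 1)
        # Strip-shaped depthwise convolutions capture long-range
        # dependencies along each axis independently.
        self.horiz = nn.Conv2d(channels, channels, (1, strip),
                               padding=(0, strip // 2), groups=channels)
        self.vert = nn.Conv2d(channels, channels, (strip, 1),
                              padding=(strip // 2, 0), groups=channels)
        self.rear = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.vert(self.horiz(self.front(self.pool(x))))
        attn = torch.sigmoid(self.rear(y))
        # Restore resolution, then reweight the input (broadcast multiply).
        attn = F.interpolate(attn, size=x.shape[-2:], mode="nearest")
        return x * attn
```

The strip convolutions cost O(k) rather than O(k²) parameters per channel, which is what keeps the long-range, axis-aligned context aggregation cheap.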
The CSAR module successfully addresses multiple challenges of the traditional C3k2 in ship target detection through deep integration of star structures and cross-axis attention mechanisms. It significantly enhances the model's spatial perception and feature extraction capabilities for ship targets, enabling more accurate identification of ships in complex marine environments, with clear advantages in extreme situations such as distant small targets, occluded targets, and illumination changes. Theoretical analysis indicates that CSAR improves feature representation diversity and robustness while maintaining computational efficiency; the introduction of cross-axis attention enables the model to adaptively focus on key features in different directions, effectively suppressing background interference such as sea surface reflections and wave textures.
3.4. Efficient Detail-Preserving Detection (EDPD)
Traditional object detection networks typically employ independent convolutional layers in their detection heads for feature extraction and prediction, leading to parameter redundancy and computational inefficiency. For ship targets characterized by complex structures and intricate edges, conventional detection heads struggle to effectively extract and preserve crucial symmetric contour information, resulting in limited detection accuracy. Moreover, traditional detection heads often rely on independent convolutional branches when processing multi-scale features, lacking effective symmetric feature-sharing mechanisms and failing to capture consistent features of ship targets across different resolutions. To address these limitations, we propose an EDPD head that innovatively integrates differential convolution and shared convolution structures, enhancing the extraction capability for ship target edges and texture features while significantly reducing model parameters and computational complexity, thereby improving both detection accuracy and efficiency. As illustrated in Figure 5, EDPD employs group normalization instead of batch normalization, reducing sensitivity to batch size and making it more suitable for deployment on resource-constrained platforms.
The core innovation of the EDPD detection head lies in its Detail Enhancement Convolution with Group Normalization (DEConv_GN) and shared convolution architecture. The detection head first receives multi-scale feature maps (P3, P4, and P5) from the backbone network and adjusts channel dimensions through 1 × 1 convolutions:

$$F_i' = \mathrm{Conv\_GN}(F_i), \quad i = 1, 2, \ldots, N,$$

where $F_i$ represents the original feature map of the $i$-th detection layer, $F_i'$ denotes the channel-adjusted feature map, $\mathrm{Conv\_GN}$ indicates a convolution operation with group normalization, and $N$ represents the number of detection layers.
Subsequently, feature maps from all scales undergo processing through a shared DEConv_GN module, which can be expressed as a cascaded operation:

$$S_i = \big(\mathrm{DEConv\_GN}_{\theta_2} \circ \mathrm{DEConv\_GN}_{\theta_1}\big)(F_i'),$$

where $\mathrm{DEConv\_GN}_{\theta}(\cdot)$ represents the shared convolution function with parameters $\theta$, shared across all detection layers, and $\circ$ denotes function composition.
The DEConv_GN module constitutes the key innovation of EDPD, integrating five distinct convolution operations: Central Difference convolution (CD), Horizontal Difference convolution (HD), Vertical Difference convolution (VD), Angular Difference convolution (AD), and standard convolution. The mathematical expression is formulated as follows:

$$\mathrm{DEConv\_GN}(X) = \mathrm{GN}\big(F_{CD}(X) + F_{HD}(X) + F_{VD}(X) + F_{AD}(X) + F_{std}(X)\big).$$
This multi-differential convolution fusion employs a symmetric design that enables the network to simultaneously capture edge symmetry and texture symmetry information across different orientations, proving particularly effective for extracting contour symmetry and structural symmetry features of ship targets. Specifically, horizontal and vertical differential convolutions are designed to capture the main symmetric axis features of ships, angular differential convolutions handle symmetry variations at tilted angles, and central differential convolution enhances local symmetry details. Furthermore, the DEConv_GN module utilizes group normalization instead of batch normalization, reducing dependency on batch size and enhancing training stability.
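The sketch below approximates DEConv_GN: the central-difference branch is written out exactly (an ordinary convolution minus its kernel sum applied pointwise), while the horizontal, vertical, and angular difference branches are approximated by plain 3 × 3 convolutions, because their pixel-pair layouts are not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DEConvGN(nn.Module):
    """Minimal sketch of DEConv_GN. Only the central-difference branch is
    exact; HD, VD, AD, and the vanilla branch are stand-in convolutions.
    `channels` must be divisible by `groups` for GroupNorm."""

    def __init__(self, channels: int, groups: int = 16):
        super().__init__()
        self.cd = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # Stand-ins for the HD, VD, AD, and standard branches.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            for _ in range(4))
        self.gn = nn.GroupNorm(groups, channels)

    def central_difference(self, x: torch.Tensor) -> torch.Tensor:
        # CDC: an ordinary convolution minus its kernel sum applied
        # pointwise, so the response measures local intensity differences.
        w = self.cd.weight
        return (F.conv2d(x, w, padding=1)
                - F.conv2d(x, w.sum(dim=(2, 3), keepdim=True)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.central_difference(x)
        for conv in self.branches:        # parallel branch outputs are summed
            y = y + conv(x)
        return self.gn(y)
```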
Following shared convolution processing, feature maps generate final outputs through two parallel convolution branches:

$$B_i = s_i\big(f_{reg}(S_i)\big), \qquad C_i = f_{cls}(S_i),$$

where $f_{reg}(\cdot)$ and $f_{cls}(\cdot)$ represent convolution operations for bounding-box regression and classification prediction, respectively, and $s_i(\cdot)$ denotes the learnable scale factor function for the $i$-th detection layer, balancing prediction contributions across different scales. Finally, predicted bounding-box parameters are decoded through Distribution Focal Loss (DFL), combined with pre-computed anchors and strides to achieve high-precision target localization and classification results.
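The shared-branch layout can be sketched as follows, reusing the DEConvGN sketch above; the input widths, the intermediate width, the 4 × reg_max regression channels, and the six-class output (matching SeaShips) are assumptions following common YOLO practice, not the paper verbatim.

```python
import torch
import torch.nn as nn

class EDPDHead(nn.Module):
    """Minimal sketch of EDPD's shared-branch layout; channel widths and
    output sizes are assumptions, and DEConvGN is the sketch above."""

    def __init__(self, in_channels=(256, 512, 1024), mid=128,
                 num_classes=6, reg_max=16):
        super().__init__()
        # Per-level 1x1 Conv_GN aligns every pyramid level to one width.
        self.align = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, mid, 1), nn.GroupNorm(16, mid))
            for c in in_channels)
        self.shared = DEConvGN(mid)                # weights shared across levels
        self.reg = nn.Conv2d(mid, 4 * reg_max, 1)  # box distributions for DFL
        self.cls = nn.Conv2d(mid, num_classes, 1)
        # Learnable per-level scale rebalances regression magnitudes.
        self.scales = nn.Parameter(torch.ones(len(in_channels)))

    def forward(self, feats):
        outs = []
        for i, (f, align) in enumerate(zip(feats, self.align)):
            y = self.shared(align(f))              # shared DEConv_GN stack
            outs.append((self.scales[i] * self.reg(y), self.cls(y)))
        return outs
```

Because the DEConv_GN stack is shared, adding a pyramid level costs only one 1 × 1 alignment layer and one scalar, which is where the parameter savings over per-level heads come from.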
The introduction of the EDPD detection head significantly enhances YOLO11-based ship target detection performance. Compared to traditional detection heads, EDPD strengthens perception capabilities for ship target edge symmetry and texture symmetry through differential convolution mechanisms, particularly improving contour extraction accuracy under complex sea conditions and low-contrast environments. The shared convolution structure not only reduces the parameter count and computational complexity but also facilitates effective fusion of multi-scale features, improving detection consistency for ship targets of varying sizes. Additionally, the application of group normalization enhances model stability during small-batch training, making deployment on edge computing devices more reliable. Overall, EDPD maintains low computational cost while significantly improving ship target detection accuracy and robustness, providing more dependable technical support for maritime target monitoring and intelligent navigation systems.
5. Conclusions
This work has developed YOLO-StarLS, a deep learning framework that successfully addresses ship detection challenges in complex maritime environments. Our approach, which integrates wavelet-based multi-scale feature extraction, cross-axis spatial attention refinement, and efficient detail-preserving detection mechanisms, achieved superior performance on the SeaShips dataset, surpassing the YOLO11 baseline by 2.21% and 2.42% for mAP50 and mAP50–95, respectively.
The experimental analysis revealed the effectiveness of leveraging the symmetry characteristics inherent in wavelet transformations to handle complex interference patterns, particularly mirror symmetry from sea surface reflections and periodic wave textures. Our cross-axis attention mechanism proved capable of enhancing spatial perception of vessel symmetry features while suppressing symmetric background noise. Additionally, the lightweight architecture achieved substantial efficiency improvements, reducing parameters by 36% to 1.67 M and computational overhead by 34% to 4.3 GFLOPs, making real-time deployment feasible for maritime wireless networks. Performance comparisons with eight contemporary algorithms confirmed the superior detection capabilities of our approach, and ablation studies provided empirical support for the symmetry-based design principles underlying the framework. The results indicate that YOLO-StarLS constitutes a practical solution for maritime surveillance, autonomous navigation, and oceanic security systems.
Future work will investigate the generalization potential of symmetry-based features across varied maritime conditions and explore extensions to multimodal fusion architectures for comprehensive wireless network monitoring applications.