Article

OWTDNet: A Novel CNN-Mamba Fusion Network for Offshore Wind Turbine Detection in High-Resolution Remote Sensing Images

1 Longyuan Qidong Wind Power Generation Co., Ltd., Nantong 226236, China
2 Shanghai Enshu Data Technology Co., Ltd., Shanghai 200335, China
3 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(11), 2124; https://doi.org/10.3390/jmse13112124
Submission received: 23 September 2025 / Revised: 10 October 2025 / Accepted: 11 October 2025 / Published: 10 November 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Real-time monitoring of offshore wind turbines (OWTs) through satellite remote sensing imagery is considered an essential process for large-scale infrastructure surveillance in ocean engineering. Current detection systems, however, are constrained by persistent technical limitations, including prohibitive deployment costs, insufficient discriminative power for learned features, and susceptibility to environmental interference. To address these challenges, a dual-branch architecture named OWTDNet is proposed, which integrates global contextual modeling via State Space Models (SSMs) with CNN-based local feature extraction for high-resolution OWTs detection. The primary branch utilizes a Mamba-structured encoder with linear computational complexity to establish long-range spatial dependencies, while an auxiliary Blurring-MobileNetv3 (B-Mv3) branch is designed to compensate for the local feature extraction deficiencies inherent in SSMs. Additionally, a novel Feature Alignment Module (FAM) is introduced to systematically coordinate cross-modal feature fusion between Mamba and CNN branches through channel-wise recalibration and position-aware alignment mechanisms. This module not only enables complementary feature integration but also enhances turbine-specific responses through attention-driven feature modulation. Comprehensive experimental validation demonstrated the superiority of the proposed framework, achieving a mean average precision (AP) of 47.1% on 40,000 × 40,000-pixel satellite imagery, while maintaining practical computational efficiency (127.7 s per image processing time).

1. Introduction

Real-time detection and monitoring of offshore wind turbines are crucial for identifying damage to marine infrastructure, enabling timely maintenance, and minimizing economic losses. Additionally, such monitoring systems allow regulatory personnel to maintain real-time control of offshore engineering layouts, providing a solid foundation for enforcing laws against illegal activities such as overbuilding, unauthorized construction, and disorderly development.
Satellite remote sensing technology offers significant advantages in marine monitoring systems, including extensive coverage, real-time capabilities, and strong objectivity, enabling the rapid acquisition of high-precision ocean monitoring data in critical sea areas, especially for wind turbines. In recent years, the marine community has increasingly emphasized the importance of marine resource supervision through satellite remote sensing [1,2,3]. However, the complex and variable nature of ocean environments poses significant challenges to the expertise required of supervisory personnel. Additionally, the immense resolution of remote sensing images, often exceeding 40,000 × 40,000 pixels, significantly increases the workload of staff compared to surveillance cameras.
With the successful application of deep learning methods across various domains, researchers have focused on utilizing object detection techniques in marine surveillance. A survey of recent deep-learning-driven object detection methods is provided in [4,5]. These methods can be categorized into Oriented Bounding Box (OBB) [6,7,8] and Horizontal Bounding Box (HBB) [9,10,11] techniques, based on the necessity to distinguish the target’s direction, such as for ships and planes. HBB methods focus on predicting the center point, length, and width of objects, while OBB methods require an additional rotation angle. In our detection task, the shape feature of OWTs equipment negates the necessity for angle information, as illustrated in Figure 1a,b. Therefore, this article focuses on discussing the application value of HBB methods in our OWTs monitoring task.
In the realm of ocean engineering applications, HBB techniques can be categorized into Convolutional Neural Network (CNN)-based and Transformer-based methodologies [12,13,14]. CNN-based methods offer advantages in terms of a higher inference speed and lower deployment costs, which are crucial for detection tasks in actual application processes. However, these methods are prone to complex marine environmental interference, as evidenced in Figure 1c,d, posing challenges in achieving optimal performance in applications. In contrast, Transformer-based approaches excel in modeling long-range dependencies, exhibit strong feature learning, and possess robust anti-interference capabilities, achieving superior performance. However, the computational complexity of Transformers increases quadratically with the image size, due to the self-attention mechanism. This significant computational workload hinders their widespread application in marine monitoring systems that require speed and cost efficiency. In the context of the OWTs detection task, we encounter the following challenges: (1) achieving an optimal trade-off between speed, performance, and deployment costs that satisfies the demands of monitoring applications; (2) enhancing the network’s anti-interference ability against complex marine environmental conditions; and (3) strengthening the feature modeling capabilities for OWTs under small and weak features. In this paper, a novel object detection network called OWTDNet is proposed, specifically designed for OWTs detection, which simultaneously integrates Mamba and CNN structures in the encoding stage.
To address the first challenge, the Mamba [15] architecture is adopted as the backbone network of our framework. Through integration of SSMs and an optimized GPU hardware implementation, long-range dependency modeling is achieved with linear computational complexity, effectively balancing operational efficiency and detection performance. For the second and third challenges, a CNN-based auxiliary feature encoding branch is introduced to enhance multi-scale representation learning in marine remote sensing imagery. A lightweight Blurring-MobileNetv3 (B-Mv3) network is proposed to strengthen local feature discriminability, while maintaining translation invariance during sliding window operations over large-scale satellite images. To reconcile the complementary information between Mamba-based global modeling and CNN-based local processing, a novel Feature Alignment Module (FAM) is developed. This module systematically resolves feature discrepancies through dual-path processing of channel-wise and spatial-wise alignment. Channel dimension harmonization is accomplished through global context aggregation and attention-based feature recalibration, while spatial alignment is achieved via multi-scale context aggregation using dilated convolutional operators with varying receptive fields. Subsequently, position-aware attention weighting is applied to amplify region-specific responses corresponding to marine infrastructure targets. Further enhancement is implemented through a dedicated feature refinement stage, where max pooling operations are employed to extract channel-position joint saliency maps, prioritizing discriminative patterns of OWTs, while suppressing background interference. Our proposed framework demonstrated superior detection capability, achieving a mean average precision (AP) of 47.1% on 40,000 × 40,000-pixel marine remote sensing imagery with a practical computational efficiency of 127.7 s per image processing time when implemented in an NVIDIA RTX 3090 GPU workstation configuration.
In summary, the main contributions of our OWTDNet for OWTs surveillance in remote sensing images are as follows:
(1) We propose a novel offshore wind turbine detection network called OWTDNet, which synergistically integrates Mamba and CNNs as feature encoding backbones.
(2) An auxiliary lightweight encoding branch termed B-Mv3 is introduced to enrich the local feature information related to wind turbine targets.
(3) We introduce a feature alignment module to mitigate disparities between Mamba and CNN features, focusing on both channel and position dimensions.
(4) A feature enhancement approach is proposed to amplify the response weights of small-sized OWTs targets against backgrounds and interferences.
(5) Extensive experimental results demonstrate that our OWTDNet achieves an optimal balance between performance and cost, and the proposed algorithm has been successfully applied in a real-world ocean monitoring system.

2. Related Works

2.1. CNN-Based Object Detection

In CNN architectures, detectors can be classified into single-stage and two-stage approaches. Single-stage approaches excel in achieving higher speeds, while two-stage approaches prioritize performance. R-CNN [16], the pioneer of two-stage detectors, introduced selective search [17] to extract region proposals from input images. Subsequently, Fast R-CNN [18] and Faster R-CNN [19] were proposed to enhance selection speed, achieving simultaneous improvements in both speed and performance by replacing selective search with a region proposal network (RPN). More recently, a novel detector backbone named ConvNeXt [20] has garnered significant attention. This method is purely based on a convolutional structure and has achieved remarkable results by optimizing the design space of ResNet [21] and introducing diverse training techniques.
The YOLO series, a notable exemplar of a single-stage detector framework, has achieved remarkable success across diverse fields. Unlike two-stage detectors, single-stage detectors directly predict the object location from the input image without the RPN process, reducing computational overhead. The YOLO approach initially involves partitioning the input image into numerous overlapping anchor boxes with different predefined sizes. It then predicts the probability and location offset within these anchor regions. Redundant detection boxes are subsequently eliminated using the Non-Maximum Suppression (NMS) method. Building upon these design principles, YOLOv1 [22] was introduced, achieving the fastest inference speed of its time. To enhance prediction accuracy, YOLOv2 [23] incorporated batch normalization, a high-resolution classifier, and an anchor box generation method based on k-means clustering. YOLOv3 [24] further built upon this foundation by introducing concepts such as data augmentation, multi-scale training, and independent classification heads. With the integration of attention mechanisms and multi-scale features, the YOLO series has evolved into diverse versions, including YOLOv5 [25], YOLOF [26], YOLOX [27], YOLOv7 [28], and YOLOv8 [29]. In [30], Park introduced YOLOv3 and YOLOv5s for ship detection, proving that YOLOv5s can achieve better performance than YOLOv3.
In contrast to the improvements in the YOLO series, RetinaNet [31] emphasized optimizing the detector’s loss function. Specifically, it introduced a focal loss function designed to differentiate between difficult and easy samples during the training stage. The focal loss enhances the model’s learning capability for challenging samples through a penalty mechanism, ultimately improving the network’s robustness. Concurrently, numerous approaches have been proposed to refine the box regression loss, such as SIoU [32], WIoU [33], GIoU [34], DIoU [35], and CIoU [35]. Additionally, EfficientDet [36] introduced the Bidirectional Feature Pyramid Network (BiFPN) as an alternative to the conventional FPN structure, aiming to efficiently fuse multi-scale features with learned weights.
In practical applications, CNN-based models can achieve faster running speeds and lower deployment costs. However, in the face of complex marine environments, CNN-based models are prone to serious false detection problems, such as objects with highly similar features to OWTs. This is mainly because the CNN structure only focuses on the responses of the local region and ignores the global information of the surrounding background, resulting in the problem of high recall and low precision in CNN-based models.

2.2. Transformer-Based Object Detection

The Transformer method, initially introduced in the realm of natural language processing (NLP), has demonstrated state-of-the-art performance compared to other methods. Like the encoding–decoding structure of CNNs, the Transformer approach can replace the CNN encoding stage with a self-attention mechanism. However, the complexity of attention computation scales quadratically with the image size, posing a limitation in computer vision applications. To mitigate this challenge, researchers partition the input image into non-overlapping patches and independently compute the correlation within each patch image.
The Detection Transformer (DETR) [37] pioneered the application of the Transformer approach in computer vision by modeling features derived from a CNN backbone. However, the original DETR structure faced challenges such as slow convergence and limited feature spatial resolution. To address these issues, deformable DETR [38] was proposed, which focuses on a small set of key sampling points around reference points. This approach significantly improves performance, requiring only 10% of the training epochs compared to DETR. Subsequent works, such as conditional DETR [39] and DAB-DETR [40], were introduced to further address the slow training convergence observed in DETR. More recently, a novel end-to-end object detector named DETR with improved denoising anchor boxes (DINO) [41] was presented. This method enhances the performance and efficiency of previous DETR-like models through a contrastive denoising training approach, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. Consequently, DINO has achieved the current best detection performance, demonstrating the effectiveness of the Transformer approach in different object detection tasks.
Transformer-based architectures exhibit superior spatial feature extraction capabilities compared to conventional CNN structures, enabling enhanced detection performance through their ability to capture long-range dependencies. However, this advantage comes at the cost of quadratic computational complexity relative to input image dimensions, resulting in two critical implementation challenges. The inference speed limitations cause significant latency when processing high-resolution remote-sensing imagery. The resource-intensive computation demands prohibitively high hardware requirements. These constraints collectively impede the practical deployment of transformer models in the OWTs detection task.

2.3. Mamba-Based Object Detection

The inherent locality of convolution operations in CNN architectures limits their ability to model long-distance features, necessitating the employment of self-attention methods. However, the immense computational complexity of Transformer methods poses challenges for practical applications. To address this gap, Gu et al. [15] introduced the Mamba method, which leverages SSMs to reduce the computational complexity of capturing long dependencies from quadratic to linear. The Mamba model has emerged as a strong competitor to the Transformer method due to its superior performance and faster speed.
Building on this foundation, Wang et al. [42] proposed an object detection model based on Mamba and YOLO structures, named Mamba-YOLO. Mamba-YOLO introduces an LSBlock and RGBlock, enabling more precise capture of local image dependencies and significantly enhancing the robustness of the model. Zhou et al. [43] nested multiple Mamba structures and achieved state-of-the-art results in image classification tasks. Dong et al. [44] extended the application of Mamba to different types of infrared images by introducing a Fusion Mamba block, which reduces differences between cross-modal features and enhances the representation consistency of the fused features. Meanwhile, Zhu et al. [45] presented the VMamba framework, which outperformed Transformer methods in various tasks such as image classification, object detection, and semantic segmentation.
In the field of remote sensing, Xie et al. [46] introduced RSWDet, a neural network-based detector for wind turbines in remote sensing imagery under complex distribution scenarios. RS-Mamba [47] has demonstrated superior performance compared to both CNNs and Transformers in dense prediction tasks. Additionally, the integration of Mamba with UNet [48] structures, named CM-UNet [49], has shown superiority in remote sensing semantic segmentation tasks.

3. Materials and Methods

The OWTs detection pipeline comprises the following sequential processing stages: (1) Image Input: Acquisition of remote sensing imagery for subsequent analysis. (2) Image Splitting: Partitioning of the large-scale image into multiple independent detection windows. (3) Window-level Detection: Application of an object detection model (OWTDNet) to identify OWTs within each window. (4) Result Integration: Aggregation of individual detection outputs with coordinate regression for bounding box localization. (5) Final Output: Generation of comprehensive OWTs detection results, including precise bounding box coordinates.
The proposed OWTDNet architecture, as illustrated in Figure 2, consists of three sequentially connected components: an image preprocessing module, a core detection network, and a result aggregation module. Due to the extreme spatial dimensions of satellite-acquired marine imagery, direct processing of raw input data through the detection network is rendered computationally infeasible. In the developed offshore wind turbine monitoring framework, a preprocessing module is implemented to systematically tile the input imagery into standardized 1024 × 1024 pixel patches using an optimized grid-based partitioning protocol. These spatially normalized patches are subsequently processed through the detection network in batch mode to ensure hardware resource optimization. Following detection, a post-processing module is employed to aggregate localized predictions through coordinate remapping and non-maximum suppression (NMS) operations, ultimately generating the final detection results under the original input resolution.
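To make the pipeline concrete, the sketch below illustrates the grid tiling and coordinate remapping described above. It is a minimal illustration, not the authors' implementation: the function names are invented, the 1024 × 1024 window size follows the text, and the non-overlapping grid corresponds to the splitting strategy adopted in Section 3.1.

```python
# Minimal sketch (not the authors' code) of the pre- and post-processing steps described
# above: grid tiling into 1024 x 1024 patches and remapping window-level boxes back to
# full-image coordinates. Function names are illustrative.
import numpy as np

def split_into_patches(image: np.ndarray, patch: int = 1024):
    """Partition an H x W x C image into non-overlapping tiles (splitting without a gap)."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tiles.append(((x, y), image[y:y + patch, x:x + patch]))
    return tiles

def remap_boxes(patch_boxes, origin):
    """Shift window-level boxes [x1, y1, x2, y2, score] back to full-image coordinates."""
    ox, oy = origin
    return [[x1 + ox, y1 + oy, x2 + ox, y2 + oy, s] for x1, y1, x2, y2, s in patch_boxes]

# Usage: run the detector on each tile, remap its boxes with the tile origin, concatenate
# all boxes, and apply NMS once at the original resolution to obtain the final results.
```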

3.1. Pre-Processing of High-Resolution Remote Images

In the realm of ocean engineering, processing satellite remote sensing images often poses challenges due to their immense size, often exceeding 40,000 × 40,000 pixels. To facilitate efficient analysis, it is essential to partition these images into smaller, fixed-sized patches. As depicted in Figure 3, two strategies for image splitting are available: (1) splitting with a gap, and (2) splitting without a gap. Table 1 offers an overview of the advantages and disadvantages of these strategies. For our OWTs detection system, we have chosen Strategy 2 as the pre-processing approach. A more detailed discussion of this choice will be presented in Section 4.4.1.

3.2. Review of Mamba

A comparative analysis of object detection methodologies is presented in Figure 4 using the COCO benchmark dataset, where YOLO-series and CNN-based detectors are observed to demonstrate superior inference speeds, while Transformer-based methods achieve enhanced detection accuracy. The computational overhead arises from fundamental architectural differences: global contextual relationships are captured by Transformers through pairwise pixel interactions, whereas localized receptive field processing is exclusively employed in convolutional neural networks. Despite achieving state-of-the-art detection performance (55.3% AP on COCO val2017), Transformer-based approaches are constrained by quadratic computational complexity and memory-intensive operations, rendering them suboptimal for real-time marine monitoring scenarios requiring processing of 40,000 × 40,000-pixel satellite imagery.
To reconcile this efficiency–performance trade-off while preserving global modeling capabilities, the Vision Mamba architecture (VMamba) [50] has been introduced by researchers, leveraging SSMs with linear time complexity for long-range dependency capture. The divergent architectural approaches of Mamba and Transformer-based global information modeling are visually contrasted in Figure 5 through feature activation maps and computational graph representations.
Mamba calculates the correlation between different patch partitions in a manner distinct from Transformers. Transformers compute the correlation between the current patch and all other patches, resulting in a computational complexity of $O(N^2)$. In contrast, Mamba uses a combination of forward and backward scanning techniques to achieve this with a complexity of $O(N)$. Specifically, Mamba implements this process with the VSS block, whose structure is illustrated in Figure 6.
Assuming an input feature tensor $f_{in} \in \mathbb{R}^{B \times H \times W \times C}$, where $B$ represents the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width, respectively, VMamba initially employs Layer Normalization (LN) to normalize the information before forwarding it to the 2D Selective Scan module (SS2D). Within the SS2D block, the input feature is initially separated into a gate variable $f_{gate} \in \mathbb{R}^{B \times H \times W \times C}$ and a data variable $f_{data} \in \mathbb{R}^{B \times H \times W \times C}$ through a linear operation. This separation is executed as follows:

$$f_{norm} = \mathrm{LN}(f_{in})$$

$$f_{data} = \mathrm{Linear}(f_{norm})$$

$$f_{gate} = \mathrm{Linear}(f_{norm})$$

Subsequently, a depth-wise convolution with SiLU activation is applied on $f_{data}$ before conducting forward and backward scanning. Specifically,

$$f_{dw} = \mathrm{SiLU}(\mathrm{DW}(f_{data}))$$

where DW represents the depth-wise convolution operation. $f_{dw}$ is then fed into the SS2D block for correlation calculation and feature fusion $y_{fusion}$ between different patches, considering both the forward and backward dimensions. The SS2D block obtains the corresponding outputs $y_{forward} \in \mathbb{R}^{B \times H \times W \times C}$ and $y_{backward} \in \mathbb{R}^{B \times H \times W \times C}$, respectively:

$$y_{forward} = \mathrm{SS2D}_{forward}(f_{dw})$$

$$y_{backward} = \mathrm{SS2D}_{backward}(f_{dw})$$

$$y_{fusion} = y_{forward} + y_{backward}$$
Afterwards, the gating variable $f_{gate}$ is multiplied with $y_{fusion}$ to modulate the impact of the input variables on the current state $y_{gate}$. The resulting state is then normalized, projected through a linear layer, and fused with the input information $f_{in}$ via a residual connection, yielding the output of the SS2D module, $y_{out}$:

$$y_{gate} = y_{fusion} \times f_{gate}$$

$$y_{out} = \mathrm{Linear}(\mathrm{LN}(y_{gate})) + f_{in}$$
Through the SS2D module, the correlation between different patches can be effectively modeled. Finally, the VSS block utilizes a Feed Forward Network (FFN) to perform non-linear mapping and feature extraction. As shown in Figure 4, VMamba achieves synchronous improvements in performance and speed.
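The data flow summarized by the equations above can be sketched in PyTorch as follows. This is a schematic, assuming the forward and backward selective scans are available as black-box modules; the layer names, the residual placement, and the tensor layout are illustrative rather than taken from the authors' code, and the subsequent FFN of the VSS block is omitted.

```python
# Schematic PyTorch sketch of the VSS-block data flow given by the equations above.
# The internals of the selective scan (SS2D forward/backward passes) are treated as
# black boxes here; module names and shapes are illustrative, not the authors' code.
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    def __init__(self, channels: int, scan_forward: nn.Module, scan_backward: nn.Module):
        super().__init__()
        self.norm_in = nn.LayerNorm(channels)
        self.to_data = nn.Linear(channels, channels)   # produces f_data
        self.to_gate = nn.Linear(channels, channels)   # produces f_gate
        self.dw_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.act = nn.SiLU()
        self.scan_fwd = scan_forward                   # SS2D forward scan (assumed given)
        self.scan_bwd = scan_backward                  # SS2D backward scan (assumed given)
        self.norm_out = nn.LayerNorm(channels)
        self.proj_out = nn.Linear(channels, channels)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:      # f_in: (B, H, W, C)
        f_norm = self.norm_in(f_in)
        f_data, f_gate = self.to_data(f_norm), self.to_gate(f_norm)
        x = f_data.permute(0, 3, 1, 2)                           # to (B, C, H, W) for conv
        f_dw = self.act(self.dw_conv(x)).permute(0, 2, 3, 1)     # back to (B, H, W, C)
        y_fusion = self.scan_fwd(f_dw) + self.scan_bwd(f_dw)     # y_forward + y_backward
        y_gate = y_fusion * f_gate                               # gated state
        return self.proj_out(self.norm_out(y_gate)) + f_in       # residual output y_out
        # (the FFN that follows in the VSS block is omitted from this sketch)
```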

3.3. CNN Encoding Network

By splitting the image into fixed-sized patches, Mamba-based methods achieve global feature modeling on the image dimension, enhancing the model’s learning capabilities. However, these approaches exhibit two limitations in OWTs detection tasks. Firstly, the patch division approach can cause targets to span multiple patches, resulting in the dissociation of object features and local information. Additionally, since the current detected image is already a segment of the original remote sensing image, further patch division may increase the likelihood of feature fragmentation, leading to missed detections. As exemplified in Figure 7, a comparison between the CNN method and the Mamba approach can illustrate these issues. While Transformer or Mamba detectors can partially address this problem through patch merging operations, the lack of local feature modeling ability remains a hindrance to their application in remote sensing detection tasks.
To enable comprehensive feature representation in marine infrastructure monitoring systems, a parallel CNN encoding branch is integrated with the Mamba-based processing pathway. While conventional convolutional neural networks employ localized filter operations for feature extraction, spatial resolution reduction is typically achieved through stride-based downsampling or max pooling layers. This architectural characteristic has been observed to degrade translational invariance in sliding window detection frameworks, particularly when analyzing satellite imagery with multi-scale OWTs. The inherent inductive bias of convolutional operations, though effective for local pattern recognition, introduces sensitivity to minor spatial displacements in marine infrastructure targets. Such translational variance becomes especially problematic when processing tiled image patches through conventional CNN architectures, where positional discrepancies as small as 5–10 pixels between adjacent patches have been shown to cause huge reductions in detection performance in controlled experiments [51].
In the context of OWTs training and prediction stages, the sliding window splitting method causes significant left-right shifts for the same target, exacerbating the inherent uncertainty of the original CNN structure and leading to fluctuations in detection scores during the prediction phase. Although data augmentation can mitigate this issue, it does not address the inherent limitations of the CNN structure. In this paper, we propose a new lightweight encoding network, termed B-Mv3, to enhance the robustness of the CNN structure for translational transformations with minimal computation overhead. Its structure is presented in Figure 8.
B-Mv3 is composed of convolutional blocks, blurring convolution modules, and Squeeze-and-Excitation (SE) [52] modules. Given an input feature map denoted as $f \in \mathbb{R}^{B \times C \times H \times W}$, B-Mv3 initially employs a convolutional operation with a kernel size of one and a stride of one to perform channel information fusion. Subsequently, B-Mv3 incorporates two selective branches, contingent on whether downsampling is necessary during this stage. If the stride is two, the network utilizes blurring convolution modules to facilitate shift-equivariance and invariance learning. This process can be summarized as follows:

$$f_{conv} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}(f))) \in \mathbb{R}^{B \times C \times H \times W}$$

$$f_{conv2} = \begin{cases} \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}(f_{conv}))), & \text{if } stride = 1 \\ \mathrm{BlurringConv}(f_{conv}), & \text{if } stride = 2 \end{cases}$$

$$\mathrm{ReLU}(x) = \max(0, x)$$
In the blurring convolution block, we start by using ReflectionPad2d and MaxPooling with a stride of one to determine the maximum response of the input feature, ensuring that the resulting feature map maintains the same dimensions as the input feature. Next, we compute the input weights for the subsequent convolution operation using Formula (13). Finally, we apply group convolution with a stride of two to calculate the responses across different channel groups. These processes can be summarized as follows:
$$f_{max} = \mathrm{MaxPooling}(\mathrm{ReflectionPad2d}(f_{conv}))$$

$$f_{kernel} = \frac{f_{i,j}}{\sum_{i=0}^{H}\sum_{j=0}^{W} f_{i,j}}, \quad i \in H,\ j \in W$$

$$f_{conv2} = \mathrm{Conv2d}(f_{max},\ \mathrm{weight} = f_{kernel},\ \mathrm{stride} = 2,\ \mathrm{groups} = C)$$
The downsampled feature map generated from the blur pooling block can achieve shift invariance by employing dynamic kernel convolution with cross steps. Furthermore, we integrate a lightweight channel attention mechanism using the SE module to enhance the model’s feature representation and anti-interference capabilities. The SE module enables the CNN network to prioritize salient channels, while suppressing irrelevant information.
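The following PyTorch sketch conveys the idea of the blurring convolution block under stated simplifications: it applies ReflectionPad2d and stride-one max pooling, then downsamples with a normalized low-pass kernel applied as a grouped, stride-two convolution. The paper computes the kernel weights dynamically from the feature map (Formula (13)); the fixed normalized kernel used here is an assumption for brevity, so this approximates the block rather than reproducing it.

```python
# Approximate sketch of an anti-aliased (blurring) downsampling step in the spirit of the
# block described above. The fixed normalized kernel is a simplification; the paper derives
# the kernel weights from the feature map itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurringConvSketch(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.channels = channels
        self.pad = nn.ReflectionPad2d(k // 2)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1)
        kernel = torch.ones(k, k) / (k * k)                      # weights normalized to sum to 1
        self.register_buffer("kernel", kernel.expand(channels, 1, k, k).contiguous())

    def forward(self, f_conv: torch.Tensor) -> torch.Tensor:     # (B, C, H, W)
        f_max = self.pool(self.pad(f_conv))                      # max response, same spatial size
        # Grouped stride-2 convolution with the normalized kernel -> shift-robust downsampling
        return F.conv2d(f_max, self.kernel, stride=2, groups=self.channels,
                        padding=self.kernel.shape[-1] // 2)
```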

3.4. Feature Alignment Module

3.4.1. Channel Alignment Process

In our OWTs detection system, global contextual relationships are modeled by the Mamba processing branch, while localized structural patterns are extracted through the parallel CNN pathway. To address the inherent divergence between these different feature representations, a dual-stage feature alignment module is proposed for cross-model integration, and its structure is presented in Figure 9.
The channel alignment module consists of two key submodules: channel alignment and channel enhancement. The channel alignment submodule is designed to merge information from Mamba and CNN features, while the channel enhancement submodule amplifies the responses of weaker features across all channels, thereby enhancing the model’s capturing ability.
Given input features $f_{mamba} \in \mathbb{R}^{B \times C \times H \times W}$ and $f_{cnn} \in \mathbb{R}^{B \times C \times H \times W}$, the feature alignment process proceeds as follows: Initially, AdaptiveAvgPooling is applied to both $f_{mamba}$ and $f_{cnn}$, transforming them into a $1 \times 1 \times C$ format to remove positional influence on the channel alignment process. Subsequently, point-wise convolution independently integrates channel information, and the results are combined through an addition operation to produce $f_{fuse}^{ch}$. To dynamically adjust the contributions of the Mamba and CNN branches, a sigmoid function is introduced to compute response weights for $f_{fuse}^{ch}$. Finally, these weights are multiplied with the Mamba input feature, yielding the channel alignment output $f_{align}^{ch}$.
As illustrated in Figure 10, water surface turbulence features in marine environments were misclassified as wind turbine components by the Mamba processing branch due to excessive reliance on global contextual dependencies. This phenomenon is attributed to the Mamba architecture’s inherent bias toward modeling long-range spatial correlations at the expense of local discriminative features, particularly when wind energy infrastructure appears in adjacent maritime regions. The misidentification challenge was effectively mitigated through synergistic integration with CNN-derived local pattern representations, which leverage convolutional operations to capture local texture distinctions between transient water splashes and OWTs.
To address the low signal-to-noise characteristics of OWTs in marine environments, a channel enhancement submodule was developed for amplifying weakly responsive features in remote sensing data. The module architecture begins with adaptive max pooling operations applied along the channel dimension to extract peak activation patterns corresponding to OWTs. These extremal features are subsequently normalized through sigmoidal attention weighting, where channel-wise importance coefficients are computed to suppress noise from interferences. The derived attention map is then element-wise multiplied with the channel-aligned feature tensor $f_{align}^{ch}$ to produce recalibrated representations.
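A minimal PyTorch sketch of the channel alignment and channel enhancement steps, written directly from the description above, is given below; the module and variable names are illustrative, and the pooling and convolution choices are assumptions consistent with the text rather than the released implementation.

```python
# Hedged sketch of channel alignment (pooled channel statistics fused across branches,
# sigmoid re-weighting of the Mamba feature) followed by channel enhancement (peak
# responses amplified via sigmoid attention). Names are illustrative.
import torch
import torch.nn as nn

class ChannelAlignSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # remove positional influence
        self.pw_mamba = nn.Conv2d(channels, channels, 1)   # point-wise fusion, Mamba branch
        self.pw_cnn = nn.Conv2d(channels, channels, 1)     # point-wise fusion, CNN branch
        self.gmp = nn.AdaptiveMaxPool2d(1)                 # peak activations for enhancement

    def forward(self, f_mamba: torch.Tensor, f_cnn: torch.Tensor) -> torch.Tensor:
        # Channel alignment: fuse pooled channel statistics and re-weight the Mamba feature.
        f_fuse_ch = self.pw_mamba(self.gap(f_mamba)) + self.pw_cnn(self.gap(f_cnn))
        f_align_ch = torch.sigmoid(f_fuse_ch) * f_mamba
        # Channel enhancement: amplify weakly responding channels via their peak response.
        weights = torch.sigmoid(self.gmp(f_align_ch))
        return weights * f_align_ch
```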

3.4.2. Positional Alignment Process

Due to the limited receptive field of convolutional operations, CNNs struggle to model long-distance correlations. Consequently, directly fusing CNN features with Mamba features from the perspective of the positional dimension can create competition between these two branches, resulting in degraded performance. To address this issue, we introduce dilated convolutions to enlarge the receptive field of the CNN branch before fusion, enabling it to capture more long-distance features and reducing the discrepancy with Mamba features from the positional dimension.
Initially, we apply multi-scale modeling to the input CNN features using convolutions with dilations of 1, 3, and 5, generating $f_{cnn}^{1}$, $f_{cnn}^{3}$, and $f_{cnn}^{5}$. These features are then aggregated through point-wise addition. Subsequently, we perform channel compression on both the Mamba and CNN features using $1 \times 1$ convolutional kernels, obtaining feature maps of size $H \times W \times 1$, denoted as $f_{mamba}^{pos}$ and $f_{cnn}^{pos}$. By spatially adding these features together, we obtain the position alignment feature map $f_{fuse}^{pos}$. To determine the importance of each position, we employ a sigmoid function to compute weights at each pixel position, which are then multiplied with the Mamba features to yield the final position alignment output $f_{align}^{pos}$. Like the channel alignment module, we incorporate a position enhancement submodule to emphasize the positional relevance of OWTs in remote sensing images. Finally, we fuse the outputs of the channel and position alignment modules using a $1 \times 1$ convolution, generating an enhanced feature representation that incorporates both channel and positional cues.
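Similarly, the positional alignment path can be sketched as follows, assuming 3 × 3 dilated convolutions with dilations 1, 3, and 5 and 1 × 1 channel-compression layers as described above; the position enhancement submodule and the final 1 × 1 fusion are omitted for brevity, and all names are illustrative.

```python
# Hedged sketch of positional alignment: multi-dilation context aggregation on the CNN
# branch, channel compression to single-channel spatial maps, and sigmoid position
# weighting of the Mamba feature. Padding choices are assumptions.
import torch
import torch.nn as nn

class PositionAlignSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5)
        )
        self.compress_mamba = nn.Conv2d(channels, 1, 1)    # -> H x W x 1 spatial map
        self.compress_cnn = nn.Conv2d(channels, 1, 1)

    def forward(self, f_mamba: torch.Tensor, f_cnn: torch.Tensor) -> torch.Tensor:
        # Enlarge the CNN receptive field before fusion (dilations 1, 3, 5, summed).
        f_cnn_ms = sum(conv(f_cnn) for conv in self.dilated)
        f_fuse_pos = self.compress_mamba(f_mamba) + self.compress_cnn(f_cnn_ms)
        # Position-aware weighting of the Mamba feature.
        return torch.sigmoid(f_fuse_pos) * f_mamba
```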
As demonstrated in Figure 11, convolutional neural networks are observed to exhibit localized attention biases, misclassifying terrestrial linear features such as pale infrastructure corridors as potential OWTs due to their textural similarity to turbine arrays. Conversely, the Mamba architecture is engineered to prioritize global geospatial relationships, enabling robust differentiation between oceanic regions and coastal zones through large-scale pattern analysis of synthetic aperture radar (SAR) imagery. This global contextual awareness is leveraged to enhance maritime-specific target discrimination by hierarchically integrating Mamba-derived positional priors with CNN-extracted local texture descriptors.

3.5. Loss Functions

A multi-task loss $loss_{total}$ is introduced into our OWTDNet, formed as follows:

$$loss_{total} = w_{m} \cdot loss_{mamba} + w_{c} \cdot loss_{CNN}$$

Here, $loss_{mamba}$ represents the loss values in the Mamba branch, and the CNN branch's loss is denoted as $loss_{CNN}$. $w_{m}$ and $w_{c}$ are hyper-parameters used to balance the different loss weights, set to 1.0 and 0.2, respectively. The loss function for the Mamba and CNN branches comprises a box loss and a classification loss, defined by the following formulas:

$$loss_{mamba/CNN} = loss_{box} + loss_{cls}$$

$$loss_{cls} = \begin{cases} -\alpha \, (1 - y_{pred})^{\beta} \log(y_{pred}), & y_{GT} = 1 \\ -(1 - \alpha) \, y_{pred}^{\beta} \log(1 - y_{pred}), & y_{GT} = 0 \end{cases}$$

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

$$loss_{box} = 1 - \mathrm{IoU}(y_{pred}, y_{GT}) + \frac{\rho^{2}(y_{center}^{pred}, y_{center}^{GT})}{c^{2}} + \gamma v$$

Here, $loss_{box}$ refers to the box regression loss, and $loss_{cls}$ is the box classification loss. In this paper, we employ the CIoU loss for box regression and the focal loss for box classification. A detailed discussion regarding the loss functions is provided in Section 4.4.5. $\alpha$ and $\beta$ indicate the penalty factors of the focal loss, with values of 0.25 and 5, respectively. $c$ represents the diagonal length of the smallest box that encloses both the predicted box and the ground truth box. $y_{pred}$ and $y_{GT}$ represent the network predictions and the ground truths, respectively. $y_{center}^{pred}$ and $y_{center}^{GT}$ denote the center points of the predictions and the ground truths, respectively. The calculation formulas for the CIoU loss are as follows:

$$\rho^{2}\!\left(y_{center}^{pred} = (x_{1}, y_{1}),\ y_{center}^{GT} = (x_{2}, y_{2})\right) = (x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}$$

$$\gamma = \frac{v}{\left(1 - \mathrm{IoU}(y_{pred}, y_{GT})\right) + v}$$

$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w_{GT}}{h_{GT}} - \arctan\frac{w_{pred}}{h_{pred}} \right)^{2}$$

Here, $w_{GT}$ and $h_{GT}$ represent the width and height of the ground truth box, while $w_{pred}$ and $h_{pred}$ indicate the width and height of the predicted box. Additionally, the detection head of the CNN branch is only active during the training phase and is used to guide the gradient descent of the CNN network.
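A hedged sketch of this multi-task loss is given below. The focal term follows the equation above with α = 0.25 and β = 5, the branch weights follow w_m = 1.0 and w_c = 0.2, and the CIoU term is delegated to torchvision's complete_box_iou_loss (available in recent torchvision releases); the matching between predictions and ground truths is assumed to have been done elsewhere.

```python
# Illustrative sketch of the multi-task loss; names are chosen for clarity and the CIoU
# term relies on torchvision rather than a from-scratch implementation.
import torch
from torchvision.ops import complete_box_iou_loss

def focal_loss(y_pred: torch.Tensor, y_gt: torch.Tensor,
               alpha: float = 0.25, beta: float = 5.0) -> torch.Tensor:
    """Binary focal loss; y_pred are probabilities in (0, 1), y_gt in {0, 1}."""
    eps = 1e-7
    pos = -alpha * (1 - y_pred).pow(beta) * torch.log(y_pred + eps)        # y_GT = 1 branch
    neg = -(1 - alpha) * y_pred.pow(beta) * torch.log(1 - y_pred + eps)    # y_GT = 0 branch
    return torch.where(y_gt == 1, pos, neg).mean()

def branch_loss(pred_boxes, gt_boxes, pred_scores, gt_labels) -> torch.Tensor:
    """loss_box + loss_cls for one branch (Mamba or CNN), given matched pairs."""
    loss_box = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    loss_cls = focal_loss(pred_scores, gt_labels)
    return loss_box + loss_cls

def total_loss(loss_mamba: torch.Tensor, loss_cnn: torch.Tensor,
               w_m: float = 1.0, w_c: float = 0.2) -> torch.Tensor:
    return w_m * loss_mamba + w_c * loss_cnn
```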

4. Results

4.1. Satellite Datasets

For our OWTs monitoring system, we collected 276 satellite remote sensing images of ocean engineering construction near Shanghai, Zhejiang, and Fujian provinces in China from November 2022 to March 2024. These images were obtained using satellites named GaoFen-1 (GF1), GaoFen-2 (GF2), GaoFen-7 (GF7), ZiYuan-1 (ZY1), and ZiYuan-3 (ZY3).
To ensure the robustness and generalization of our analysis, we meticulously partitioned the dataset into the following subsets: A total of 180 images, corresponding to 294,300 patch images, were utilized as the training set. The remaining 96 remote sensing images, comprising 97,776 patches, were reserved for validation and testing purposes. The collection parameters and the corresponding latitude and longitude ranges of these satellites are shown in Table 2.

4.2. Experimental Setups

The hardware and software configurations used in this study are detailed in Table 3. To validate the efficacy of our method, we incorporated various object detection approaches for comparative analysis. The training configurations specific to these methods are presented in Table 4. All models used consistent training and test data to ensure the fairness of the comparison. In addition, the optimizer used in all models was SGD, and the weight decay value was 0.0005.

4.3. Main Results

In this section, we discuss the performance comparison across the different detection models, focusing on the following metrics: AP, speed, FLOPs, and GPU memory usage. For the AP metric, we employed three specific indicators: $AP_{50\text{-}95}$, $AP_{50}$, and $AP_{75}$.

Specifically, $AP_{50}$ and $AP_{75}$ represent the precision at IoU thresholds of at least 0.5 and 0.75, respectively.

$$AP_{50} = \frac{TP}{TP + FP}, \qquad \begin{cases} TP, & \mathrm{IoU}(Pred_{box}, GT_{box}) \geq 0.5 \\ FP, & \mathrm{IoU}(Pred_{box}, GT_{box}) < 0.5 \end{cases}$$

$$AP_{75} = \frac{TP}{TP + FP}, \qquad \begin{cases} TP, & \mathrm{IoU}(Pred_{box}, GT_{box}) \geq 0.75 \\ FP, & \mathrm{IoU}(Pred_{box}, GT_{box}) < 0.75 \end{cases}$$

$$AP_{50\text{-}95} = \frac{AP_{50} + AP_{55} + \cdots + AP_{95}}{10}$$

where TP indicates true positives, and FP indicates false positives. Similarly, $AP_{50\text{-}95}$ averages the precision over IoU thresholds from 0.5 to 0.95 in increments of 0.05. For the speed evaluation, we measured the inference time for both an individual patch image ($S_{patch}$) and an entire remote sensing image ($S_{rs}$):

$$S_{patch} = \frac{1}{K_{patch}} \sum_{i=1}^{K_{patch}} S_{patch}^{i}$$

$$S_{rs} = \frac{1}{K_{rs}} \sum_{j=1}^{K_{rs}} S_{patch}^{j}$$

where $K_{patch}$ is the number of patch images used for timing (set to 100 in this study), and $S_{patch}^{i}$ represents the prediction time for the $i$-th patch image. $K_{rs}$ is the number of remote sensing images counted (set to 10 in this study), and $S_{patch}^{j}$ represents the cumulative prediction time for all patch images within the $j$-th remote sensing image.
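For illustration, both metric families reduce to simple averages once the per-threshold AP values and per-patch timings are available; the small helpers below assume that a standard COCO-style evaluator supplies the ten AP values and that the model is callable on a single patch.

```python
# Small illustrative helpers for the evaluation metrics defined above; the per-threshold
# AP values themselves come from a standard COCO-style evaluator, not reproduced here.
import time
from statistics import mean

def ap_50_95(ap_values: list[float]) -> float:
    """Average the AP values measured at IoU thresholds 0.50, 0.55, ..., 0.95 (10 values)."""
    assert len(ap_values) == 10
    return sum(ap_values) / len(ap_values)

def mean_patch_time(model, patches) -> float:
    """S_patch: average per-patch inference time over the sampled patch images."""
    times = []
    for patch in patches:
        start = time.perf_counter()
        model(patch)
        times.append(time.perf_counter() - start)
    return mean(times)
```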
By introducing these metrics, we aimed to provide a comprehensive understanding of the advantages and disadvantages of the different detection models in the OWT detection task. Meanwhile, visualized comparison results are shown in Figure 12. Additionally, Figure 13 shows PR curves of our model at small, medium, and large scales.
Based on the comparison results presented in Table 5, DINO with the Swin-L backbone emerged as the top performer in terms of AP indicators, achieving 47.8%, closely followed by our proposed method at 47.1%. Regarding speed metrics, the YOLO series demonstrated the fastest performance, particularly YOLOv8 with the small backbone, which achieved an inference speed of 8.7 ms and 115 frames per second (FPS) at a resolution of 1024 × 1024. From the perspective of GPU memory usage, methods employing the Mamba approach exhibited advantages over those based on the Transformer architecture, with VMamba-Large only requiring 2.8 GB of GPU memory in the prediction phase.
Among CNN-based methods, the combination of the Cascade structure with the Swin backbone achieved the optimal performance, albeit with a decrease in speed metrics. The ConvNext model, a pure CNN method introduced in 2022, did not attain the highest AP but offered the best balance between performance and speed among all CNN-based detectors, achieving 33.6 FPS and 37.7% AP. Through the integration of various training techniques with attention mechanisms, YOLOv8 significantly improved its ability in OWTs detection. For instance, YOLOv8-xLarge achieved an AP of 39.5% and an FPS of 24. However, the YOLO models still struggled to achieve satisfactory detection performance for remote sensing images, limiting their widespread applicability in practical systems.
Exemplified by DINO and Cascade under the R50 and Swin-L configurations, the integration of Transformer methods in both encoding and decoding stages significantly improved the AP metric for DINO, achieving a 14.6% increase. However, it is worth noting that incorporating a Transformer, such as DINO with the Swin-L backbone, significantly reduced the prediction speed. Meanwhile, the Mamba method achieved long-dependency feature capture, while maintaining lower computational costs. Compared to DINO with Swin-L, VMamba-Large offered a fourfold increase in speed. Furthermore, our proposed OWTDNet lagged behind the DINO network by only 0.7% in AP, while being three times faster, achieving 15 FPS.
In comparison to the original VMamba method, OWTDNet demonstrated improvements across all precision metrics with small, medium, and large backbones, without incurring substantial speed costs. Specifically, using the Large backbone as an example, OWTDNet achieved a 2.6% improvement in AP metrics compared to the original VMamba, while experiencing only an 8.5 ms reduction in speed. This underscores the efficacy of our lightweight encoding and alignment structure, which facilitated local feature modeling and feature fusion with minimal computational overhead. A more intuitive comparison of results across the different methods is shown in Figure 14.
In Figure 12, red boxes represent the manually labeled OWTs in the original image. In scenario 3, both ConvNeXt and YOLOv8 exhibited notable false positives, incorrectly identifying the coastline as an OWT. This observation underscores the limitations of CNN architectures in capturing global feature relationships in remote sensing images. In contrast, DETR and VMamba enhanced feature learning robustness by effectively extracting long-dependency relations, thereby mitigating interference caused by similarities on land. However, scenarios 4 and 5 presented challenges for the Transformer and VMamba methods due to the sparse nature of the OWT features, leading to missed detections. In contrast, the CNN-based methods demonstrated strong performance in these scenarios. This demonstrates that while Transformer methods excel in considering the impact of long-dependency features on detection performance, they may overlook the contribution of local features in detecting weak and small targets. Therefore, effectively integrating these two feature modeling approaches is crucial for enhancing the detection performance in remote sensing monitoring tasks.

4.4. Ablation Study

4.4.1. Discussion of Splitting Strategy

In Section 3, we discussed the theoretical merits and drawbacks of two image splitting strategies. Here, our objective is to analyze the practical implications of these strategies on detection performance, drawing insights and conclusions from the experimental results presented in Table 6.
In Table 6, several discoveries and conclusions can be drawn. Firstly, the results do not fully align with the theoretical analysis presented in Section 3.1, as most models exhibited a decline in the AP indicator when applying Strategy 1. Notably, the YOLOv8 model showed the most significant drop, decreasing by 1.3%, from 39.5% to 38.2%. The visualization results presented in Figure 15 reveal that the NMS method struggled with the box merging process for small targets, demonstrating poor robustness. Specifically, two prediction boxes (refer to A and B) differing by only one pixel had an IoU value of only 0.33. Since this IoU value did not exceed the NMS threshold of 0.6, the two target boxes could not be merged and eliminated, leading to redundant detections in results. The experimental results demonstrated an improvement in AR metrics across all evaluated models when employing Strategy 1. This systematic enhancement confirms that the overlapping sampling methodology effectively boosts model recall rates through redundant boundary region detection. However, this performance gain comes at the expense of increased false positive instances and subsequent post-processing requirements.
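The reported failure case can be reproduced with a short numerical check. The exact box size in Figure 15 is not stated, so a 2 × 2-pixel detection shifted horizontally by one pixel is assumed here; it yields an IoU of roughly 0.33, below the 0.6 NMS threshold, so the duplicate box survives.

```python
# Minimal numerical check of the NMS failure mode discussed above, under the assumption
# of a 2 x 2-pixel detection shifted by one pixel.
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

box_a = [0, 0, 2, 2]          # 2 x 2-pixel detection
box_b = [1, 0, 3, 2]          # the same box shifted right by one pixel
print(iou(box_a, box_b))      # 0.333..., below the NMS threshold of 0.6, so not merged
```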

4.4.2. Discussion of Feature Alignment

Multi-branch architectures have been widely used in various fields, such as CNNs and Transformers or Mamba and CNNs. These techniques exhibit higher performance than a single encoding structure. However, a common approach adopted by these methods involves directly fusing features, which overlooks the huge differences between CNN, Transformer, and Mamba features. In this section, we delve into the significance of feature alignment modules in bridging the gap between these different methodologies. To this end, we have designed several distinct encoding structures, as depicted in Figure 16, that aim to illustrate this issue.
In Figure 16, the design strategies for different encoding structures can be categorized into two scenarios: the integration of CNNs and Transformer, and the integration of CNNs and VMamba. Focusing on the VMamba integration, we devised four architectures. Specifically, strategies 2 and 4 abstain from any form of feature fusion during the encoding phase, only integrating CNN and Mamba features in the final encoding layer through a concatenate operation or our alignment module. Conversely, strategies 6 and 8 employ concatenation or alignment for feature fusion between different branches at each encoding stage. The comparative results derived from these strategies are presented in Table 7.
Firstly, feature fusion during the encoding stage in a multi-branch structure can significantly enhance OWTs detection performance. Taking VMamba as an example, Strategy 8 demonstrated a 1.9% improvement in AP metrics compared to Strategy 4. Additionally, the Align module proposed in this paper outperformed the direct concatenation of different types of features, obtaining higher AP metrics. By comparing Strategy 1 and Strategy 3, as well as Strategy 5 and Strategy 7, within the same model structure, the network utilizing the Align structure achieved AP improvements of 1.1% and 1.3%, respectively. This indicates that the feature disparities existing between different methods damage the model's ability under a direct fusion approach.
The feature alignment module introduced in this paper addresses these differences from both channel and position dimensions, thereby enhancing the effectiveness of the feature fusion process. Furthermore, when comparing the performance of VMamba and Transformer models in the OWTs detection task, it is evident that the proposed Align module can work effectively in both approaches. Specifically, comparing strategies 6 and 8, the algorithm achieved a 1.4% improvement in the AP metric.
Among various CNN-based methods, the B-Mv3 proposed in this paper exhibited performance comparable to R50. In strategy 8, the R50 method only marginally improved the AP metric by 0.4% compared to B-Mv3, while sacrificing a speed of 11.6 ms per patch image. Figure 17 clearly demonstrates that regardless of using concatenation or alignment methods, information fusion during the encoding stage effectively enhanced the confidence score of OWTs in the detection results.
Finally, from the perspective of efficiency, the VMamba-based upper branch demonstrated superior cost–benefit performance. Comparative analysis between Strategies 7 and 8 reveals that, when integrated with the R50 backbone architecture, VMamba not only enhanced AP metrics but also reduced the patch processing time from 114.9 ms to 77.9 ms, representing a 32.2% acceleration. Meanwhile, the proposed B-Mv3 architecture further improved the processing efficiency by reducing the patch time by 11.6 ms compared to the R50 structure. For large-scale remote sensing images (40,000 × 40,000 pixels), processing 1520 patch images monthly, the detection time decreased from 118.4 s (R50) to 100 s (B-Mv3), achieving a 15.5% speed improvement. Notably, the Swin-R50 combination under Strategy 7 required 174.6 s for equivalent processing. These results demonstrate that our proposed framework achieves dual optimization of detection performance and computational efficiency, particularly benefiting resource-intensive remote sensing applications.

4.4.3. Discussion of Blurring Convolution

In the preceding section, we examined the influence of different CNN methods on the model’s performance. In this section, we will discuss the role of Blurring Convolution in various CNN architectures and its effect on the OWTs detection task. To achieve this, we conducted the following comparative experiments: (1) integrating the Blurring Convolution module into the R50 architecture; (2) removing the Blurring Convolution module from the MobileNetv3 (Mv3) structure; (3) applying these two modified structures to different object detection frameworks; (4) omitting the data augmentation technique of random shift during the training phase; and (5) randomly introducing up, down, left, and right shifts to the testing image during the prediction stage. The comparative results of Experiments 1 to 4 are presented in Table 8, while the results of Experiment 5 are visualized in Figure 18.
As depicted in Table 8, the incorporation of Blurring Convolution significantly enhanced the performance of OWTs detection in both Mv3 and R50 architectures. Specifically, under the Cascade R-CNN framework, this approach resulted in an average improvement of 3.3% in the AP metric. Compared to traditional data augmentation techniques, the performance gains achieved by Blurring Convolution were more substantial and consistent. For instance, in the context of Cascade R-CNN utilizing the Mv3 backbone, Blurring Convolution improved the AP metric by 4.3% compared to solely utilizing image shift augmentation.
However, it is noteworthy that this approach did not yield performance gains within the DINO framework. Specifically, the presence or absence of Blurring Convolution only marginally influenced the AP indicator, with a mere 0.1% difference observed in DINO with R50. This suggests that the transformer encoding method within the DINO framework can achieve a similar effect to Blurring Convolution, imparting translational robustness to CNN features. Additionally, when comparing Mv3 and R50, while R50 with Blurring Convolution achieved higher AP metrics, it did so at the cost of compromised speed. In summary, this article proposes a novel encoding branch that combines Blurring Convolution with the Mv3 structure to enhance the stability of OWTs detection under sliding windows, addressing both performance and cost perspectives.

4.4.4. Discussion of Feature Enhance in Alignment Module

In the alignment module, we have designed two distinct processes: alignment and enhancement. The alignment process aims to facilitate the interaction of feature fusion between different branches. Meanwhile, the enhancement process focuses on amplifying the feature responses related to OWTs. In this section, we will discuss the effect of feature enhancement in the context of the detection task.
From the results presented in Table 9, we can see that both channel enhancement and position enhancement are crucial in the OWTs detection task. Eliminating either of these two processes results in a significant reduction in performance. Specifically, channel enhancement had a more profound impact on the AP indicators compared to position enhancement. This can be attributed to the fact that channel enhancement primarily aims to boost the response of OWTs in the channel dimension, which has a more pronounced effect on targets with weaker features. In contrast, position enhancement focuses on the spatial information of OWTs, crucial for improving resilience against interference from similar objects on non-sea surfaces.
In Figure 19, we have enlarged the area of interest using a dashed box to facilitate reading, and the red dashed box marks the OWTs. In scenarios where OWT features are weak, methods that incorporate channel enhancement demonstrated improved detection and achieved higher scores. On the other hand, in scenarios involving interference from similar targets, methods utilizing position information effectively distinguished OWTs from white contour objects on land, thereby reducing false detections.

4.4.5. Discussion of Loss Functions

In the domain of object detection tasks, numerous variations of loss functions have been proposed, each designed to address specific challenges. Drawing upon insights from the introductory chapter, this section aims to conduct a comparative analysis of these various loss functions within the context of the OWTs detection task. This analysis seeks to identify the most suitable loss function tailored to the marine monitoring application domain.
As shown in Table 10, the focal loss approach outperformed cross-entropy across all methods from the perspective of box classification loss. In the Cascade network, focal loss enhanced the AP metric by an average of 2.32%. This enhancement was achieved through a penalty factor that intensified the network’s focus on challenging samples, thereby improving performance on difficult instances during training, without disproportionately favoring easier examples.
As shown in Figure 20, the remote sensing images presented varying levels of difficulty in detecting OWTs. In scenarios where the OWTs exhibited clear features and minimal interference, detection was straightforward. However, in complex scenarios influenced by similar objects, OWTs may exhibit weaker characteristics, leading to misclassification or confusion with other categories. The original cross-entropy classification method treats all samples equally during training, which can result in unsatisfactory prediction outcomes despite a decreasing classification loss. For example, as depicted in Figure 20, the Cascade method misidentified an interfering target as a boat and incorrectly classified an OWT target as a boat as well. In contrast, integrating focal loss significantly enhanced the model's robustness in handling complex scenarios. This comparison underscores the critical role of focal loss in improving classification accuracy in remote sensing detection tasks.
In contrast to classification loss functions, box regression loss functions exhibit notable variations, particularly between IoU-based methods and coordinate-based methods like smooth L1. Initially popular in box regression tasks, smooth L1’s poor robustness with bounding boxes has led to its gradual replacement by IoU-based methods. As detailed in Table 10, IoU-based methods consistently outperform smooth L1. Among IoU-based methods, CIoU stands out as the most effective, while the original IoU method demonstrates lower performance. In the context of OWTs detection tasks, focal loss significantly enhances classification accuracy, while the differences among IoU methods in the box regression task are relatively minor. Figure 21 provides a visual comparison of various IoU methods, highlighting their respective strengths. Based on the experimental results and discussions, focal loss was selected for classification tasks, while CIoU was chosen as the regression loss function for our detection model.

5. Conclusions

In this paper, we addressed the challenges encountered in detecting OWTs in remote sensing images by proposing a novel OWTs detection framework called OWTDNet. OWTDNet integrates Mamba’s ability to capture long-range relations with a CNN’s proficiency in learning local features, enhancing both the detection performance and speed.
Firstly, to mitigate the sensitivity of CNN structures to translation variations under sliding windows, we introduced a novel encoding network named B-Mv3. B-Mv3 replaces the standard convolution operation with blurring convolution, effectively enhancing the robustness of CNN features to translation changes, and improving the stability of predictions. Given the significant differences in modeling between Mamba and CNN features, we proposed a novel feature alignment module. This module addresses these disparities from both channel and positional dimensions. In the channel dimension, discrepancies are mitigated through average pooling and a channel attention mechanism. For positional alignment, we initially enhanced the receptive field of CNN features using multi-scale dilated convolution. Subsequently, a positional attention mechanism is introduced to fuse responses at the same position. Additionally, to tackle the challenge of missed detections caused by the weak features of OWTs, we incorporate a feature enhancement process grounded in the alignment module. Specifically, we employ max pooling to determine the maximum response across both channel and positional dimensions. We then use sigmoid activation to amplify the weight of this response, thereby enhancing the network’s sensitivity to the targets of interest. These combined advancements yielded compelling results, showcasing that our proposed OWTDNet achieved a 2.6% increase in AP compared to the original VMamba approach. Furthermore, OWTDNet demonstrated a threefold improvement in running speed over Transformer models (DINO). These performance enhancements validated the successful application of OWTDNet in a real-world offshore engineering monitoring system.
Nevertheless, OWTDNet still has limitations in its current form. Primarily, training OWTDNet requires a substantial number of labeled samples, which raises the cost of sample collection and annotation. In addition, while OWTDNet surpasses Transformers in speed, it still falls behind CNN-based detectors. Recently, researchers [54,55] have proposed distillation techniques to transfer the capabilities of Transformers to CNN architectures; these approaches have shown notable improvements in both performance and speed while reducing the dependence on extensive training samples. In future work, we intend to explore distillation to facilitate knowledge transfer between the Mamba method and CNN structures.
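As background for this direction, the sketch below shows the generic soft-label distillation loss that such transfer schemes build on; it is a minimal illustration under an assumed temperature value, not the specific methods of [54,55].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label knowledge distillation; the temperature is an illustrative choice."""
    # The teacher provides softened class probabilities; the student matches them via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```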

Author Contributions

Methodology, P.S. and Y.Z.; software, Y.Z. and L.Z.; validation, Z.X. and J.Y.; formal analysis, P.S. and L.L.; data curation, S.L.; writing—original draft preparation, L.Z.; writing—review and editing, P.S.; project administration, P.S. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the Longyuan Electric Power Technology Innovation Project of China under Grant LYX-2025-07.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the editors and the reviewers for their valuable suggestions.

Conflicts of Interest

Authors Pengcheng Sha, Sujie Lu, Zongjie Xu, and Jianhai Yu were employed by Longyuan Qidong Wind Power Generation Co., Ltd. Author Lei Li was employed by Shanghai Enshu Data Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhang, J.; Jia, X.; Hu, J.; Tan, K. Moving vehicle detection for remote sensing video surveillance with nonstationary satellite platform. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5185–5198. [Google Scholar] [CrossRef]
  2. Melillos, G.; Themistocleous, K.; Danezis, C.; Michaelides, S.; Hadjimitsis, D.G.; Jacobsen, S.; Tings, B. The use of remote sensing for maritime surveillance for security and safety in Cyprus. In Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXV; SPIE: Washington, DC, USA, 2020; Volume 11418, pp. 141–152. [Google Scholar]
  3. Zhou, H.; Yuan, X.; Zhou, H.; Shen, H.; Ma, L.; Sun, L.; Sun, H. Surveillance of pine wilt disease by high resolution satellite. J. For. Res. 2022, 33, 1401–1408. [Google Scholar] [CrossRef]
  4. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  5. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  6. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
  7. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  8. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 1829–1838. [Google Scholar]
  9. Li, X.; Deng, J.; Fang, Y. Few-shot object detection on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601614. [Google Scholar] [CrossRef]
  10. Lu, X.; Ji, J.; Xing, Z.; Miao, Q. Attention and feature fusion SSD for remote sensing object detection. IEEE Trans. Instrum. Meas. 2021, 70, 1010309. [Google Scholar] [CrossRef]
  11. Shivappriya, S.N.; Priyadarsini, M.J.P.; Stateczny, A.; Puttamadappa, C.; Parameshachari, B.D. Cascade object detection and remote sensing object detection method based on trainable activation function. Remote Sens. 2021, 13, 200. [Google Scholar] [CrossRef]
  12. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  13. Kaur, R.; Singh, S. A comprehensive review of object detection with deep learning. Digit. Signal Process. 2023, 132, 103812. [Google Scholar] [CrossRef]
  14. Amjoud, A.B.; Amrouch, M. Object detection using deep learning, CNNs and vision Transformers: A review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  15. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  17. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  18. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  20. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 11976–11986. [Google Scholar]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  23. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  25. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Xie, T.; Fang, J.; Michael, K.; Lorna; Abhiram, V.; et al. ultralytics/yolov5: v6.1—TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference; Zenodo: Geneva, Switzerland, 2022. [Google Scholar] [CrossRef]
  26. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 13039–13048. [Google Scholar]
  27. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  28. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  29. Wu, T.; Dong, Y. YOLO-SE: Improved YOLOv8 for remote sensing object detection and recognition. Appl. Sci. 2023, 13, 12977. [Google Scholar] [CrossRef]
  30. Park, M.H.; Choi, J.H.; Lee, W.J. Object detection for various types of vessels using the YOLO algorithm. J. Adv. Mar. Eng. Technol. 2024, 48, 81–88. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
  33. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  34. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  35. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  36. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  37. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  38. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  39. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 10–17 October 2021; pp. 3651–3660. [Google Scholar]
  40. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  41. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  42. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-Based YOLO For Object Detection. arXiv 2024, arXiv:2406.05835. [Google Scholar] [CrossRef]
  43. Zhou, W.; Kamata, S.I.; Wang, H.; Wong, M.S. Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification. arXiv 2024, arXiv:2405.12003. [Google Scholar] [CrossRef]
  44. Dong, W.; Zhu, H.; Lin, S.; Luo, X.; Shen, Y.; Liu, X.; Zhang, B. Fusion-mamba for cross-modality object detection. arXiv 2024, arXiv:2404.09146. [Google Scholar] [CrossRef]
  45. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  46. Xie, J.; Tian, T.; Hu, R.; Yang, X.; Xu, Y.; Zan, L. A Novel Detector for Wind Turbines in Wide-Ranging, Multi-Scene Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17725–17738. [Google Scholar] [CrossRef]
  47. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar] [CrossRef]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Volume 18, pp. 234–241. [Google Scholar]
  49. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  50. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  51. Zhang, R. Making convolutional networks shift-invariant again. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 7324–7334. [Google Scholar]
  52. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Adam, H. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  53. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  54. Zheng, K.; Chen, Y.; Wang, J.; Liu, Z.; Bao, S.; Zhan, J.; Shen, N. Enhancing Remote Sensing Semantic Segmentation Accuracy and Efficiency Through Transformer and Knowledge Distillation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4074–4092. [Google Scholar] [CrossRef]
  55. Wang, Y.; Zhang, T.; Zhao, L.; Hu, L.; Wang, Z.; Niu, Z.; Cheng, P.; Chen, K.; Zeng, X.; Wang, Z. Ringmo-lite: A remote sensing lightweight network with cnn-transformer hybrid framework. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608420. [Google Scholar] [CrossRef]
Figure 1. OWTs in remote sensing image.
Figure 2. The architecture of OWTDNet.
Figure 3. Remote sensing image splitting strategies.
Figure 4. Comparison of existing object detection algorithms.
Figure 5. Difference between Transformer and Mamba in modeling relations.
Figure 6. The architecture of the VSS block.
Figure 7. The limitations of the Mamba model.
Figure 8. The architecture of Blurring-MobileNetv3.
Figure 9. Overview of the feature alignment and enhancement module.
Figure 10. Illustration of the channel alignment module.
Figure 11. Illustration of the position alignment module.
Figure 12. Visualization of different detection models.
Figure 13. PR curves of OWTDNet at small, medium, and large scales.
Figure 14. Visual comparison of different detection models.
Figure 15. Illustration of NMS in post-processing.
Figure 16. Different encoding structures.
Figure 17. Detection results under different encoding structures.
Figure 18. Illustration of prediction robustness under the image shift operation.
Figure 19. Visual detection results with and without channel and position enhancement.
Figure 20. Comparison of detection results using different classification losses in the Cascade R-CNN network.
Figure 21. Comparison of detection results using different box regression losses.
Table 1. The advantages and disadvantages of different splitting strategies.
Splitting Strategy | Advantages | Disadvantages
1 | (1) Good for small targets like wind power facilities. (2) Ensures no missed detections caused by the splitting strategy. | (1) Duplicate detections in the gap will lead to performance decline. (2) Different patch images contain the same targets, so deduplication post-processing is required, increasing the system's overhead.
2 | (1) Improves the system's speed and efficacy. (2) No post-processing stage, lower deployment costs. | (1) Missed detections caused by the splitting strategy.
Table 2. Captured parameters of different satellites.
Satellite Name | Spatial Resolution | Longitude Range | Latitude Range
GF1 | Full color with 2 m | 117.5°~122.9° | 23.5°~34.7°
GF2 | Full color with 1 m | 119.5°~120.1° | 33.5°~35.3°
GF7 | Full color with 0.8 m | 117.4°~119.8° | 23.9°~34.6°
ZY1 | Full color with 5 m | 118.6°~122.6° | 24.2°~35.0°
ZY3 | Full color with 2.1 m | 118.0°~121.8° | 24.7°~34.6°
Table 3. Hardware and software configurations.
Platform | Name | Description
Train Hardware | CPU | Intel Xeon 6330 2.0 GHz
Train Hardware | GPU | NVIDIA RTX4090 48 GB × 4
Train Hardware | Memory | DDR5 6800 MHz 64 GB
Deploy Hardware | CPU | Intel 12700KF 3.6 GHz
Deploy Hardware | GPU | NVIDIA RTX3090 24 GB × 1
Deploy Hardware | Memory | DDR4 3200 MHz 64 GB
Software | Anaconda | Version 4.12.0
Software | Python | Version 3.8.5
Software | CUDA | Version 12.1
Software | cuDNN | Version 8.9.3
Software | Pytorch | Version 2.1.0
Software | MMCV | Version 2.1.0
Software | ultralytics | Version 8.0.6
Software | VMamba | Version 20240525
Software | MMDetection | Version 3.3.0
Table 4. Training configurations of different methods.
Method Category | Method Name | Proposed Year | Learning Rate | Train Epochs | Batch Size | Augmentations
CNN | Faster R-CNN | 2015 | 0.001 | 30 | 32 | RandomResize, RandomCrop, RandomFlip
CNN | RetinaNet | 2017 | 0.005 | 30 | 32 | RandomResize, RandomCrop, RandomFlip
CNN | Cascade R-CNN [53] | 2018 | 0.002 | 30 | 32 | RandomResize, RandomCrop, RandomFlip
CNN | CenterNet | 2019 | 0.001 | 30 | 32 | RandomResize, RandomCrop, RandomFlip
CNN | ConvNext | 2022 | 0.0001 | 30 | 32 | RandomResize, RandomCrop, RandomFlip
Transformer | DETR | 2020 | 0.0001 | 45 | 8 | RandomResize, RandomCrop, RandomFlip
Transformer | Deformable DETR | 2021 | 0.0001 | 45 | 8 | RandomResize, RandomCrop, RandomFlip
Transformer | Conditional DETR | 2021 | 0.0001 | 45 | 8 | RandomResize, RandomCrop, RandomFlip
Transformer | DAB-DETR | 2022 | 0.0001 | 45 | 8 | RandomResize, RandomCrop, RandomFlip
Transformer | DINO | 2023 | 0.0001 | 45 | 8 | RandomResize, RandomCrop, RandomFlip
YOLO | YOLOv3 | 2018 | 0.001 | 50 | 64 | RandomResize, RandomCrop, RandomFlip
YOLO | YOLOF | 2021 | 0.001 | 50 | 64 | RandomResize, RandomCrop, RandomFlip
YOLO | YOLOX | 2021 | 0.001 | 50 | 64 | RandomResize, RandomCrop, RandomFlip
YOLO | YOLOv8 | 2023 | 0.001 | 50 | 64 | RandomResize, RandomCrop, RandomFlip
Mamba | VMamba | 2024 | 0.0001 | 35 | 12 | RandomResize, RandomCrop, RandomFlip
Mamba | Mamba-YOLO | 2025 | 0.01 | 35 | 12 | RandomResize, RandomCrop, RandomFlip
OWTDNet | CNN+Mamba | - | 0.0001 | 35 | 12 | RandomResize, RandomCrop, RandomFlip
Table 5. Comparison of different OWT detection methods.
Method Name | Backbone | Params/MB | AP50-95/% | AP50/% | AP75/% | FLOPs/T | S_patch/ms | S_rs/s | GPU/GB
Faster R-CNN | R50 | 80.29 | 28.3 ± 0.2 | 42.3 ± 0.27 | 34.9 ± 0.21 | 0.35 | 51.43 | 99.4 | 4.8
Faster R-CNN | R101 | 99.28 | 32.7 ± 0.17 | 50.9 ± 0.22 | 36.8 ± 0.19 | 0.43 | 55.6 | 107.3 | 5.3
RetinaNet | R50 | 37.96 | 31.3 ± 0.26 | 45.7 ± 0.27 | 34.2 ± 0.23 | 0.26 | 22.7 | 43.8 | 2.8
RetinaNet | R101 | 56.96 | 33.6 ± 0.22 | 51.3 ± 0.24 | 37.5 ± 0.22 | 0.32 | 29.5 | 57.1 | 3.8
RetinaNet | Swin-T | 38.47 | 36.3 ± 0.23 | 55.1 ± 0.26 | 42.0 ± 0.27 | 0.31 | 48.3 | 92.8 | 4.4
RetinaNet | Swin-S | 59.77 | 39.7 ± 0.19 | 59.4 ± 0.22 | 46.5 ± 0.24 | 0.44 | 68.9 | 132.9 | 4.7
Cascade R-CNN | R50 | 69.39 | 33.2 ± 0.18 | 50.7 ± 0.22 | 36.3 ± 0.21 | 0.24 | 35.3 | 64.2 | 3.3
Cascade R-CNN | R101 | 88.39 | 34.7 ± 0.15 | 52.8 ± 0.21 | 38.4 ± 0.19 | 0.32 | 38.9 | 75.7 | 4.3
Cascade R-CNN | Swin-T | 71.93 | 37.5 ± 0.21 | 56.4 ± 0.26 | 40.7 ± 0.24 | 0.29 | 52.6 | 101.6 | 3.6
Cascade R-CNN | Swin-S | 93.25 | 41.7 ± 0.16 | 62.2 ± 0.19 | 48.3 ± 0.16 | 0.37 | 70.2 | 135.7 | 3.9
CenterNet | R50 | 32.29 | 30.4 ± 0.27 | 47.4 ± 0.27 | 33.6 ± 0.29 | 0.21 | 21.3 | 41.1 | 2.9
CenterNet | R101 | 51.28 | 35.1 ± 0.22 | 53.5 ± 0.25 | 39.2 ± 0.23 | 0.29 | 26.4 | 50.9 | 3.9
ConvNeXt | Tiny | 48.09 | 37.7 ± 0.21 | 57.3 ± 0.21 | 41.7 ± 0.21 | 0.27 | 29.7 | 57.3 | 4.6
ConvNeXt | Small | 69.73 | 38.6 ± 0.15 | 58.1 ± 0.19 | 42.1 ± 0.22 | 0.36 | 38.1 | 73.5 | 5.3
DETR | R50 | 41.57 | 36.4 ± 0.19 | 52.0 ± 0.22 | 38.1 ± 0.18 | 0.09 | 26.2 | 50.5 | 3.9
DETR | R101 | 60.56 | 39.8 ± 0.17 | 56.7 ± 0.21 | 43.2 ± 0.19 | 0.18 | 35.1 | 67.8 | 4.3
Deformable DETR | R50 | 40.11 | 35.8 ± 0.24 | 53.9 ± 0.23 | 40.4 ± 0.22 | 0.2 | 37.2 | 71.8 | 4.5
Conditional DETR | R50 | 43.47 | 36.9 ± 0.22 | 56.1 ± 0.22 | 39.4 ± 0.24 | 0.11 | 35.3 | 68.1 | 4.1
DAB-DETR | R50 | 43.72 | 37.2 ± 0.25 | 55.7 ± 0.24 | 41.3 ± 0.26 | 0.11 | 38.4 | 74.2 | 4.2
DINO | R50 | 47.71 | 38.4 ± 0.21 | 56.2 ± 0.27 | 40.9 ± 0.18 | 0.28 | 51.3 | 99.6 | 4.6
DINO | Swin-L | 218.33 | 47.8 ± 0.14 | 67.8 ± 0.16 | 53.2 ± 0.12 | 0.49 | 212.8 | 411.3 | 5.7
YOLOv3 | DarkNet-53 | 61.95 | 29.7 ± 0.35 | 43.5 ± 0.42 | 34.3 ± 0.34 | 0.2 | 14.1 | 27.2 | 4.1
YOLOF | R50 | 44.16 | 33.7 ± 0.27 | 51.7 ± 0.35 | 36.6 ± 0.29 | 0.11 | 15.4 | 29.7 | 2.8
YOLOX | Tiny | 5.06 | 31.2 ± 0.31 | 49.6 ± 0.32 | 35.2 ± 0.27 | 0.02 | 11.2 | 21.6 | 2.5
YOLOX | Small | 8.97 | 33.1 ± 0.29 | 51.8 ± 0.33 | 35.6 ± 0.28 | 0.03 | 13.1 | 25.3 | 2.8
YOLOX | Large | 54.21 | 34.8 ± 0.31 | 53.3 ± 0.26 | 38.7 ± 0.24 | 0.19 | 19.8 | 38.2 | 5.4
YOLOX | xLarge | 99.07 | 39.2 ± 0.25 | 58.2 ± 0.22 | 43.1 ± 0.22 | 0.36 | 26.7 | 51.5 | 7.5
YOLOv8 | Small | 11.2 | 31.9 ± 0.31 | 47.1 ± 0.29 | 37.6 ± 0.26 | 0.03 | 8.7 | 16.8 | 1.6
YOLOv8 | Medium | 25.9 | 35.3 ± 0.33 | 54.2 ± 0.26 | 39.2 ± 0.25 | 0.08 | 16.9 | 32.6 | 1.8
YOLOv8 | Large | 43.7 | 37.7 ± 0.29 | 56.5 ± 0.22 | 41.4 ± 0.29 | 0.17 | 26.3 | 50.7 | 2.1
YOLOv8 | xLarge | 68.2 | 39.5 ± 0.27 | 59.1 ± 0.24 | 45.3 ± 0.23 | 0.26 | 41.6 | 80.3 | 3.3
VMamba | Tiny | 53.86 | 38.3 ± 0.22 | 57.6 ± 0.23 | 42.3 ± 0.22 | 0.16 | 35.2 | 68.2 | 2.2
VMamba | Small | 68.18 | 43.3 ± 0.17 | 62.6 ± 0.24 | 48.7 ± 0.17 | 0.19 | 41.3 | 79.8 | 2.4
VMamba | Large | 82.47 | 44.5 ± 0.16 | 66.4 ± 0.19 | 51.6 ± 0.14 | 0.21 | 57.8 | 110.4 | 2.8
OWTDNet | Tiny | 66.23 | 40.7 ± 0.18 | 61.5 ± 0.23 | 46.4 ± 0.19 | 0.22 | 46.3 | 89.4 | 3.7
OWTDNet | Small | 82.68 | 45.6 ± 0.15 | 68.7 ± 0.24 | 50.7 ± 0.16 | 0.25 | 50.6 | 97.6 | 4.2
OWTDNet | Large | 97.58 | 47.1 ± 0.16 | 67.1 ± 0.18 | 52.4 ± 0.12 | 0.28 | 66.3 | 127.7 | 4.5
Table 6. Performance of different splitting strategies.
Method Name | Backbone | Split | AP50-95/% | AR50-95/% | Pre-Process Time/s | Post-Process Time/s | S_rs/s
Cascade R-CNN | Swin-S | 1 | 41.4 | 0.74 | 5.1 | 3.8 | 144.6
Cascade R-CNN | Swin-S | 2 | 41.7 | 0.71 | 4.5 | 0 | 140.0
DINO | Swin-L | 1 | 48.1 | 0.86 | 5.1 | 4.3 | 420.7
DINO | Swin-L | 2 | 47.8 | 0.82 | 4.5 | 0 | 415.8
YOLOv8 | xLarge | 1 | 38.2 | 0.71 | 5.1 | 4.6 | 89.2
YOLOv8 | xLarge | 2 | 39.5 | 0.65 | 4.5 | 0 | 84.8
VMamba | Large | 1 | 43.9 | 0.77 | 5.1 | 3.8 | 119.3
VMamba | Large | 2 | 44.5 | 0.74 | 4.5 | 0 | 114.9
OWTDNet | Large | 1 | 46.8 | 0.79 | 5.1 | 3.7 | 136.5
OWTDNet | Large | 2 | 47.1 | 0.76 | 4.5 | 0 | 132.2
Table 7. Performance comparison under different encoding structures.
Strategy | Upper Branch | Lower Branch | AP50-95/% | S_patch/ms
1 | Swin Transformer [53] | R50 | 44.8 | 89.6
1 | Swin Transformer [53] | B-Mv3 | 44.3 | 83.2
2 | VMamba | R50 | 45.2 | 60.7
2 | VMamba | B-Mv3 | 44.9 | 52.1
3 | Swin Transformer | R50 | 45.2 | 96.7
3 | Swin Transformer | B-Mv3 | 45.4 | 88.2
4 | VMamba | R50 | 45.3 | 63.2
4 | VMamba | B-Mv3 | 45.2 | 55.3
5 | Swin Transformer | R50 | 45.8 | 101.6
5 | Swin Transformer | B-Mv3 | 45.9 | 91.2
6 | VMamba | R50 | 45.5 | 68.4
6 | VMamba | B-Mv3 | 45.7 | 57.1
7 | Swin Transformer | R50 | 46.9 | 114.9
7 | Swin Transformer | B-Mv3 | 46.5 | 101.7
8 | VMamba | R50 | 47.5 | 77.9
8 | VMamba | B-Mv3 | 47.1 | 66.3
Table 8. Performance comparison under different CNN structures.
Method Name | Backbone | Blurring Convolution | Data Augmentation | AP50-95/% | S_patch/ms
Cascade R-CNN | R50 | ✓ | random shift | 35.1 | 39.3
Cascade R-CNN | R50 | ✓ | × | 34.6 | 39.3
Cascade R-CNN | R50 | × | random shift | 33.2 | 35.3
Cascade R-CNN | R50 | × | × | 31.6 | 35.3
Cascade R-CNN | Mv3 | ✓ | random shift | 34.3 | 26.1
Cascade R-CNN | Mv3 | ✓ | × | 33.8 | 26.1
Cascade R-CNN | Mv3 | × | random shift | 29.5 | 22.4
Cascade R-CNN | Mv3 | × | × | 28.7 | 22.4
DINO | R50 | ✓ | random shift | 38.6 | 55.7
DINO | R50 | ✓ | × | 38.3 | 55.7
DINO | R50 | × | random shift | 38.4 | 51.3
DINO | R50 | × | × | 38.2 | 51.3
DINO | Mv3 | ✓ | random shift | 37.7 | 44.2
DINO | Mv3 | ✓ | × | 37.5 | 44.2
DINO | Mv3 | × | random shift | 37.4 | 41.1
DINO | Mv3 | × | × | 37.1 | 41.1
OWTDNet | R50 | ✓ | random shift | 47.9 | 82.3
OWTDNet | R50 | ✓ | × | 48.1 | 82.3
OWTDNet | R50 | × | random shift | 47.5 | 77.9
OWTDNet | R50 | × | × | 46.8 | 77.9
OWTDNet | Mv3 | ✓ | random shift | 47.1 | 66.3
OWTDNet | Mv3 | ✓ | × | 46.9 | 66.3
OWTDNet | Mv3 | × | random shift | 46.7 | 61.2
OWTDNet | Mv3 | × | × | 46.3 | 61.2
Table 9. Performance comparison under different channel and position enhancements.
Backbone | Channel Enhance | Position Enhance | AP50-95/% | AP50/% | AP75/%
Tiny × × 39.159.144.6
× 40.261.145.9
× 39.860.345.2
40.761.546.4
Small × × 44.265.948.7
× 45.467.550.2
× 44.866.549.3
45.668.750.7
Large × × 45.766.551.8
× 46.866.949.8
× 46.266.252.1
47.167.152.4
Table 10. Performance comparison with different loss functions (AP50-95/%, by regression loss).
Method Name (Backbone) | Class Loss | Smooth L1 | IoU | GIoU | DIoU | CIoU
Cascade R-CNN (Swin-S) | Cross Entropy | 37.4 | 39.1 | 39.7 | 39.5 | 39.6
Cascade R-CNN (Swin-S) | Focal | 40.6 | 41.4 | 41.5 | 41.5 | 41.7
DINO (Swin-L) | Cross Entropy | 45.8 | 46.6 | 46.8 | 46.9 | 47.1
DINO (Swin-L) | Focal | 45.7 | 46.9 | 47.4 | 47.5 | 47.8
VMamba (Large) | Cross Entropy | 42.2 | 43.5 | 43.8 | 44.2 | 44.1
VMamba (Large) | Focal | 42.9 | 43.7 | 44.1 | 44.4 | 44.5
OWTDNet (Large) | Cross Entropy | 44.2 | 45.1 | 45.8 | 46.2 | 46.4
OWTDNet (Large) | Focal | 45.2 | 46.3 | 46.9 | 47.0 | 47.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
